Give me good data, or give me death
With those guidelines in mind, I could start to outline what the process would look like:
- Fetch data from vendor (csv form)
- Clean the data
- Write the data on disk
The biggest decision I had to make at this stage was the format used to store the data. Based on the requirements listed above, I shortlisted a few formats that I thought would fit my purpose: csv, json, hdf5, and msgpack.
At this stage I wanted to get a feel for the performance of each of the options. In order to do that I created a simple dataset of 1M millisecond bars so 4M observations.
import pandas as pd import numpy as np #create sizable dataset n_obs = 1000000 idx = pd.date_range('2015-01-01', periods=n_obs, freq='L') df = pd.DataFrame(np.random.randn(n_obs,4), index=idx, columns=["Open", "High", "Low", "Close"]) df.head()
1 loops, best of 3: 8.34 s per loop
1 loops, best of 3: 606 ms per loop
%timeit df.to_hdf("hdf_format", "df", mode="w")
1 loops, best of 3: 102 ms per loop
10 loops, best of 3: 143 ms per loop
1 loops, best of 3: 971 ms per loop
1 loops, best of 3: 6.05 s per loop
%timeit pd.read_hdf("hdf_format", "df")
100 loops, best of 3: 11.3 ms per loop
10 loops, best of 3: 33.1 ms per loop
So at this point I have decided on a format. The question that remains to be answered is how to organize it. I was thinking of something like this:
/data |-- Equities |-- Stock (e.g. SPY, AAPL etc.) |-- Metadata |-- Forex |-- Cross (e.g. USDCAD, USDJPY etc.) |-- Metadata |-- Aggregated data |-- Exchanges (e.g. IdealPRO etc.) |-- Futures |-- Exchange (e.g. CME, ICE, etc.) |-- Contract (e.g. ES, CL etc.) |-- Metadata |-- Continuously rolled contract |-- Expiry (e.g. F6, G6, H6 etc.)
Personally not too sure how to best do this. It would seem to me that it would be rather difficult to design a clean polymorphic API to access the data with such a structure but I can’t seem to find a better way.
I would like to hear what readers have come up with to address those problems. In addition to how you store and organize your data, I am very keen to hear how you would handle automating the creation of perpetual contracts without having to manually write a rule for the roll of each product. This has proven to be a tricky task for me and since I use those contracts a lot in my analysis I am looking for a good solution.
Hopefully this discussion will be of interest to readers that manage their own data in-house.