With those guidelines in mind, I could start to outline what the process would look like:
- Fetch data from vendor (csv form)
- Clean the data
- Write the data to disk
The biggest decision I had to make at this stage was the format used to store the data. Based on the requirements listed above, I shortlisted a few formats that I thought would fit my purpose: csv, json, hdf5, and msgpack.
At this stage I wanted to get a feel for the performance of each of the options. To do that I created a simple dataset of 1M millisecond bars, i.e. 4M observations (four fields per bar).
import pandas as pd
import numpy as np
#create sizable dataset
n_obs = 1000000
idx = pd.date_range('2015-01-01', periods=n_obs, freq='L')
df = pd.DataFrame(np.random.randn(n_obs, 4), index=idx,
                  columns=["Open", "High", "Low", "Close"])
df.head()
%timeit df.to_csv("csv_format")
%timeit df.to_json("json_format")
%timeit df.to_hdf("hdf_format", "df", mode="w")
%timeit df.to_msgpack("msgpack_format")
%timeit pd.read_csv("csv_format")
%timeit pd.read_json("json_format")
%timeit pd.read_hdf("hdf_format", "df")
%timeit pd.read_msgpack("msgpack_format")
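Write and read speed is only half the picture; the footprint on disk matters too once the dataset grows. A quick way to compare is to write the same frame in each format and check the file sizes. The sketch below uses a smaller frame and only the two formats that need no optional dependency (HDF5 needs pytables, and msgpack support was removed from recent pandas); the same pattern applies to the others.

```python
import os

import numpy as np
import pandas as pd

# Recreate a smaller version of the benchmark frame (10k rows keeps this quick).
n_obs = 10_000
idx = pd.date_range("2015-01-01", periods=n_obs, freq="ms")
df = pd.DataFrame(np.random.randn(n_obs, 4), index=idx,
                  columns=["Open", "High", "Low", "Close"])

# Write each format, then compare the size on disk.
df.to_csv("csv_format")
df.to_json("json_format")

for path in ["csv_format", "json_format"]:
    print(path, os.path.getsize(path), "bytes")
```

In my experience json tends to come out largest here because the default orient repeats the timestamp key once per column, while csv writes it once per row.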
So at this point I have decided on a format. The question that remains to be answered is how to organize it. I was thinking of something like this:
/data
|-- Equities
|   |-- Stock (e.g. SPY, AAPL etc.)
|       |-- Metadata
|-- Forex
|   |-- Cross (e.g. USDCAD, USDJPY etc.)
|       |-- Metadata
|       |-- Aggregated data
|       |-- Exchanges (e.g. IdealPRO etc.)
|-- Futures
|   |-- Exchange (e.g. CME, ICE, etc.)
|       |-- Contract (e.g. ES, CL etc.)
|           |-- Metadata
|           |-- Continuously rolled contract
|           |-- Expiry (e.g. F6, G6, H6 etc.)
Personally, I am not too sure how best to do this. It seems it would be rather difficult to design a clean polymorphic API to access data laid out this way, since each asset class has a different depth and different metadata, but I can't find a better alternative.
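One way to paper over the asymmetry is a small lookup layer that maps each asset class to a path template, so callers never build paths by hand. A minimal sketch, where the template strings and the root constant are hypothetical, not part of any real API:

```python
# Hypothetical path templates mirroring the tree above; the keys and the
# ROOT constant are illustrative only.
ROOT = "/data"
TEMPLATES = {
    "equity": "Equities/{symbol}",
    "forex":  "Forex/{symbol}",
    "future": "Futures/{exchange}/{contract}/{expiry}",
}

def data_path(asset_class, **parts):
    """Build the on-disk path for an instrument from its asset class."""
    template = TEMPLATES[asset_class]
    return f"{ROOT}/{template.format(**parts)}"

print(data_path("equity", symbol="SPY"))
print(data_path("future", exchange="CME", contract="ES", expiry="H6"))
```

The caller only needs to know which keyword arguments its asset class requires; adding a new asset class is one new template entry rather than a new code path.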
I would like to hear what readers have come up with to address those problems. In addition to how you store and organize your data, I am very keen to hear how you would handle automating the creation of perpetual contracts without having to write a roll rule for each product by hand. This has proven to be a tricky task for me, and since I use those contracts a lot in my analysis I am looking for a good solution.
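For what it's worth, once a roll date is chosen (by whatever rule: fixed days before expiry, volume crossover, etc.) the splicing itself can be generic: back-adjust the expiring contract by the price gap between the two contracts on the roll date, then concatenate. A toy sketch with made-up closes for two consecutive expiries; the series, dates, and roll rule are all illustrative:

```python
import pandas as pd

# Two made-up daily closes for consecutive expiries (values illustrative).
near = pd.Series([100.0, 101.0, 102.0],
                 index=pd.date_range("2016-03-07", periods=3))
far = pd.Series([104.0, 105.0, 106.0],
                index=pd.date_range("2016-03-09", periods=3))

roll_date = pd.Timestamp("2016-03-09")  # chosen by some roll rule

# Back-adjust: shift the old contract by the gap between the two
# contracts on the roll date, then splice the series together.
gap = far.loc[roll_date] - near.loc[roll_date]
continuous = pd.concat([near[near.index < roll_date] + gap,
                        far[far.index >= roll_date]])
print(continuous)
```

The back-adjustment removes the artificial price jump at the roll, at the cost of the historical levels no longer matching the quoted prices; ratio adjustment is the usual alternative when that matters.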
Hopefully this discussion will be of interest to readers that manage their own data in-house.
I store the data in its original source format – whatever the broker/vendor gave me (csv, binary, json, …).
Then I import it into my own formats (gzipped msgpack/json/numpy/custom) depending on need.
Why store the original source format? Sometimes it changes, and if you don't store the data as you received it, your parser might fail to parse it and you'll end up with holes in your data. Also, sometimes I will want to extract more information from the source than I did when I originally imported it, or I may find errors in the parser, and having the original data allows a reparse.
Also, read this:
http://cyrille.rossant.net/moving-away-hdf5/
Thanks for commenting, Gosu. I should have been clearer: I always store the original data, for all the reasons you mentioned. It is always good to be able to go back to the original if something looks incorrect.
Also, thank you for the article, it was a very interesting read. The problem with formats is that for whatever format I pick, I can seemingly find an equivalent article saying it doesn't work. I think if there were a perfect data format there wouldn't be any need to keep coming up with new ones and everybody would be using the same one. I'll give HDF5 a shot and report back if I encounter some of the same issues.
QF