Give me good data, or give me death

A good discussion not too long ago led me to start a revolution against some data management aspects of my technology stack. It is one of the areas where the decisions made will impact every project undertaken down the road. Time is one of our most valuable resources, and we need to minimize the amount of it spent dealing with data issues. Messy and/or hard-to-use data is the greatest drag I have encountered when trying to produce research.

I had to keep a couple of things in mind when deciding on a solution. First, I knew I did not want to depend on any database software. I also knew that I would not be the only one using the data and that, although I use Python, other potential users still don't know better and use R. The ideal solution would be as close to language agnostic as possible. Furthermore, I wanted a solution stable enough that I would not have to worry too much about backward compatibility in case of future upgrades.

With those guidelines in mind, I could start to outline what the process would look like:

  1. Fetch data from vendor (csv form)
  2. Clean the data
  3. Write the data on disk
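
For concreteness, here is a minimal sketch of that pipeline. The file name, column names, and cleaning steps are hypothetical and will vary by vendor:

import pandas as pd

# 1. Fetch: assume the vendor dump is a local csv with a timestamp column
raw = pd.read_csv("vendor_dump.csv", parse_dates=["timestamp"], index_col="timestamp")

# 2. Clean: drop duplicate timestamps and make sure the index is sorted
clean = raw[~raw.index.duplicated(keep="first")].sort_index()

# 3. Write: persist in whichever on-disk format ends up being chosen below (HDF5 in my case)
clean.to_hdf("clean_data.h5", "bars", mode="w")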

The biggest decision I had to make at this stage was the format used to store the data. Based on the requirements listed above, I shortlisted a few formats that I thought would fit my purpose: csv, json, hdf5, and msgpack.

At this stage I wanted to get a feel for the performance of each of the options. In order to do that, I created a simple dataset of 1M millisecond bars, i.e. 4M observations in total (1M rows × 4 OHLC columns).

In [1]:
import pandas as pd
import numpy as np

#create sizable dataset
n_obs = 1000000
idx = pd.date_range('2015-01-01', periods=n_obs, freq='L')
df = pd.DataFrame(np.random.randn(n_obs,4), index=idx, 
                  columns=["Open", "High", "Low", "Close"])
df.head()
Out[1]:
                             Open      High       Low     Close
2015-01-01 00:00:00.000 -0.317677 -0.535562 -0.506776  1.545908
2015-01-01 00:00:00.001  1.370362  1.549984 -0.720097 -0.653726
2015-01-01 00:00:00.002  0.109728  0.242318  1.375126 -0.509934
2015-01-01 00:00:00.003  0.661626  0.861293 -0.322655 -0.207168
2015-01-01 00:00:00.004 -0.587584 -0.980942  0.132920  0.963745
Let’s now see how they perform for writing.
In [2]:
%timeit df.to_csv("csv_format")
1 loops, best of 3: 8.34 s per loop
In [3]:
%timeit df.to_json("json_format")
1 loops, best of 3: 606 ms per loop
In [4]:
%timeit df.to_hdf("hdf_format", "df", mode="w")
1 loops, best of 3: 102 ms per loop
In [5]:
%timeit df.to_msgpack("msgpack_format")
10 loops, best of 3: 143 ms per loop
And finally let’s have a look at their read performance.
In [11]:
%timeit pd.read_csv("csv_format")
1 loops, best of 3: 971 ms per loop
In [10]:
%timeit pd.read_json("json_format")
1 loops, best of 3: 6.05 s per loop
In [8]:
%timeit pd.read_hdf("hdf_format", "df")
100 loops, best of 3: 11.3 ms per loop
In [9]:
%timeit pd.read_msgpack("msgpack_format")
10 loops, best of 3: 33.1 ms per loop
Based on that quick-and-dirty analysis, HDF seems to do better. Read performance is much more important to me, as the data should only be written once but will definitely be read more often than that. Please note that I did not intend to portray this test as an end-all proof; I simply wanted to look at what the options were and evaluate their relative performance.

Based on my preliminary results, including but not limited to this analysis, I elected to store the data in the HDF format, as it meets all my requirements and looks to be fairly fast, at least for medium-sized data. It should also enable the R homies to use it through the excellent rhdf5 library.
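
For reference, here is roughly what the write and read can look like once a few options are turned on. The file and key names are arbitrary examples, and format="table" (instead of the default fixed format used in the timings above) trades some write speed for on-disk queries and compression:

# reusing the df from above; "data.h5" and the "ohlc" key are just examples
df.to_hdf("data.h5", "ohlc", mode="w", format="table",
          complib="blosc", complevel=9)   # blosc assumes PyTables was built with it

# with the table format, a slice can be read back without loading the whole dataset
subset = pd.read_hdf("data.h5", "ohlc", where="index >= '2015-01-01 00:00:00.500'")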

So at this point I have decided on a format. The question that remains to be answered is how to organize it. I was thinking of something like this:

/data
|-- Equities
    |-- Stock (e.g. SPY, AAPL etc.)
        |-- Metadata
|-- Forex
    |-- Cross (e.g. USDCAD, USDJPY etc.)
        |-- Metadata
        |-- Aggregated data
        |-- Exchanges (e.g. IdealPRO etc.)
|-- Futures
    |-- Exchange (e.g. CME, ICE, etc.)
        |-- Contract (e.g. ES, CL etc.)
            |-- Metadata
            |-- Continuously rolled contract
            |-- Expiry (e.g. F6, G6, H6 etc.)
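
In HDF5 terms that tree maps directly onto group paths inside a single file. Below is a minimal sketch of how the layout might be populated with pandas' HDFStore; the file name, symbols, and the attribute-based metadata convention are assumptions on my part:

import pandas as pd

spy_bars = df       # stand-ins; in practice these would be the cleaned vendor data
usdcad_bars = df

with pd.HDFStore("data.h5") as store:
    # keys mirror the proposed hierarchy: /AssetClass/.../dataset
    store.put("/Equities/SPY/bars", spy_bars, format="table")
    store.put("/Forex/USDCAD/IdealPRO/bars", usdcad_bars, format="table")

    # small pieces of metadata can ride along as attributes rather than separate nodes
    store.get_storer("/Equities/SPY/bars").attrs.metadata = {"currency": "USD"}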

Personally, I am not too sure how best to do this. It seems to me that it would be rather difficult to design a clean polymorphic API to access the data with such a structure, but I can't seem to find a better way.
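
One thing that softens the problem is that HDF keys are just paths, so a thin helper can hide most of the hierarchy from callers. A sketch, with purely hypothetical key conventions:

import pandas as pd

def load(store_path, asset_class, *parts):
    """Read one dataset by joining path components into an HDF key,
    e.g. load("data.h5", "Futures", "CME", "ES", "continuous")."""
    key = "/" + "/".join((asset_class,) + parts)
    return pd.read_hdf(store_path, key)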

I would like to hear what readers have come up with to address those problems. In addition to how you store and organize your data, I am very keen to hear how you would handle automating the creation of perpetual contracts without having to manually write a rule for the roll of each product. This has proven to be a tricky task for me and since I use those contracts a lot in my analysis I am looking for a good solution.
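
To make that question concrete, one generic candidate is a volume-based roll: switch to the next expiry on the first day it out-trades the front contract, which avoids per-product calendars, though I would be curious whether it holds up in practice. A rough sketch, where the column names and the difference-adjustment step are assumptions:

import pandas as pd

def stitch_volume_roll(front, back):
    # front and back are OHLCV frames for two consecutive expiries sharing a DatetimeIndex
    vol = pd.concat([front["Volume"], back["Volume"]],
                    axis=1, keys=["front", "back"]).dropna()
    takeover = vol["back"] > vol["front"]
    if not takeover.any():
        return front["Close"]
    roll_date = takeover.idxmax()  # first day the back contract out-trades the front

    # difference-adjust the front leg so the stitched series has no jump at the roll
    gap = back["Close"].loc[roll_date] - front["Close"].loc[roll_date]
    return pd.concat([front["Close"].loc[:roll_date] + gap,
                      back["Close"].loc[roll_date:].iloc[1:]])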

Hopefully this discussion will be of interest to readers that manage their own data in-house.

2 thoughts on “Give me good data, or give me death”

  1. I store the data in its original source format – whatever the broker/vendor gave me (csv, binary, json, …)

    Then I import it in my own formats (gzipped msgpack/json/numpy/custom) depending on need.

    Why store the original source format? Sometimes it changes, and if you don't store the data as you received it, your parser might fail to parse it and thus you'll have data holes. Also, sometimes I will want to extract more information from the source than when I originally imported it, or I may find errors in the parser, and having the original data allows a reparse.

    Also, read this:

    http://cyrille.rossant.net/moving-away-hdf5/

    1. Thanks for commenting, Gosu. I should have been clearer: I always store the original data for all the reasons you mentioned. It is always good to be able to go back to the original if something looks incorrect.

      Also, thank you for the article; it was a very interesting read. The problem with formats is that for whatever format I pick, it seems I can find an equivalent article essentially saying it doesn't work. I think if there were a perfect data format there wouldn't be any need to keep coming up with new ones, and everybody would be using the same one. I'll give HDF a shot and report back if I encounter some of the same issues.

      QF
