Incremental writes to hdf5 with h5py
I have got a question about how best to write to hdf5 files with python / h5py.
I have data like:
-----------------------------------------
| timepoint | voltage1 | voltage2 | ...
-----------------------------------------
|       178 |       10 |       12 | ...
|       179 |       12 |       11 | ...
|       185 |        9 |       12 | ...
|       187 |       15 |       12 | ...
-----------------------------------------
...
with about 10^4 columns, and about 10^7 rows. (That's about 10^11 (100 billion) elements, or ~100GB with 1 byte ints).
With this data, typical use is pretty much write once, read many times, and the typical read case would be to grab column 1 and another column (say 254), load both columns into memory, and do some fancy statistics.
I think a good hdf5 structure would thus be to have each column in the table above be an hdf5 group, resulting in 10^4 groups. That way we won't need to read all the data into memory, yes? The hdf5 structure isn't defined yet though, so it can be anything.
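Roughly what I have in mind, just as a sketch (I'm sketching the columns as datasets here; the names, dtypes, and the placeholder row count are made up, since the real row count isn't known up front):

import h5py

n_rows_guess = 10**7                      # placeholder; not actually known yet
with h5py.File('/tmp/layout_sketch.h5', 'w') as f:
    # One 1-D dataset per column of the table above.
    f.create_dataset('timepoint', (n_rows_guess,), dtype='i8')
    for i in range(1, 4):                 # ... and so on up to voltage10000
        f.create_dataset('voltage%d' % i, (n_rows_guess,), dtype='i1')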
Now the question: I receive the data ~10^4 rows at a time (and not exactly the same numbers of rows each time), and need to write it incrementally to the hdf5 file. How do I write that file?
I'm considering python and h5py, but could use another tool if recommended. Is chunking the way to go, with e.g.
dset = f.create_dataset("voltage284", (100000,), maxshape=(None,), dtype='i8', chunks=(10000,))
and then when another block of 10^4 rows arrives, replace the dataset?
Or is it better to just store each block of 10^4 rows as a separate dataset? Or do I really need to know the final number of rows? (That'll be tricky to get, but maybe possible).
I can bail on hdf5 if it's not the right tool for the job too, though I think once the awkward writes are done, it'll be wonderful.
Solution 1:
Per the FAQ, you can expand the dataset using dset.resize. For example,
import os
import h5py
import numpy as np

path = '/tmp/out.h5'
if os.path.exists(path):
    os.remove(path)

with h5py.File(path, "a") as f:
    dset = f.create_dataset('voltage284', (10**5,), maxshape=(None,),
                            dtype='i8', chunks=(10**4,))
    dset[:] = np.random.random(dset.shape)
    print(dset.shape)
    # (100000,)

    for i in range(3):
        dset.resize(dset.shape[0] + 10**4, axis=0)
        dset[-10**4:] = np.random.random(10**4)
        print(dset.shape)
    # (110000,)
    # (120000,)
    # (130000,)
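The same pattern works when the incoming blocks vary in size. A minimal sketch of that, assuming the data arrives through some iterable of NumPy arrays (the append_block helper, the file name, and the fake incoming_blocks generator are mine, not part of the example above):

import h5py
import numpy as np

def append_block(dset, block):
    # Grow the dataset along axis 0, then write the new block into the tail.
    n_old = dset.shape[0]
    dset.resize(n_old + block.shape[0], axis=0)
    dset[n_old:] = block

# Stand-in for however the blocks actually arrive; note the varying sizes.
incoming_blocks = (np.random.randint(0, 16, size=n)
                   for n in (10**4, 9 * 10**3, 11 * 10**3))

with h5py.File('/tmp/out_blocks.h5', 'w') as f:
    dset = f.create_dataset('voltage284', (0,), maxshape=(None,),
                            dtype='i8', chunks=(10**4,))
    for block in incoming_blocks:
        append_block(dset, block)
    print(dset.shape)
    # (30000,)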
Solution 2:
As @unutbu pointed out, dset.resize is an excellent option. It may be worthwhile to look at pandas and its HDF5 support, which could be useful given your workflow. It sounds like HDF5 is a reasonable choice for your needs, but your problem may be expressed better with an additional layer on top.
One big thing to consider is the orientation of the data. If you're primarily interested in reads, and you primarily fetch data by column, then you may want to store the data transposed so that those column reads become row reads, since HDF5 stores data in row-major order.
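A sketch of what that could look like as a single 2-D dataset with the channels along the first (slowest-varying) axis; the dataset name, dtype, and chunk shape here are guesses rather than anything from the question:

import h5py
import numpy as np

n_channels = 10**4
with h5py.File('/tmp/out_transposed.h5', 'w') as f:
    # One row per channel; the time axis grows as blocks arrive.
    # The chunk shape is a guess and should be tuned to the access pattern.
    dset = f.create_dataset('voltages_T', (n_channels, 0),
                            maxshape=(n_channels, None),
                            dtype='i1', chunks=(1, 10**4))
    # A block arrives as (timepoints, channels); store it transposed.
    block = np.random.randint(0, 16, size=(10**3, n_channels), dtype='i1')
    n_old = dset.shape[1]
    dset.resize(n_old + block.shape[0], axis=1)
    dset[:, n_old:] = block.T

with h5py.File('/tmp/out_transposed.h5', 'r') as f:
    # Reading one channel now touches only that channel's chunks.
    v1, v254 = f['voltages_T'][1], f['voltages_T'][254]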