Saving to hdf5 is very slow (Python freezing)
I'm trying to save bottleneck values to a newly created hdf5 file.
The bottleneck values come in batches of shape (120, 10, 10, 2048).
Saving a single batch takes up more than 16 gigs, and Python seems to freeze at that one batch. Based on recent findings (see the update below), it seems that hdf5 taking up a large amount of memory is okay, but the freezing part seems to be a glitch.
I'm only trying to save the first 2 batches for test purposes, and only the training data set (once again, this is a test run), but I can't even get past the first batch. It just stalls at the first batch and doesn't loop to the next iteration. If I try to check the hdf5 file, Explorer gets sluggish and Python freezes. If I try to kill Python (even without checking the hdf5 file), Python doesn't close properly and forces a restart.
Here is the relevant code and data:
Total data points are about 90,000, delivered in batches of 120.
Bottleneck shape is (120,10,10,2048)
So the first batch I'm trying to save is (120,10,10,2048)
Here is how I tried to save the dataset:
with h5py.File(hdf5_path, mode='w') as hdf5:
    hdf5.create_dataset("train_bottle", train_shape, np.float32)
    hdf5.create_dataset("train_labels", (len(train.filenames), params['bottle_labels']), np.uint8)
    hdf5.create_dataset("validation_bottle", validation_shape, np.float32)
    hdf5.create_dataset("validation_labels",
                        (len(valid.filenames), params['bottle_labels']), np.uint8)

    # this first part above works fine
    current_iteration = 0
    print('created_datasets')
    for x, y in train:
        number_of_examples = len(train.filenames)  # number of images
        prediction = model.predict(x)
        labels = y
        print(prediction.shape)  # (120, 10, 10, 2048)
        print(y.shape)           # (120, 12)
        print('start', current_iteration * params['batch_size'])      # 0
        print('end', (current_iteration + 1) * params['batch_size'])  # 120
        hdf5["train_bottle"][current_iteration * params['batch_size']:
                             (current_iteration + 1) * params['batch_size'], ...] = prediction
        hdf5["train_labels"][current_iteration * params['batch_size']:
                             (current_iteration + 1) * params['batch_size'], ...] = labels
        current_iteration += 1
        print(current_iteration)
        if current_iteration == 3:
            break
This is the output of the print statements:
(90827, 10, 10, 2048) # print(train_shape)
(6831, 10, 10, 2048) # print(validation_shape)
created_datasets
(120, 10, 10, 2048) # print(prediction.shape)
(120, 12) #label.shape
start 0 #start of batch
end 120 #end of batch
# Just stalls here instead of printing `print(current_iteration)`
It just stalls here for a while (20+ minutes), and the hdf5 file slowly grows in size (around 20 gigs now, before I force kill). Actually, I can't even force kill with Task Manager; I have to restart the OS to actually kill Python in this case.
Update
After playing around with my code for a bit, there seems to be a strange bug/behavior.
The relevant part is here:
hdf5["train_bottle"][current_iteration*params['batch_size']: (current_iteration+1) * params['batch_size'],...] = prediction
hdf5["train_labels"][current_iteration*params['batch_size']: (current_iteration+1) * params['batch_size'],...] = labels
If I run either of these lines on its own, my script goes through the iterations and automatically breaks as expected. So there is no freeze if I run either-or. It also happens fairly quickly -- less than one minute.
If I run only the first line ('train_bottle'), my memory usage is about 69-72 gigs, even if it's only a couple of batches. If I try more batches, the memory is the same. So I'm assuming train_bottle decided storage based on the size parameters I'm assigning to the dataset, and not on when it actually gets filled (a way to check this is sketched at the end of this update). So despite the 72 gigs, it's running fairly quickly (one minute).
If I run only the second line (train_labels), my memory takes up just a few megabytes. There is no problem with the iterations, and the break statement is executed.
However, here is the problem: if I try to run both lines (which in my case is necessary, as I need to save both 'train_bottle' and 'train_labels'), I experience a freeze on the first iteration, and it doesn't continue to the second iteration, even after 20 minutes. The hdf5 file is slowly growing, but if I try to access it, Windows Explorer slows to a crawl and I can't close Python -- I have to restart the OS.
So I'm not sure what the problem is when running both lines -- when I run just the memory-hungry 'train_bottle' line on its own, it works perfectly and finishes within a minute.
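For reference, here is a minimal sketch of how the allocation assumption above could be checked, using h5py's .chunks attribute and the low-level get_storage_size() (the comments describe what I expect to see, not measured values):

import h5py

# Minimal sketch: ask HDF5 how the dataset is laid out and how much space is actually
# allocated on disk. hdf5_path is the same file created in the code above.
with h5py.File(hdf5_path, mode='r') as f:
    dset = f["train_bottle"]
    print(dset.shape)                  # declared shape, e.g. (90827, 10, 10, 2048)
    print(dset.chunks)                 # None means a contiguous (non-chunked) layout
    print(dset.id.get_storage_size())  # bytes actually allocated for the raw data on disk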
Writing Data to HDF5
If you write to a chunked dataset without specifying a chunk shape, h5py will choose one automatically for you. Since h5py can't know how you want to write or read the data from the dataset, this will often end up in bad performance.
You also use the default chunk-cache size of 1 MB. If you only write to a part of a chunk and the chunk doesn't fit in the cache (which is very likely with a 1 MB chunk-cache size), the whole chunk will be read into memory, modified and written back to disk. If that happens multiple times you will see performance far below the sequential IO speed of your HDD/SSD.
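As a side note, if you cannot use the h5py_cache package below: recent h5py versions (2.9 and later) also let you set the chunk cache directly when opening the file -- a minimal sketch:

import h5py

# Sketch only: h5py >= 2.9 exposes the chunk-cache size directly, so h5py_cache
# is not strictly required. rdcc_nbytes is the raw data chunk cache size in bytes.
f = h5py.File('Test.h5', 'w', rdcc_nbytes=1024**2 * 200)  # 200 MB chunk cache
f.close()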
In the following example I assume that you only read or write along your first dimension. If not, this has to be adapted to your needs.
import numpy as np
import tables  # register blosc
import h5py as h5
import h5py_cache as h5c
import time

batch_size = 120
train_shape = (90827, 10, 10, 2048)
hdf5_path = 'Test.h5'

# As we are writing whole chunks here this isn't really needed, but if you forget
# to set a large enough chunk-cache-size when not writing or reading whole chunks,
# the performance will be extremely bad (chunks can only be read or written as a whole).
f = h5c.File(hdf5_path, 'w', chunk_cache_mem_size=1024**2 * 200)  # 200 MB cache size
dset_train_bottle = f.create_dataset("train_bottle", shape=train_shape, dtype=np.float32,
                                     chunks=(10, 10, 10, 2048),
                                     compression=32001, compression_opts=(0, 0, 0, 0, 9, 1, 1),
                                     shuffle=False)
prediction = np.array(np.arange(120 * 10 * 10 * 2048), np.float32).reshape(120, 10, 10, 2048)
t1 = time.time()

# Testing with 2 GB of data
for i in range(20):
    # prediction = np.array(np.arange(120*10*10*2048), np.float32).reshape(120, 10, 10, 2048)
    dset_train_bottle[i * batch_size:(i + 1) * batch_size, :, :, :] = prediction

f.close()
print(time.time() - t1)
print("MB/s: " + str(2000 / (time.time() - t1)))
Edit: The data creation in the loop took quite a lot of time, so I now create the data before the time measurement.
This should give at least 900 MB/s throughput (CPU limited). With real data and lower compression ratios, you should easily reach the sequential IO speed of your hard disk.
Opening an HDF5 file with the with statement can also lead to bad performance if you make the mistake of calling that block multiple times. That would close and reopen the file, discarding the chunk cache.
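To make the point concrete, here is a minimal, self-contained sketch of the pattern to avoid versus the one to prefer (the file name cache_demo.h5 and the uncompressed dataset are just placeholders for this illustration):

import numpy as np
import h5py

prediction = np.zeros((120, 10, 10, 2048), np.float32)  # stand-in for one batch

# Create a chunked dataset once (no compression here, to keep the sketch minimal).
with h5py.File('cache_demo.h5', 'w') as f:
    f.create_dataset("train_bottle", shape=(2400, 10, 10, 2048),
                     dtype=np.float32, chunks=(10, 10, 10, 2048))

# Pattern to avoid: reopening the file for every batch discards the chunk cache each time.
for i in range(20):
    with h5py.File('cache_demo.h5', 'a') as f:
        f["train_bottle"][i * 120:(i + 1) * 120, ...] = prediction

# Preferred: open the file once and keep it open for all batch writes.
with h5py.File('cache_demo.h5', 'a') as f:
    dset = f["train_bottle"]
    for i in range(20):
        dset[i * 120:(i + 1) * 120, ...] = prediction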
For determination of the right chunk-size I would also recommend: https://stackoverflow.com/a/48405220/4045774 https://stackoverflow.com/a/44961222/4045774
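For a rough sense of scale with the chunk shape used above (my own back-of-the-envelope arithmetic, not taken from the linked answers):

# One chunk of shape (10, 10, 10, 2048) stored as float32:
chunk_bytes = 10 * 10 * 10 * 2048 * 4             # = 8,192,000 bytes, about 8 MB
chunks_in_cache = (1024**2 * 200) // chunk_bytes  # a 200 MB cache holds about 25 such chunks
print(chunk_bytes, chunks_in_cache)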
If you have enough DDR memory and want extremely fast data loading and saving performance, use np.load() & np.save() directly. https://stackoverflow.com/a/49046312/2018567 np.load() & np.save() give you the fastest data loading and saving performance; so far I couldn't find any other tool or framework that could compete with it -- even HDF5's performance is only 1/5 ~ 1/7 of it.
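A minimal sketch of what that could look like for the batch pattern in the question, using np.lib.format.open_memmap so batches can be written incrementally into one .npy file (this particular function is my suggestion; the answer itself only mentions np.save() and np.load()):

import numpy as np

train_shape = (90827, 10, 10, 2048)
batch_size = 120

# Create a .npy file on disk and write batches into it without holding everything in RAM.
train_bottle = np.lib.format.open_memmap('train_bottle.npy', mode='w+',
                                         dtype=np.float32, shape=train_shape)
for i in range(2):  # first two batches, as in the question
    prediction = np.zeros((batch_size, 10, 10, 2048), np.float32)  # stand-in for model.predict(x)
    train_bottle[i * batch_size:(i + 1) * batch_size] = prediction
train_bottle.flush()

# Later, load it back; mmap_mode avoids reading the whole array into memory at once.
data = np.load('train_bottle.npy', mmap_mode='r')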
This answer is more like a comment on the disagreement between @max9111 and @Clock ZHONG. I wrote it to help other people wondering which is faster, HDF5 or np.save().
I used the code provided by @max9111 and modified it as suggested by @Clock ZHONG. The exact jupyter notebook can be found at https://github.com/wornbb/save_speed_test.
In short, with my spec:
- SSD: Samsung 960 EVO
- CPU: i7-7700K
- RAM: 2133 MHz 16GB
- OS: Win 10
HDF5 achieves 1339.5 MB/s while np.save is only 924.9 MB/s (without compression).
Also, as noted by @Clock ZHONG, they had a problem with the lzf filter. If you also have this problem, the posted jupyter notebook can be run with the conda distribution of Python 3 with pip-installed packages on Win 10.