Memory leak with dataset fields function #2374

Comments
I'm presuming that if you have a 300 GB file then your datasets are all chunked (compressed), is that correct? If so, it's really important to understand what a compound type is at its lowest level. If your dataset is a compound type and its shape is Nx1, what you essentially have is an array of structs. When you request a field, HDF5 must decompress each struct, i.e. every field, extract that single field, and then return it to you. It does this for each "chunk" of compressed data in that array. So if you have 10000 elements, i.e. a 10000x1 array, with a chunk size of 1000x1, then it reads 10 chunks, decompresses 1000 full elements at a time, pulls out that single field, and concatenates it all together.

I strongly suggest you change your data layout if you want fast reads of a small subset of "columns". The opposite end of a compound type is a group of datasets, where each field of the compound type is its own dataset and is compressed by itself. If you really want to optimize it, group columns by data type, and of the columns with the same data type, compress together the ones that are commonly read together. E.g. you have a table like this: x, y, z, t, id, msg with types float, float, float, float, int, char. x, y, z represent positions and you always pull the full vector, so you make an Nx3 array and compress that together. Time goes by itself, as does id, and msg.
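A minimal sketch of the layout described above; the file name, dataset names, row count, and chunk sizes are assumptions for illustration, not taken from the original issue:

```python
import h5py

N = 10_000  # assumed number of rows

with h5py.File("telemetry.h5", "w") as f:  # hypothetical file name
    # x, y, z are always read together, so store them as one Nx3 dataset
    # and compress those chunks as a unit.
    f.create_dataset("xyz", shape=(N, 3), dtype="f4",
                     chunks=(1000, 3), compression="gzip")
    # Columns that are read on their own each get their own dataset, so
    # reading "t" never decompresses positions, ids, or messages.
    f.create_dataset("t", shape=(N,), dtype="f4",
                     chunks=(1000,), compression="gzip")
    f.create_dataset("id", shape=(N,), dtype="i8",
                     chunks=(1000,), compression="gzip")
    f.create_dataset("msg", shape=(N,), dtype=h5py.string_dtype(),
                     chunks=(1000,), compression="gzip")
```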
On the memory issue side, HDF5 will internally cache each chunk and its associated metadata, on the chance that you'll want to read the same chunk twice. If you know you'll never do this, you can set this cache to 0 when opening the file. I suggest reading these over.
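In h5py the raw chunk cache can be controlled with the rdcc_* keyword arguments of h5py.File; a short sketch, with the file and dataset names assumed:

```python
import h5py

# rdcc_nbytes=0 disables the per-dataset raw chunk cache, so decompressed
# chunks are not retained after each read.
with h5py.File("data.h5", "r", rdcc_nbytes=0) as f:
    ds = f["mydataset"]
    # ... read from ds ...
```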
I have found a strange behaviour with the fields function of the dataset object. I have a fairly large (~300 GB) file with compound data. I try to read the uint64 timestamp of all the data in a dataset of compound data; the resulting data is only a few MB. I did the following:

This works and I get a resulting numpy array a few MB in size, but the process takes about 3 min and completely fills my system memory. The memory stays full after the with block. I tried to regain it with gc.collect(), which did nothing.

I have found a "workaround":

This is expectedly still not fast, but it is faster than the first attempt and it keeps memory consumption low.
Summary of the h5py configuration
h5py 3.10.0
HDF5 1.14.2
Python 3.11.7 (tags/v3.11.7:fa7a6f2, Dec 4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]
sys.platform win32
sys.maxsize 9223372036854775807
numpy 1.26.3
cython (built with) 0.29.36
numpy (built against) 1.23.2
HDF5 (built against) 1.14.2