Memory leak with dataset fields function #2374

Open
robotanik opened this issue Feb 7, 2024 · 2 comments

Comments

@robotanik

I have found strange behaviour with the fields function of the dataset object. I have a fairly large (~300 GB) file with compound data, and I try to read the uint64 time stamps of all entries in a dataset of compound data. The resulting data is only a few MB. I did the following:

import h5py
import numpy

with h5py.File('path/to/file.hdf5', mode='r') as f:
    data = f['dataset_name'].fields('time')[:]
array = numpy.stack(data)

This works and I get a resulting numpy array with a size of a few MB. But the process takes about 3 min and completely fills my system memory. The memory stays full even after the with block; I tried to reclaim it with gc.collect(), which did nothing. I have found a "workaround":

import h5py
import numpy

data = []
with h5py.File('path/to/file.hdf5', mode='r') as f:
    for event in f['dataset_name']:
        data.append(event['time'])
array = numpy.stack(data)

As expected, this is still not fast, but it is faster than the first attempt and it keeps memory consumption low.

Summary of the h5py configuration

h5py 3.10.0
HDF5 1.14.2
Python 3.11.7 (tags/v3.11.7:fa7a6f2, Dec 4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]
sys.platform win32
sys.maxsize 9223372036854775807
numpy 1.26.3
cython (built with) 0.29.36
numpy (built against) 1.23.2
HDF5 (built against) 1.14.2

@Delengowski
Contributor

I'm presuming that, since you have a 300 GB file, your datasets are all chunked (compressed). Is that correct?

If so, it's really important to understand what exactly a compound type is at its lowest level.

If your dataset is a compound type and its shape is Nx1, what you have is essentially an array of structs. When you request a field, HDF5 must decompress each struct, i.e. every field, extract that single field, and then return it to you. It does this for each "chunk" of compressed data in that array.

So if you have 10000 elements, i.e. a 10000x1 array, with a chunk size of 1000x1, then it is reading 10 chunks, decompressing 1000 elements at a time, pulling out that single field, and then concatenating it all together.
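
To make that concrete, here is a minimal sketch of the situation; the dataset name, field names, and sizes are invented for illustration, and only the fields call mirrors the original report:

import h5py
import numpy as np

# Invented compound dtype standing in for the report's dataset.
compound = np.dtype([('time', np.uint64), ('value', np.float64)])

data = np.zeros(10_000, dtype=compound)
data['time'] = np.arange(10_000, dtype=np.uint64)

with h5py.File('example.hdf5', 'w') as f:
    # 10 chunks of 1000 whole structs each, compressed per chunk.
    f.create_dataset('dataset_name', data=data,
                     chunks=(1_000,), compression='gzip')

with h5py.File('example.hdf5', 'r') as f:
    # To return just 'time', HDF5 still decompresses every chunk in full
    # (all fields of all 1000 structs) before copying the one field out.
    times = f['dataset_name'].fields('time')[:]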

I strongly suggest you change your data type if you want to be fast at reading a small subset of "columns".

The opposite end of a compound type is a group of datasets, where each would-be field of the compound type is its own dataset and is compressed by itself.

If you really want to optimize it, group the columns by data type, and then, among columns of the same data type, compress together the ones that are commonly read together.

I.e. you have a table like this

x, y, z, t, id, msg

float, float, float, float, int, char

xyz represents positions; you always pull the full vector, so you make an Nx3 array and compress that together. Time goes by itself, as do id and msg (see the sketch below).
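
A rough sketch of that layout, with placeholder names and sizes rather than anything taken from the original file:

import h5py
import numpy as np

n = 10_000
xyz = np.random.random((n, 3))       # positions, always read together
t = np.arange(n, dtype=np.float64)   # time stamps
ids = np.arange(n, dtype=np.int64)
msg = np.array([b'hello'] * n)       # fixed-length byte strings

with h5py.File('columns.hdf5', 'w') as f:
    grp = f.create_group('table')
    grp.create_dataset('xyz', data=xyz, chunks=(1_000, 3), compression='gzip')
    grp.create_dataset('t', data=t, chunks=(1_000,), compression='gzip')
    grp.create_dataset('id', data=ids, chunks=(1_000,), compression='gzip')
    grp.create_dataset('msg', data=msg, chunks=(1_000,), compression='gzip')

# Reading the time column now only touches the chunks of 't':
with h5py.File('columns.hdf5', 'r') as f:
    times = f['table/t'][:]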

@Delengowski
Contributor

On the memory issue side, internally HDF5 will cache each chunk and its associated metadata, on the chance that you'll want to read the same chunk twice. If you know you'll never do this, you can set this cache to 0 when opening the file.

I suggest reading this over:

https://docs.h5py.org/en/stable/high/file.html#chunk-cache
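
As a sketch (reusing the path and dataset name from the snippet above), passing rdcc_nbytes=0 to h5py.File disables the raw-data chunk cache:

import h5py

# Chunk cache disabled, since each chunk is only read once here.
with h5py.File('path/to/file.hdf5', mode='r', rdcc_nbytes=0) as f:
    times = f['dataset_name'].fields('time')[:]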
