Memory leak with dataset fields function #2374

Open
robotanik opened this issue Feb 7, 2024 · 2 comments

Comments

@robotanik

I have found strange behaviour with the fields function of the dataset object. I have a fairly large (~300 GB) file with compound data, and I try to read the uint64 time stamps of all entries in a dataset of compound data. The resulting data is only a few MB. I did the following:

import h5py
import numpy

with h5py.File('path/to/file.hdf5', mode='r') as f:
    data = f['dataset_name'].fields('time')[:]
array = numpy.stack(data)

This works and I get a resulting numpy array with a size of a few MB. But the process takes about 3 min and completely fills my system memory. The memory stays full even after the with block; I tried to reclaim it with gc.collect(), which did nothing. I have found a "workaround":

import h5py
import numpy

data = []
with h5py.File('path/to/file.hdf5', mode='r') as f:
    for event in f['dataset_name']:
        data.append(event['time'])
array = numpy.stack(data)

As expected, this is still not fast, but it is faster than the first attempt and it keeps memory consumption low.

Summary of the h5py configuration

h5py 3.10.0
HDF5 1.14.2
Python 3.11.7 (tags/v3.11.7:fa7a6f2, Dec 4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]
sys.platform win32
sys.maxsize 9223372036854775807
numpy 1.26.3
cython (built with) 0.29.36
numpy (built against) 1.23.2
HDF5 (built against) 1.14.2

@Delengowski
Contributor

I'm presuming that, since you have a 300 GB file, your datasets are all chunked (compressed). Is that correct?

If so, it's really important to understand what exactly a compound type is at its lowest level.

If your dataset is a compound type and its shape is Nx1, what you have is essentially an array of structs. When you request a field, HDF5 must decompress each struct, i.e. every field, extract that single field, and then return it to you. It does this for each "chunk" of compressed data in that array.

So if you have 10000 elements, i.e. a 10000x1 array, with a chunk size of 1000x1, then it is reading 10 chunks, decompressing 1000 elements at a time, pulling out that single field, and then concatenating it all together.
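
To make that concrete, here is a minimal sketch of the situation; the dataset name, field names, and sizes are invented for illustration, and only the fields call mirrors the original report:

import h5py
import numpy as np

# Invented compound dtype standing in for the report's dataset.
compound = np.dtype([('time', np.uint64), ('value', np.float64)])

data = np.zeros(10_000, dtype=compound)
data['time'] = np.arange(10_000, dtype=np.uint64)

with h5py.File('example.hdf5', 'w') as f:
    # 10 chunks of 1000 whole structs each, compressed per chunk.
    f.create_dataset('dataset_name', data=data,
                     chunks=(1_000,), compression='gzip')

with h5py.File('example.hdf5', 'r') as f:
    # To return just 'time', HDF5 still decompresses every chunk in full
    # (all fields of all 1000 structs) before copying the one field out.
    times = f['dataset_name'].fields('time')[:]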

I strongly suggest you change your data type if you want to be fast at reading a small subset of "columns".

The opposite end of a compound type is a group of datasets, where each would-be field of the compound type is its own dataset and is compressed by itself.

If you really want to optimize it, group the columns by data type, and then, among columns of the same data type, compress together the ones that are commonly read together.

I.e. you have a table like this

x, y, z, t, id, msg

float, float, float, float, int, char

xyz represents positions; you always pull the full vector, so you make an Nx3 array and compress that together. Time goes by itself, as do id and msg (see the sketch below).
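
A rough sketch of that layout, with placeholder names and sizes rather than anything taken from the original file:

import h5py
import numpy as np

n = 10_000
xyz = np.random.random((n, 3))       # positions, always read together
t = np.arange(n, dtype=np.float64)   # time stamps
ids = np.arange(n, dtype=np.int64)
msg = np.array([b'hello'] * n)       # fixed-length byte strings

with h5py.File('columns.hdf5', 'w') as f:
    grp = f.create_group('table')
    grp.create_dataset('xyz', data=xyz, chunks=(1_000, 3), compression='gzip')
    grp.create_dataset('t', data=t, chunks=(1_000,), compression='gzip')
    grp.create_dataset('id', data=ids, chunks=(1_000,), compression='gzip')
    grp.create_dataset('msg', data=msg, chunks=(1_000,), compression='gzip')

# Reading the time column now only touches the chunks of 't':
with h5py.File('columns.hdf5', 'r') as f:
    times = f['table/t'][:]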

@Delengowski
Contributor

On the memory issue side, internally HDF5 will cache each chunk and its associated metadata, on the chance that you'll want to read the same chunk twice. If you know you'll never do this, you can set this cache to 0 when opening the file.

I suggest reading this over:

https://docs.h5py.org/en/stable/high/file.html#chunk-cache
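
As a sketch (reusing the path and dataset name from the snippet above), passing rdcc_nbytes=0 to h5py.File disables the raw-data chunk cache:

import h5py

# Chunk cache disabled, since each chunk is only read once here.
with h5py.File('path/to/file.hdf5', mode='r', rdcc_nbytes=0) as f:
    times = f['dataset_name'].fields('time')[:]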
