
Very slow hdf5 loading #2425

Open
OWissett opened this issue May 7, 2024 · 9 comments

Comments

@OWissett

OWissett commented May 7, 2024

To assist reproducing bugs, please include the following:

  • Operating System: Ubuntu 20.04.2 LTS
  • Python version: 3.11.9
  • Where Python was acquired: conda
  • h5py version: 3.9
  • HDF5 version: 1.12.2

I have two relatively large HDF5 files, one is 16GB and the other 30GB. Both files follow the same basic schema:

/a/b/c where c is a group which then contains several datasets (always the same dataset names and ranks, but with different data).

One file loads perfectly fine, and the other is very slow. Here is a histogram of the loading times:

[image: histogram of the loading times]

Here is my profiling code:

import json
import time

import h5py


def time_search(dataset_path, splits):
    with open(splits) as f:
        split_dict = json.load(f)

    keys = list(split_dict["train"])[:1_000]

    access_times = []

    with h5py.File(dataset_path, "r") as dataset:
        for key in keys:
            start = time.time()
            _ = dataset[key]  # looks up the object by path; does not read the data
            end = time.time()
            access_times.append(end - start)

    return access_times

I am not able to provide the actual data as it is confidential.

There are approximately 3 million c level groups in mine, and 2 million c level groups in Matt's. Both files are located on the same filesystem and the same NVMe drive. Neither uses chunked data (all datasets are CONTIG) nor compression.

Both files were created in a nearly identical way: written with the Rust hdf5 crate, then filtered down with h5py using h5py.copy. Mine had an additional step where I had to rename some of the groups using h5py.move.

Does anyone have any idea why this might be happening?

I have checked everything mentioned on Google or by ChatGPT.

@ajelenak
Contributor

ajelenak commented May 7, 2024

I have checked everything mentioned on Google or by ChatGPT.

This is not really helpful. Can you list at least some of those suggestions?

Are /a/b/ in /a/b/c always the same? So there are only a couple of million c groups?

@OWissett
Author

OWissett commented May 7, 2024

Yeah, sorry, that wasn't clear.

Mine
So we have ~300k top level groups (a). 3 million second level groups (b) in total (not per a). 2 c level groups per b.

Matt's
37k top level groups. 1.8 million second level groups (b) in total (not per a). 2 c level groups per b.

I made a mistake in the original post: each c level group contains 4 datasets. The number of b groups may vary quite a bit between a groups.

What I have checked so far:

  • File fragmentation
  • Compression
  • Which drive they are on
  • Whether the data was chunked (it isn't; see the check sketched below)

I have also tested with a fresh conda environment.
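
For reference, chunking and compression can be confirmed per dataset with h5py itself. This is only a sketch: the file name is a placeholder, and whether the checks were actually done this way is not stated in the thread.

import h5py

with h5py.File("data.h5", "r") as f:  # placeholder file name
    def report(name, obj):
        if isinstance(obj, h5py.Dataset):
            # chunks is None for contiguous layout; compression is None if no filter is applied
            print(name, "chunks:", obj.chunks, "compression:", obj.compression)

    f.visititems(report)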

@takluyver
Member

In the code you show, it looks like you're not actually loading the data. _ = dataset[key] looks up the dataset in the file, but it doesn't read it - datasets can be large, so h5py makes it easy to read part of a dataset. If you do want to try reading the data, change this line to something like:

# Allocate a new array, read whole dataset
_ = dataset[key][()]

# OR reuse an existing array of the right shape & type
dataset[key].read_direct(arr)

I would guess that this would make the difference much less dramatic.
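
For reference, a minimal way to time the lookup and the full read separately, adapting the profiling loop above (the file name is a placeholder, keys is assumed to come from the splits file as in the original script, and each key is assumed to resolve to a dataset):

import time

import h5py

lookup_times, read_times = [], []
with h5py.File("data.h5", "r") as f:  # placeholder file name
    for key in keys:
        t0 = time.perf_counter()
        dset = f[key]         # metadata lookup only
        t1 = time.perf_counter()
        _ = dset[()]          # actually read the whole dataset into memory
        t2 = time.perf_counter()
        lookup_times.append(t1 - t0)
        read_times.append(t2 - t1)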

It's still interesting that you see such a big difference in looking up datasets, though. I could make wild guesses about what might be going on, but I don't have any great ideas. I suspect there's not much we can work out unless you can share enough detail about creating the files for other people to replicate it.

@OWissett
Author

OWissett commented May 8, 2024

I am not able to provide details about the files due to it containing unpublished research data.

@takluyver
Member

I understand that, but it's not unusual for people to reproduce issues like this by writing code to produce similar files with meaningless data.

@tacaswell
Member

I suspect that this is fundamentally a duplicate of #1055, where very wide HDF5 files have poor performance.

I agree we need a script that generates representative files to do much more debugging (if I am correct about the issue here, the exact names and actual data values are irrelevant).

@ajelenak
Contributor

ajelenak commented May 8, 2024

It would be interesting to test whether Group.visititems() helps. It should iterate over all the groups only once.

Why so many groups in the file? Do they hold some kind of data in their names? If so, a "table of contents" dataset could help to go straight to the desired group and its datasets. The TOC dataset would combine all of the queried data from the group names plus an object reference to the group/dataset that is the result of a specific query. So for a group path /data_a/data_b/data_c/, a TOC dataset entry would be [data_a, data_b, data_c, <objref to /data_a/data_b/data_c/>].
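
A minimal sketch combining both ideas, visititems() plus a TOC dataset of object references (the file name, the "toc" dataset name, and the depth test for c level groups are placeholders and assumptions, not anything confirmed in this thread):

import h5py
import numpy as np

with h5py.File("data.h5", "a") as f:  # placeholder file name
    names, refs = [], []

    def collect(name, obj):
        # keep only the c level groups; the depth test assumes the /a/b/c layout described above
        if isinstance(obj, h5py.Group) and name.count("/") == 2:
            names.append(name)
            refs.append(obj.ref)

    f.visititems(collect)  # single pass over the whole hierarchy

    toc_dtype = np.dtype([
        ("path", h5py.string_dtype()),
        ("ref", h5py.ref_dtype),
    ])
    rows = np.empty(len(names), dtype=toc_dtype)
    rows["path"] = names
    rows["ref"] = refs
    f.create_dataset("toc", data=rows)

    # A later lookup can dereference directly instead of walking a/b/c by name:
    toc = f["toc"]
    first_group = f[toc[0]["ref"]]

This stores the full path as a single string rather than the separate a/b/c components from the suggestion above; splitting it into one field per name component works the same way.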

@OWissett
Author

The groups do hold data in their names. The data in question is going to be used to train machine learning models with the name of the group being the name of the data entry. I didn't realise that wide files have poor performance. Is it better practice to create fewer groups with larger datasets in them?

Thanks for the advice. I will try and get you a sample version of the file next week. Sorry for the difficulties with this.

Many thanks.

@ajelenak
Contributor

ajelenak commented May 10, 2024

I didn't realise that wide files have poor performance. Is it better practice to create fewer groups with larger datasets in them?

It is more accurate to say that the default libhdf5 settings may not apply well to your files. There are some storage features that could help, but they need to be applied at file creation.
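
As one illustration (a sketch only; whether these particular settings help for these files is an assumption, not something established in this thread), h5py exposes some creation-time settings directly, for example the newer file format bounds via libver="latest", which allows more efficient link storage for groups with many members:

import h5py

# Placeholder file names; rewrites the existing file into a new one created
# with the newer format bounds.
with h5py.File("original.h5", "r") as src, \
     h5py.File("rebuilt.h5", "w", libver="latest") as dst:
    for name in src:          # top level (a) groups
        src.copy(name, dst)   # recursive copy of each subtree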

The HDF Group holds regular weekly live Q&A sessions on Tuesdays at 1:20pm US Eastern time. If you can, I think bringing up this use case would be a very good opportunity to get immediate feedback from HDFG staff on possible solutions to try. There's nothing special to prepare; just show up, explain the problem, and answer any questions about the files.
