
Very slow hdf5 loading #2425

Open
OWissett opened this issue May 7, 2024 · 9 comments

Comments

@OWissett

OWissett commented May 7, 2024

To assist reproducing bugs, please include the following:

  • Operating System: Ubuntu 20.04.2 LTS
  • Python version: 3.11.9
  • Where Python was acquired: conda
  • h5py version: 3.9
  • HDF5 version: 1.12.2

I have two relatively large HDF5 files, one is 16GB and the other 30GB. Both files follow the same basic schema:

/a/b/c where c is a group which then contains several datasets (always the same dataset names and ranks, but with different data).

One file loads perfectly fine, and the other is very slow. Here is a histogram of the loading times:

[image: histogram of the loading times]

Here is my profiling code:

import json
import time

import h5py


def time_search(dataset_path, splits):
    with open(splits) as f:
        split_dict = json.load(f)

    keys = list(split_dict["train"])[:1_000]

    access_times = []

    with h5py.File(dataset_path, "r") as dataset:
        for key in keys:
            start = time.time()
            _ = dataset[key]  # looks up the object by path; does not read the data
            end = time.time()
            access_times.append(end - start)

    return access_times

I am not able to provide the actual data as it is confidential.

There are approximately 3 million c level groups in mine, and 2 million c level groups in Matt's. Both files are located on the same filesystem and the same NVMe drive. Neither uses chunked data (all datasets are CONTIG) nor compression.

Both files were created in a nearly identical way: written with the Rust hdf5 crate, then filtered down with h5py using h5py.copy. Mine had an additional step where I had to rename some of the groups using h5py.move.

Does anyone have any idea why this might be happening?

I have checked everything mentioned on Google or by ChatGPT.

@ajelenak
Contributor

ajelenak commented May 7, 2024

I have checked everything mentioned on Google or by ChatGPT.

This is not really helpful. Can you list at least some of those suggestions?

Are /a/b/ in /a/b/c always the same? So there are only a couple of million c groups?

@OWissett
Author

OWissett commented May 7, 2024

Yeah, sorry, that wasn't clear.

Mine
So we have ~300k top level groups (a). 3 million second level groups (b) in total (not per a). 2 c level groups per b.

Matt's
37k top level groups. 1.8 million second level groups (b) in total (not per a). 2 c level groups per b.

I made a mistake in the original post: each c level group contains 4 datasets. The number of b groups may vary quite a bit between a groups.

What I have checked so far:

  • File fragmentation
  • Compression
  • Which drive they are on
  • Whether the data was chunked (it isn't; see the check sketched below)

I have also tested with a fresh conda environment.
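
For reference, chunking and compression can be confirmed per dataset with h5py itself. This is only a sketch: the file name is a placeholder, and whether the checks were actually done this way is not stated in the thread.

import h5py

with h5py.File("data.h5", "r") as f:  # placeholder file name
    def report(name, obj):
        if isinstance(obj, h5py.Dataset):
            # chunks is None for contiguous layout; compression is None if no filter is applied
            print(name, "chunks:", obj.chunks, "compression:", obj.compression)

    f.visititems(report)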

@takluyver
Member

In the code you show, it looks like you're not actually loading the data. _ = dataset[key] looks up the dataset in the file, but it doesn't read it - datasets can be large, so h5py makes it easy to read part of a dataset. If you do want to try reading the data, change this line to something like:

# Allocate a new array, read whole dataset
_ = dataset[key][()]

# OR reuse an existing array of the right shape & type
dataset[key].read_direct(arr)

I would guess that this would make the difference much less dramatic.
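
For reference, a minimal way to time the lookup and the full read separately, adapting the profiling loop above (the file name is a placeholder, keys is assumed to come from the splits file as in the original script, and each key is assumed to resolve to a dataset):

import time

import h5py

lookup_times, read_times = [], []
with h5py.File("data.h5", "r") as f:  # placeholder file name
    for key in keys:
        t0 = time.perf_counter()
        dset = f[key]         # metadata lookup only
        t1 = time.perf_counter()
        _ = dset[()]          # actually read the whole dataset into memory
        t2 = time.perf_counter()
        lookup_times.append(t1 - t0)
        read_times.append(t2 - t1)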

It's still interesting that you see such a big difference in looking up datasets, though. I could make wild guesses about what might be going on, but I don't have any great ideas. I suspect there's not much we can work out unless you can share enough detail about creating the files for other people to replicate it.

@OWissett
Author

OWissett commented May 8, 2024

I am not able to provide details about the files due to it containing unpublished research data.

@takluyver
Member

I understand that, but it's not unusual for people to reproduce issues like this by writing code to produce similar files with meaningless data.

@tacaswell
Member

I suspect that this is fundamentally a duplicate of #1055, where very wide HDF5 files have poor performance.

I agree we need a script that generates representative files to do much more debugging (if I am correct about the issue here, the exact names and actual data values are irrelevant).

@ajelenak
Contributor

ajelenak commented May 8, 2024

It would be interesting to test whether Group.visititems() helps. It should iterate over all the groups only once.

Why so many groups in the file? Do they hold some kind of data in their names? If so, a "table of contents" dataset could help to go straight to the desired group and its datasets. The TOC dataset would combine all of the queried data from the group names plus an object reference to the group/dataset that is the result of a specific query. So for a group path /data_a/data_b/data_c/, a TOC dataset entry would be [data_a, data_b, data_c, <objref to /data_a/data_b/data_c/>].
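
A minimal sketch combining both ideas, visititems() plus a TOC dataset of object references (the file name, the "toc" dataset name, and the depth test for c level groups are placeholders and assumptions, not anything confirmed in this thread):

import h5py
import numpy as np

with h5py.File("data.h5", "a") as f:  # placeholder file name
    names, refs = [], []

    def collect(name, obj):
        # keep only the c level groups; the depth test assumes the /a/b/c layout described above
        if isinstance(obj, h5py.Group) and name.count("/") == 2:
            names.append(name)
            refs.append(obj.ref)

    f.visititems(collect)  # single pass over the whole hierarchy

    toc_dtype = np.dtype([
        ("path", h5py.string_dtype()),
        ("ref", h5py.ref_dtype),
    ])
    rows = np.empty(len(names), dtype=toc_dtype)
    rows["path"] = names
    rows["ref"] = refs
    f.create_dataset("toc", data=rows)

    # A later lookup can dereference directly instead of walking a/b/c by name:
    toc = f["toc"]
    first_group = f[toc[0]["ref"]]

This stores the full path as a single string rather than the separate a/b/c components from the suggestion above; splitting it into one field per name component works the same way.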

@OWissett
Author

The groups do hold data in their names. The data in question is going to be used to train machine learning models with the name of the group being the name of the data entry. I didn't realise that wide files have poor performance. Is it better practice to create fewer groups with larger datasets in them?

Thanks for the advice. I will try and get you a sample version of the file next week. Sorry for the difficulties with this.

Many thanks.

@ajelenak
Contributor

ajelenak commented May 10, 2024

I didn't realise that wide files have poor performance. Is it better practice to create fewer groups with larger datasets in them?

It is more accurate to say that the default libhdf5 settings may not apply well to your files. There are some storage features that could help, but they need to be applied at file creation.
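
As one illustration (a sketch only; whether these particular settings help for these files is an assumption, not something established in this thread), h5py exposes some creation-time settings directly, for example the newer file format bounds via libver="latest", which allows more efficient link storage for groups with many members:

import h5py

# Placeholder file names; rewrites the existing file into a new one created
# with the newer format bounds.
with h5py.File("original.h5", "r") as src, \
     h5py.File("rebuilt.h5", "w", libver="latest") as dst:
    for name in src:          # top level (a) groups
        src.copy(name, dst)   # recursive copy of each subtree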

The HDF Group holds regular weekly live Q&A sessions on Tuesdays at 1:20pm US Eastern time. If you can, I think bringing up this use case would be a very good opportunity to get immediate feedback from HDFG staff on possible solutions to try. There's nothing special to prepare; just show up, explain the problem, and answer any questions about the files.
