Very slow hdf5 loading #2425
Comments
This is not really helpful. Can you list at least some of those suggestions?
Yeah sorry, wasn't clear. I made a mistake in the original about which file is mine and which is Matt's. As for what I have checked: I have also tested with a fresh conda environment.
In the code you show, it looks like you're not actually loading the data.

```python
# Allocate a new array, read whole dataset
_ = dataset[key][()]

# OR reuse an existing array of the right shape & type
dataset[key].read_direct(arr)
```

I would guess that this would make the difference much less dramatic. It's still interesting that you see such a big difference in looking up datasets, though. I could make wild guesses about what might be going on, but I don't have any great ideas. I suspect there's not much we can work out unless you can share enough detail about creating the files for other people to replicate it.
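As a concrete illustration of that difference, here is a minimal sketch; the file name and path layout are made up for the demo and are not the reporter's schema:

```python
import numpy as np
import h5py

# Build a tiny demo file; the layout here is illustrative only
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("a/b/c/data", data=np.arange(1000, dtype=np.float64))

with h5py.File("demo.h5", "r") as f:
    dset = f["a/b/c/data"]            # cheap: only constructs a Dataset object
    arr1 = dset[()]                   # actually reads the whole dataset into a new array
    arr2 = np.empty(dset.shape, dtype=dset.dtype)
    dset.read_direct(arr2)            # reads into a preallocated array instead

print(arr1.shape, np.array_equal(arr1, arr2))
```

Indexing with `[()]` allocates a fresh array on every call, while `read_direct` reuses a buffer you already own, which matters when reading millions of small datasets in a loop.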
I am not able to provide details about the files because they contain unpublished research data.
I understand that, but it's not unusual for people to reproduce issues like this by writing code to produce similar files with meaningless data.
I suspect that this is fundamentally a duplicate of #1055, where very wide HDF5 files have poor performance. I agree we need a script that generates representative files to do much more debugging (if I am correct about the issue here, the exact names and actual data values are irrelevant).
It would be interesting to test whether Group.visititems() helps. It should iterate over all the groups only once. Why so many groups in the file? Do they hold some kind of data in their names? If so, a "table of contents" dataset could help to go straight to the desired group and its datasets. The TOC dataset would combine all the data in group names that is queried, plus an object reference to the group/dataset that is the result of a specific query. So for a group path …
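A minimal sketch of the visititems() approach, using a toy file with invented names standing in for the real "wide" layout:

```python
import numpy as np
import h5py

# Toy file with many small groups, a stand-in for the real wide layout
with h5py.File("wide.h5", "w") as f:
    for i in range(100):
        f.create_group(f"a/b/entry_{i:03d}").create_dataset("x", data=np.arange(10))

# One traversal over the whole tree, instead of a fresh path lookup per group
found = []
with h5py.File("wide.h5", "r") as f:
    def collect(name, obj):
        if isinstance(obj, h5py.Dataset):
            found.append(name)
    f.visititems(collect)

print(len(found))
```

The callback receives every group and dataset exactly once, so the cost of walking the hierarchy is paid a single time rather than per lookup.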
The groups do hold data in their names. The data in question is going to be used to train machine learning models, with the name of the group being the name of the data entry. I didn't realise that wide files have poor performance. Is it better practice to create fewer groups with larger datasets in them? Thanks for the advice. I will try and get you a sample version of the file next week. Sorry for the difficulties with this. Many thanks.
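Since the group names do carry data, the "table of contents" idea from the previous comment could be sketched with h5py object references roughly like this (entry names here are invented):

```python
import numpy as np
import h5py

# Hypothetical sketch: store the queryable part of each group name in a small
# "table of contents", plus an object reference pointing at the group itself.
entry_names = [f"sample_{i}" for i in range(50)]

with h5py.File("toc.h5", "w") as f:
    refs = []
    for n in entry_names:
        g = f.create_group(f"entries/{n}")
        g.create_dataset("x", data=np.arange(5))
        refs.append(g.ref)
    f.create_dataset("toc/names", data=entry_names, dtype=h5py.string_dtype())
    f.create_dataset("toc/refs", data=refs, dtype=h5py.ref_dtype)

# Query: scan the small TOC, then dereference straight to the target group
with h5py.File("toc.h5", "r") as f:
    names = f["toc/names"].asstr()[()]
    idx = int(np.nonzero(names == "sample_7")[0][0])
    target_name = f[f["toc/refs"][idx]].name

print(target_name)
```

Scanning one small dataset and dereferencing avoids walking a hierarchy of millions of groups for each query.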
It is more accurate to say that default libhdf5 settings may not apply well to your files. There are some storage features that could help, but they need to be applied at file creation. The HDF Group holds regular weekly live Q&A sessions on Tuesdays at 1:20pm US Eastern time. If you can, I think bringing up this use case would be a very nice opportunity to get immediate feedback from HDFG staff on possible solutions to try. There's nothing special to prepare; just show up to explain the problem and answer any questions about the files.
To assist reproducing bugs, please include the following:
conda
I have two relatively large HDF5 files; one is 16GB and the other 30GB. Both files follow the same basic schema:

`/a/b/c`

where `c` is a group which then contains several datasets (always the same dataset names and ranks, but with different data). One file loads perfectly fine, and the other is very slow. Here is a histogram of the loading times:
Here is my profiling code:
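The original snippet is not reproduced here; as a stand-in, per-group timing along these lines would produce such a histogram (file name and group names are invented):

```python
import time
import numpy as np
import h5py

# Stand-in for the reporter's profiling code (the real snippet is not shown):
# time each group lookup plus dataset read, collecting per-group load times.
with h5py.File("profile_demo.h5", "w") as f:
    for i in range(200):
        f.create_group(f"a/b/g{i}").create_dataset("x", data=np.arange(100))

load_times = []
with h5py.File("profile_demo.h5", "r") as f:
    parent = f["a/b"]
    for name in parent:             # iterate child group names
        t0 = time.perf_counter()
        _ = parent[name]["x"][()]   # look up the group and actually read its dataset
        load_times.append(time.perf_counter() - t0)

print(len(load_times))
```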
I am not able to provide the actual data as it is confidential.
There are approximately 3 million `c`-level groups in mine, and 2 million `c`-level groups in Matt's. Both files are located on the same filesystem and the same NVMe drive. Neither uses chunked data (all datasets are CONTIG) or compression.

Both files were created in a nearly identical way, using the Rust `hdf5` crate, and then using `h5py` to filter out datasets using `h5py.copy`. Mine did have an additional step where I had to rename some of the groups using `h5py.move`.

Does anyone have any idea why this might be happening?
I have checked everything mentioned on Google or by ChatGPT.