Default chunksize with unlimited dimensions leads to huge output files #2029
Comments
If you use a compression filter, does the problem go away? From the comments in h5netcdf/h5netcdf#52, I infer that you have many relatively short (in the unlimited direction) datasets?
Invoking compression with compression=4 and compression=3 helps: the files are no longer 100 times larger, but still 25%/50% larger. The execution time is also significantly longer. The automatic chunk sizes do not fit well for the dataset with the unlimited dimension in this use case. (Update: added/fixed the default compression level and added execution times.)
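For reference, this is a minimal way to request compression when creating an unlimited dataset in h5py (a sketch; the file name, dataset name, and shapes are made up):

```python
import h5py

with h5py.File("out.h5", "w") as f:
    f.create_dataset(
        "var",
        shape=(0, 128, 128),
        maxshape=(None, 128, 128),  # None marks the unlimited dimension
        dtype="f8",
        compression="gzip",
        compression_opts=4,  # the level referred to as compression=4 above
    )
```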
Yes, that is correct. The dataset is extended along the unlimited dimension in single steps of size 1, so the best chunk size for this unlimited dimension would be 1.
Yes, and my point was more general: unlimited dimensions are typically used for "timestep", where you almost always append 1 at a time. |
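For concreteness, that access pattern looks roughly like this (a sketch; shapes and names are arbitrary):

```python
import h5py
import numpy as np

with h5py.File("steps.h5", "w") as f:
    dset = f.create_dataset("data", shape=(0, 128, 128),
                            maxshape=(None, 128, 128), dtype="f8")
    for step in range(100):
        # Each iteration grows the unlimited (time) axis by one entry
        dset.resize(dset.shape[0] + 1, axis=0)
        dset[-1] = np.zeros((128, 128))
```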
Chunks of extent 1 along the unlimited dimension would presumably slow down reads that span many entries in that dimension, though, since every entry would land in its own chunk.
Chunking is mandatory when using unlimited dimensions, so IMO we should try to supply a sane default. The current default writes 1024x larger files than needed (in the worst case: a dataset with a single entry along the unlimited axis still stores a full chunk spanning 1024 entries), which to me sounds worse than degraded reading performance. But then again you probably know better how people use these features in the wild, so if you disagree we can close this.
Going to the other extreme, if you make a dataset where every dimension is unlimited, a chunk extent of 1 in each of them would be terrible. Maybe rather than going straight to 1 for unlimited dimensions, the heuristic should prioritise cutting the unlimited dimensions down, so it would reach 1 along those axes before shrinking the fixed ones.
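A rough sketch of that prioritised reduction (my illustration, not actual h5py code; the 1024 base and ~500kB target are taken from the heuristic description in the issue body below):

```python
import math

BASE_UNLIMITED = 1024  # starting guess used for axes with maxshape None

def guess_chunks_prioritised(shape, unlimited, itemsize, target_bytes=512 * 1024):
    """Halve unlimited axes first; only shrink fixed axes once those hit 1.

    `unlimited` is a set of axis indices whose maxshape entry is None.
    """
    chunks = [BASE_UNLIMITED if ax in unlimited else max(n, 1)
              for ax, n in enumerate(shape)]

    def too_big():
        return itemsize * math.prod(chunks) > target_bytes

    # Phase 1: cut the unlimited axes down towards 1
    while too_big() and any(chunks[ax] > 1 for ax in unlimited):
        for ax in unlimited:
            chunks[ax] = max(1, chunks[ax] // 2)

    # Phase 2: only then start halving the fixed axes
    fixed = [ax for ax in range(len(shape)) if ax not in unlimited]
    while too_big() and any(chunks[ax] > 1 for ax in fixed):
        for ax in fixed:
            chunks[ax] = max(1, chunks[ax] // 2)
    return tuple(chunks)
```

For a typical time-series layout, e.g. `guess_chunks_prioritised((0, 128, 128), {0}, 8)`, this reaches a chunk extent of 1 along the time axis before touching the spatial axes, yielding chunks of (1, 128, 128).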
Yes, I agree. It also occurred to me that this would imply radically different chunks for otherwise identical datasets that differ only in whether a dimension is marked unlimited.
Excellent suggestion IMO :) Much better than what I came up with.
I second everything @takluyver said.
That makes a lot of sense to me; the set of fixed dimensions likely represents some notional "quantum" of data from the user's point of view. In turn, that strongly suggests they are going to want to both read and write in multiples of that size, so we should try to keep that in a single chunk when possible!
Looks like we're on the same page. I could put that into a PR soon unless someone objects.
I'm not sure that's actually the problem. (See lines 247 to 253 in b3697ab.)
Over at h5netcdf, we noticed that writing files with unlimited dimensions resulted in ~100x larger file sizes (2GB vs. 20MB) compared to using netCDF4 (h5netcdf/h5netcdf#52). This is caused by the different chunking heuristics used by h5py vs. netCDF4. The netCDF4 one is here:
https://github.com/Unidata/netcdf-c/blob/a57101d4b7dcabf7d477c08eee9a5a126732f702/libhdf5/hdf5var.c#L104-L225
Besides targeting larger chunk sizes than h5py (16MB vs. 500kB, with a default cache size of 32MB), it also gives unlimited dimensions a chunk size of 1. h5py uses a base size of 1024, which is then halved iteratively (along with all the other chunk sizes) until the total size lies within the target range (h5py/h5py/_hl/filters.py, lines 348 to 349 in 1d569e6).
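Paraphrased as standalone code, that current behaviour looks roughly like this (a loose rendition of the description above, not the real implementation):

```python
import math

def guess_chunks_current(shape, unlimited, itemsize, target_bytes=512 * 1024):
    # Unlimited axes start at the 1024 base; then every axis is halved together
    chunks = [1024 if ax in unlimited else max(n, 1)
              for ax, n in enumerate(shape)]
    while itemsize * math.prod(chunks) > target_bytes and max(chunks) > 1:
        chunks = [max(1, c // 2) for c in chunks]
    return tuple(chunks)
```

The halving stops as soon as the total fits, so the unlimited axis can keep a large chunk extent even though appends happen one step at a time; under this sketch, a float64 dataset of shape (0, 128, 128) ends up with chunks of (128, 16, 16).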
Now the problem with this is that in all cases I've seen, unlimited dimensions are used for something like a time axis that you append to in every iteration of a simulation. So in my experience a chunk size of 1 makes a lot more sense if you have only 1 unlimited dimension and several fixed dimensions.
We found that in some quick and dirty benchmarks, the h5py heuristic leads to ~10% faster reads and writes than the netCDF4 one in the absence of unlimited dimensions, so it seems to work well overall.
To circumvent the issue I would suggest a heuristic similar to this:
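In outline (a sketch of the idea, assuming the unlimited axes are known; not the exact snippet from the issue):

```python
import math

def guess_chunks(shape, maxshape, itemsize, target_bytes=512 * 1024):
    """Sketch: pin unlimited axes to 1, size the fixed axes by halving as before."""
    chunks = [1 if m is None else max(n, 1) for n, m in zip(shape, maxshape)]
    fixed = [ax for ax, m in enumerate(maxshape) if m is not None]
    while (itemsize * math.prod(chunks) > target_bytes
           and any(chunks[ax] > 1 for ax in fixed)):
        for ax in fixed:
            chunks[ax] = max(1, chunks[ax] // 2)
    return tuple(chunks)
```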
Do you have any experience with this issue, and / or would you be interested in adopting a modified heuristic like this?