Default chunksize with unlimited dimensions leads to huge output files #2029

Open
dionhaefner opened this issue Jan 5, 2022 · 11 comments · May be fixed by #2036

Comments

@dionhaefner

Over at h5netcdf, we noticed that writing files with unlimited dimensions resulted in ~100x larger file sizes (2GB vs. 20MB) compared to using netCDF4 (h5netcdf/h5netcdf#52). This is caused by the different chunking heuristics used by h5py vs. netCDF4. The netCDF4 one is here:

https://github.com/Unidata/netcdf-c/blob/a57101d4b7dcabf7d477c08eee9a5a126732f702/libhdf5/hdf5var.c#L104-L225

Besides targeting larger chunk sizes than h5py (16 MB vs. 500 kB, with a default cache size of 32 MB), it also gives unlimited dimensions a chunk size of 1. h5py instead uses a base size of 1024, which is then halved iteratively (along with all other chunk sizes) until the total size lies within the target range:

h5py/h5py/_hl/filters.py

Lines 348 to 349 in 1d569e6

# For unlimited dimensions we have to guess 1024
shape = tuple((x if x!=0 else 1024) for i, x in enumerate(shape))
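
For illustration, here is a quick way to see what the current heuristic picks for a resizable dataset (file name, shape and dtype are made up; the exact chunk shape printed will vary):

import h5py

with h5py.File("chunk_demo.h5", "w") as f:
    # No explicit chunks: because maxshape is given, h5py auto-chunks via
    # guess_chunk(), which seeds the size-0 unlimited axis with 1024 before
    # halving all axes towards its target chunk size.
    dset = f.create_dataset(
        "data",
        shape=(0, 500, 500),
        maxshape=(None, 500, 500),
        dtype="f8",
    )
    print(dset.chunks)  # guessed chunk shape; the unlimited axis is typically much larger than 1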

The problem with this is that, in all the cases I've seen, unlimited dimensions are used for something like a time axis that you append to on every iteration of a simulation. So in my experience a chunk size of 1 makes a lot more sense when there is only one unlimited dimension and several fixed dimensions.

We found that in some quick and dirty benchmarks, the h5py heuristic leads to ~10% faster reads and writes than the netCDF4 one in the absence of unlimited dimensions, so it seems to work well overall.

To circumvent the issue, I would suggest a heuristic similar to this:

def guess_chunk(shape, maxshape, typesize):
    # this part is similar to the netCDF4 heuristic
    num_dim = len(shape)
    num_unlimited = sum(x == 0 for x in shape)

    if num_dim == num_unlimited:
        # all dimensions are unlimited, keep current behavior
        unlimited_base_size = 1024
    else:
        unlimited_base_size = 1

    shape = tuple((x if x != 0 else unlimited_base_size) for x in shape)
    # proceed with h5py heuristic
    ...

Do you have any experience with this issue, and / or would you be interested in adopting a modified heuristic like this?

@tacaswell
Member

If you use a compression filter, does the problem go away?

From the comments in h5netcdf/h5netcdf#52 I infer that you have many relatively short (in the unlimited direction) data sets?

@kmuehlbauer
Contributor

kmuehlbauer commented Jan 6, 2022

Invoking compression with compression="gzip", compression_opts=4, shuffle=True, which should be complementary to the netCDF4 standard zlib=True (using complevel=4 and shuffle=True), I get the following for the example data from h5netcdf/h5netcdf#52 (both calls are sketched at the end of this comment):

compression = 4
h5py    -> 20 MB, execution time: 8.49 s
netcdf4 -> 16 MB, execution time: 0.49 s

compression = 3
h5py    -> 24 MB, execution time: 3.57 s
netcdf4 -> 16 MB, execution time: 0.45 s

This is no longer 100x larger, but still 25-50% larger, and the execution time is significantly longer. The automatic chunk sizes do not fit the dataset with the unlimited dimension well in this use case.

Update: Added/fixed default compression level. Added execution time.
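
For reference, here is a rough sketch of the two calls compared above (file names, variable names, shapes and dtypes are made up for illustration):

import h5py
import netCDF4

# h5py: gzip level 4 plus shuffle on a dataset with an unlimited time axis.
with h5py.File("h5py_out.h5", "w") as f:
    f.create_dataset(
        "data",
        shape=(0, 500, 500),
        maxshape=(None, 500, 500),
        dtype="f4",
        compression="gzip",
        compression_opts=4,
        shuffle=True,
    )

# netCDF4: zlib level 4 plus shuffle, same layout.
with netCDF4.Dataset("netcdf4_out.nc", "w") as ds:
    ds.createDimension("time", None)  # unlimited
    ds.createDimension("x", 500)
    ds.createDimension("y", 500)
    ds.createVariable(
        "data", "f4", ("time", "x", "y"),
        zlib=True, complevel=4, shuffle=True,
    )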

@kmuehlbauer
Contributor

From the comments in h5netcdf/h5netcdf#52 I infer that you have many relatively short (in the unlimited direction) data sets?

Yes, that is correct. The dataset is extended along the unlimited dimension in single steps of size 1, so the best chunk size for that dimension would be 1.

@dionhaefner
Author

Yes, and my point was more general: unlimited dimensions are typically used for "timestep", where you almost always append 1 at a time.

@ajelenak
Contributor

ajelenak commented Jan 6, 2022

Chunks of shape (time=1, ...) often perform poorly for those who want to extract time series. I think those who want to use chunking should explicitly specify chunk shape and not rely on any formula for guessing.
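
For example, explicitly choosing the chunk shape in h5py is a single keyword argument (names and sizes below are made up):

import h5py

with h5py.File("explicit_chunks.h5", "w") as f:
    # chunks=(1, 500, 500) favours appending one time step at a time;
    # someone extracting time series would instead pick a larger chunk
    # along the unlimited time axis.
    f.create_dataset(
        "data",
        shape=(0, 500, 500),
        maxshape=(None, 500, 500),
        dtype="f8",
        chunks=(1, 500, 500),
    )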

@dionhaefner
Author

Chunking is mandatory when using unlimited dimensions, so IMO we should try to supply a sane default.

The current default writes 1024x larger files than needed (in the worst case), which to me sounds worse than degraded reading performance. But then again you probably know better how people use these features in the wild, so if you disagree we can close this.

@takluyver
Member

Going to the other extreme, if you make a dataset with maxshape=(5, None), the suggested heuristic would give it chunks of (5, 1), which is pathologically small, making it slower to read data. Maybe that's not so common, though I've seen EuXFEL datasets with maxshape=(1, None) (not created through Python, so they wouldn't be directly affected).

Maybe rather than going straight to 1 for unlimited dimensions, the heuristic should prioritise cutting the unlimited dimensions down, so it would get to (250, 250, 40, 1) before trying to cut down the fixed dimensions. That would allow it to still produce a larger chunk if the other dimensions are very small. OTOH, heuristics are hard, and maybe that breaks something else I haven't thought of. It obviously assumes that you'll tend to read & write pieces across (not along) the unlimited dimension.
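
A minimal sketch of that idea, assuming maxshape is a tuple with None for unlimited axes (the target size and halving order here are illustrative, not h5py's actual constants):

import math

TARGET_BYTES = 1024 * 1024  # illustrative target chunk size

def guess_chunk_prefer_unlimited(shape, maxshape, typesize):
    unlimited = [m is None for m in maxshape]
    # Seed size-0 unlimited axes with the usual 1024 guess.
    chunks = [1024 if (u and s == 0) else max(s, 1) for s, u in zip(shape, unlimited)]

    def too_big():
        return math.prod(chunks) * typesize > TARGET_BYTES

    # Phase 1: halve the unlimited axes (round robin) until they reach 1
    # or the chunk fits the target.
    i = 0
    while too_big() and any(c > 1 for c, u in zip(chunks, unlimited) if u):
        if unlimited[i] and chunks[i] > 1:
            chunks[i] = math.ceil(chunks[i] / 2)
        i = (i + 1) % len(chunks)

    # Phase 2: only then start halving the fixed axes as well.
    i = 0
    while too_big() and any(c > 1 for c in chunks):
        if chunks[i] > 1:
            chunks[i] = math.ceil(chunks[i] / 2)
        i = (i + 1) % len(chunks)

    return tuple(chunks)

With these illustrative numbers, shape=(0, 500, 500) with maxshape=(None, 500, 500) and float64 ends up with a chunk of 1 along the time axis before any fixed axis is touched, while maxshape=(5, None) keeps a large chunk along the unlimited axis because the fixed part is tiny.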

@dionhaefner
Author

Yes, I agree. It also occurred to me that this would imply radically different chunks for maxshape=(None,) and maxshape=(None, 1), which feels icky.

Maybe rather than going straight to 1 for unlimited dimensions, the heuristic should prioritise cutting the unlimited dimensions down

Excellent suggestion IMO :) Much better than what I came up with.

@tacaswell
Member

I second everything @takluyver said.

Maybe rather than going straight to 1 for unlimited dimensions, the heuristic should prioritise cutting the unlimited dimensions down

That makes a lot of sense to me: the set of fixed dimensions likely represents some notional "quantum" of data from the user's point of view. In turn, that strongly suggests they are going to want to both read and write in multiples of that size, so we should try to keep it in a single chunk when possible!

@dionhaefner
Author

Looks like we're on the same page. I could put that into a PR soon unless someone objects.

@dionhaefner linked a pull request on Jan 10, 2022 that will close this issue
@tpongo-afk

I'm not sure if that's the problem actually.
One mistake I noticed is that the definition of unlimited dimensions is wrong.
Line 349 in filters.py should be something like:
shape = tuple((x if y is not None else 1024) for x, y in zip(shape, maxshape))
That is, a dimension is unlimited when its maxshape entry is None, not when the shape is 0 along that axis.
This supposes that maxshape is a tuple, although it can in principle still be True at this point because of the following part (the True possibility is not documented):

h5py/h5py/_hl/filters.py

Lines 247 to 253 in b3697ab

if (chunks is True) or \
   (chunks is None and any((shuffle, fletcher32, compression, maxshape,
                            scaleoffset is not None))):
    chunks = guess_chunk(shape, maxshape, dtype.itemsize)

if maxshape is True:
    maxshape = (None,)*len(shape)
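
A case that illustrates the distinction (names and sizes are made up): a dataset that is resizable along axis 0 but whose initial extent there is non-zero would not be recognised as unlimited by a shape-based check.

import h5py

with h5py.File("maxshape_demo.h5", "w") as f:
    dset = f.create_dataset(
        "data",
        shape=(5, 100),         # no zero entries in the shape
        maxshape=(None, 100),   # yet axis 0 is unlimited
        dtype="f8",
    )
    print(dset.shape)     # (5, 100)
    print(dset.maxshape)  # (None, 100)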
