Default chunksize with unlimited dimensions leads to huge output files #2029

Open
dionhaefner opened this issue Jan 5, 2022 · 11 comments · May be fixed by #2036

Comments

@dionhaefner

Over at h5netcdf, we noticed that writing files with unlimited dimensions resulted in ~100x larger file sizes (2GB vs. 20MB) compared to using netCDF4 (h5netcdf/h5netcdf#52). This is caused by the different chunking heuristics used by h5py vs. netCDF4. The netCDF4 one is here:

https://github.com/Unidata/netcdf-c/blob/a57101d4b7dcabf7d477c08eee9a5a126732f702/libhdf5/hdf5var.c#L104-L225

Besides targeting larger chunk sizes than h5py (16 MB vs. 500 kB, with a default cache size of 32 MB), it also gives unlimited dimensions a chunk size of 1. h5py instead uses a base size of 1024, which is then halved iteratively (along with all other chunk sizes) until the total size lies within the target range:

h5py/h5py/_hl/filters.py

Lines 348 to 349 in 1d569e6

# For unlimited dimensions we have to guess 1024
shape = tuple((x if x!=0 else 1024) for i, x in enumerate(shape))
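
For illustration, here is a quick way to see what the current heuristic picks for a resizable dataset (file name, shape and dtype are made up; the exact chunk shape printed will vary):

import h5py

with h5py.File("chunk_demo.h5", "w") as f:
    # No explicit chunks: because maxshape is given, h5py auto-chunks via
    # guess_chunk(), which seeds the size-0 unlimited axis with 1024 before
    # halving all axes towards its target chunk size.
    dset = f.create_dataset(
        "data",
        shape=(0, 500, 500),
        maxshape=(None, 500, 500),
        dtype="f8",
    )
    print(dset.chunks)  # guessed chunk shape; the unlimited axis is typically much larger than 1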

The problem with this is that, in all the cases I've seen, unlimited dimensions are used for something like a time axis that you append to on every iteration of a simulation. So in my experience a chunk size of 1 makes a lot more sense when there is only one unlimited dimension and several fixed dimensions.

We found that in some quick and dirty benchmarks, the h5py heuristic leads to ~10% faster reads and writes than the netCDF4 one in the absence of unlimited dimensions, so it seems to work well overall.

To circumvent the issue, I would suggest a heuristic similar to this:

def guess_chunk(shape, maxshape, typesize):
    # this part is similar to the netCDF4 heuristic
    num_dim = len(shape)
    num_unlimited = sum(x == 0 for x in shape)

    if num_dim == num_unlimited:
        # all dimensions are unlimited, keep current behavior
        unlimited_base_size = 1024
    else:
        unlimited_base_size = 1

    shape = tuple((x if x != 0 else unlimited_base_size) for x in shape)
    # proceed with h5py heuristic
    ...

Do you have any experience with this issue, and / or would you be interested in adopting a modified heuristic like this?

@tacaswell
Member

If you use a compression filter, does the problem go away?

From the comments in h5netcdf/h5netcdf#52 I infer that you have many relatively short (in the unlimited direction) data sets?

@kmuehlbauer
Contributor

kmuehlbauer commented Jan 6, 2022

Invoking compression with compression="gzip", compression_opts=4, shuffle=True, which should be complementary to the netCDF4 standard zlib=True (using complevel=4 and shuffle=True), I get the following for the example data from h5netcdf/h5netcdf#52 (both calls are sketched at the end of this comment):

compression = 4
h5py    -> 20 MB, execution time: 8.49 s
netcdf4 -> 16 MB, execution time: 0.49 s

compression = 3
h5py    -> 24 MB, execution time: 3.57 s
netcdf4 -> 16 MB, execution time: 0.45 s

This is no longer 100x larger, but still 25-50% larger, and the execution time is significantly longer. The automatic chunk sizes do not fit the dataset with the unlimited dimension well in this use case.

Update: Added/fixed default compression level. Added execution time.
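
For reference, here is a rough sketch of the two calls compared above (file names, variable names, shapes and dtypes are made up for illustration):

import h5py
import netCDF4

# h5py: gzip level 4 plus shuffle on a dataset with an unlimited time axis.
with h5py.File("h5py_out.h5", "w") as f:
    f.create_dataset(
        "data",
        shape=(0, 500, 500),
        maxshape=(None, 500, 500),
        dtype="f4",
        compression="gzip",
        compression_opts=4,
        shuffle=True,
    )

# netCDF4: zlib level 4 plus shuffle, same layout.
with netCDF4.Dataset("netcdf4_out.nc", "w") as ds:
    ds.createDimension("time", None)  # unlimited
    ds.createDimension("x", 500)
    ds.createDimension("y", 500)
    ds.createVariable(
        "data", "f4", ("time", "x", "y"),
        zlib=True, complevel=4, shuffle=True,
    )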

@kmuehlbauer
Contributor

From the comments in h5netcdf/h5netcdf#52 I infer that you have many relatively short (in the unlimited direction) data sets?

Yes, that is correct. The dataset is extended along the unlimited dimension in single steps of size 1, so the best chunk size for that dimension would be 1.

@dionhaefner
Author

Yes, and my point was more general: unlimited dimensions are typically used for "timestep", where you almost always append 1 at a time.

@ajelenak
Contributor

ajelenak commented Jan 6, 2022

Chunks of shape (time=1, ...) often perform poorly for those who want to extract time series. I think those who want to use chunking should explicitly specify chunk shape and not rely on any formula for guessing.
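
For example, explicitly choosing the chunk shape in h5py is a single keyword argument (names and sizes below are made up):

import h5py

with h5py.File("explicit_chunks.h5", "w") as f:
    # chunks=(1, 500, 500) favours appending one time step at a time;
    # someone extracting time series would instead pick a larger chunk
    # along the unlimited time axis.
    f.create_dataset(
        "data",
        shape=(0, 500, 500),
        maxshape=(None, 500, 500),
        dtype="f8",
        chunks=(1, 500, 500),
    )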

@dionhaefner
Author

Chunking is mandatory when using unlimited dimensions, so IMO we should try to supply a sane default.

The current default writes 1024x larger files than needed (in the worst case), which to me sounds worse than degraded reading performance. But then again you probably know better how people use these features in the wild, so if you disagree we can close this.

@takluyver
Member

Going to the other extreme, if you make a dataset with maxshape=(5, None), the suggested heuristic would give it chunks of (5, 1), which is pathologically small, making it slower to read data. Maybe that's not so common, though I've seen EuXFEL datasets with maxshape=(1, None) (not created through Python, so they wouldn't be directly affected).

Maybe rather than going straight to 1 for unlimited dimensions, the heuristic should prioritise cutting the unlimited dimensions down, so it would get to (250, 250, 40, 1) before trying to cut down the fixed dimensions. That would allow it to still produce a larger chunk if the other dimensions are very small. OTOH, heuristics are hard, and maybe that breaks something else I haven't thought of. It obviously assumes that you'll tend to read & write pieces across (not along) the unlimited dimension.
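
A minimal sketch of that idea, assuming maxshape is a tuple with None for unlimited axes (the target size and halving order here are illustrative, not h5py's actual constants):

import math

TARGET_BYTES = 1024 * 1024  # illustrative target chunk size

def guess_chunk_prefer_unlimited(shape, maxshape, typesize):
    unlimited = [m is None for m in maxshape]
    # Seed size-0 unlimited axes with the usual 1024 guess.
    chunks = [1024 if (u and s == 0) else max(s, 1) for s, u in zip(shape, unlimited)]

    def too_big():
        return math.prod(chunks) * typesize > TARGET_BYTES

    # Phase 1: halve the unlimited axes (round robin) until they reach 1
    # or the chunk fits the target.
    i = 0
    while too_big() and any(c > 1 for c, u in zip(chunks, unlimited) if u):
        if unlimited[i] and chunks[i] > 1:
            chunks[i] = math.ceil(chunks[i] / 2)
        i = (i + 1) % len(chunks)

    # Phase 2: only then start halving the fixed axes as well.
    i = 0
    while too_big() and any(c > 1 for c in chunks):
        if chunks[i] > 1:
            chunks[i] = math.ceil(chunks[i] / 2)
        i = (i + 1) % len(chunks)

    return tuple(chunks)

With these illustrative numbers, shape=(0, 500, 500) with maxshape=(None, 500, 500) and float64 ends up with a chunk of 1 along the time axis before any fixed axis is touched, while maxshape=(5, None) keeps a large chunk along the unlimited axis because the fixed part is tiny.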

@dionhaefner
Author

Yes, I agree. It also occurred to me that this would imply radically different chunks for maxshape=(None,) and maxshape=(None, 1), which feels icky.

Maybe rather than going straight to 1 for unlimited dimensions, the heuristic should prioritise cutting the unlimited dimensions down

Excellent suggestion IMO :) Much better than what I came up with.

@tacaswell
Member

I second everything @takluyver said.

Maybe rather than going straight to 1 for unlimited dimensions, the heuristic should prioritise cutting the unlimited dimensions down

That makes a lot of sense to me: the set of fixed dimensions likely represents some notional "quantum" of data from the user's point of view. In turn, that strongly suggests they are going to want to both read and write in multiples of that size, so we should try to keep it in a single chunk when possible!

@dionhaefner
Author

Looks like we're on the same page. I could put that into a PR soon unless someone objects.

@dionhaefner linked a pull request on Jan 10, 2022 that will close this issue
@tpongo-afk

I'm not sure if that's the problem actually.
One mistake I noticed is that the definition of unlimited dimensions is wrong.
Line 349 in filters.py should be something like:
shape = tuple((x if y is not None else 1024) for x, y in zip(shape, maxshape))
That is, a dimension is unlimited when its maxshape entry is None, not when the shape is 0 along that axis.
This supposes that maxshape is a tuple, although it can in principle still be True at this point because of the following part (the True possibility is not documented):

h5py/h5py/_hl/filters.py

Lines 247 to 253 in b3697ab

if (chunks is True) or \
   (chunks is None and any((shuffle, fletcher32, compression, maxshape,
                            scaleoffset is not None))):
    chunks = guess_chunk(shape, maxshape, dtype.itemsize)

if maxshape is True:
    maxshape = (None,)*len(shape)
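
A case that illustrates the distinction (names and sizes are made up): a dataset that is resizable along axis 0 but whose initial extent there is non-zero would not be recognised as unlimited by a shape-based check.

import h5py

with h5py.File("maxshape_demo.h5", "w") as f:
    dset = f.create_dataset(
        "data",
        shape=(5, 100),         # no zero entries in the shape
        maxshape=(None, 100),   # yet axis 0 is unlimited
        dtype="f8",
    )
    print(dset.shape)     # (5, 100)
    print(dset.maxshape)  # (None, 100)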
