Reduce chunksizes for unlimited dimensions first #2036

Open: wants to merge 1 commit into master

Conversation


@dionhaefner dionhaefner commented Jan 10, 2022

Fixes #2029; see that issue for discussion and background on this change. In a nutshell, this prevents the presence of unlimited dimensions from blowing up the resulting file sizes (in cases where the unlimited dimensions stay small).

Some sanity checks that illustrate how the new method works:

>>> from h5py._hl.filters import guess_chunk
>>> guess_chunk((0, 250, 250, 120), None, 8)
(1, 16, 32, 15)  # master gives (32, 16, 16, 8)
>>> guess_chunk((1, 250, 250, 120), None, 8)
(1, 16, 32, 15)  # master gives (1, 16, 32, 15)
>>> guess_chunk((0,), None, 8)
(1024,)  # master gives (1024,)
>>> guess_chunk((0,) * 10, None, 8)
(2, 2, 2, 2, 2, 2, 2, 2, 2, 2)  # master gives (2, 2, 2, 2, 4, 4, 4, 4, 4, 4)
>>> guess_chunk((0, 1), None, 8)
(1024, 1)  # master gives (1024, 1)
>>> guess_chunk((2,) * 20 + (0,), None, 8)
(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1)
# master gives (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1024)

Note how the last case would lead to a 1024x larger file on master (for uncompressed data): with only one entry written along the unlimited axis, each chunk still allocates 1024 elements along it.

Let me know if you want me to add additional unit tests.
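For reference, here is how the heuristic is triggered end-to-end through the public API (a minimal sketch; the file and dataset names are arbitrary, and the printed chunks are the values this branch is expected to produce):

import h5py

with h5py.File("demo.h5", "w") as f:
    # An unlimited first dimension (maxshape entry None) plus chunks=True
    # makes h5py call guess_chunk to pick the chunk shape automatically.
    d = f.create_dataset(
        "data",
        shape=(0, 250, 250, 120),
        maxshape=(None, 250, 250, 120),
        dtype="f8",
        chunks=True,
    )
    print(d.chunks)  # (1, 16, 32, 15) on this branch, (32, 16, 16, 8) on master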

@codecov

codecov bot commented Jan 10, 2022

Codecov Report

Merging #2036 (2930f91) into master (aa31f03) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #2036      +/-   ##
==========================================
+ Coverage   89.90%   89.91%   +0.01%     
==========================================
  Files          17       17              
  Lines        2307     2310       +3     
==========================================
+ Hits         2074     2077       +3     
  Misses        233      233              
Impacted Files Coverage Δ
h5py/_hl/filters.py 92.96% <100.00%> (+0.10%) ⬆️


The hunk under review in h5py/_hl/filters.py:

    if not np.all(np.isfinite(chunks)):
        raise ValueError("Illegal value in chunk tuple")

    # Determine the optimal chunk size in bytes using a PyTables expression.
    # This is kept as a float.
-   dset_size = np.product(chunks)*typesize
+   dset_size = np.product(chunks[~is_unlimited])*typesize
Member

What happens if you do not exclude the unlimited dimensions from this target? This currently produces much smaller chunks than the master branch does.

Author

I thought it would be nice for the chunks to match in this case:

>>> guess_chunk((0, 250, 250, 120), None, 8)
(1, 16, 32, 15)  # master gives (32, 16, 16, 8)
>>> guess_chunk((1, 250, 250, 120), None, 8)
(1, 16, 32, 15)  # master gives (1, 16, 32, 15)

So this way, for large fixed dimensions, adding an unlimited dimension is the same as adding a fixed dimension of size 1 (in terms of the resulting chunks).

Without this, you get:

>>> guess_chunk((0, 250, 250, 120), None, 8)
(1, 32, 63, 30)
>>> guess_chunk((1, 250, 250, 120), None, 8)
(1, 16, 32, 15)
>>> guess_chunk((0,), None, 8)
(1024,)
>>> guess_chunk((0,) * 10, None, 8)
(2, 2, 2, 2, 4, 4, 4, 4, 4, 4)
>>> guess_chunk((0, 1), None, 8)
(1024, 1)
>>> guess_chunk((2,) * 20 + (0,), None, 8)
(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1)

Not bad either, so if you prefer that, we can change it.
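(For context on the diff above: is_unlimited is a per-dimension boolean mask. A minimal sketch of how such a mask could be derived, not necessarily the exact code in this PR; in guess_chunk a dimension of size 0 stands in for an unlimited dimension and is later replaced by the 1024 starting guess.)

import numpy as np

shape = (0, 250, 250, 120)   # 0 marks the unlimited dimension here
is_unlimited = np.array([s == 0 for s in shape])
# The byte-size target then counts only the fixed dimensions, as in the diff:
# dset_size = np.product(chunks[~is_unlimited]) * typesize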

@tacaswell
Member

Note how the last case would lead to a 1024x larger file on master.

I think that is only the case if you put just one element along the unlimited dimension; the more you have, the more the cost gets amortized (but you probably get weird step-ups when you cross into needing another chunk)?

That said, one question I have is what libhdf5 does when you update data in a chunk that is compressed. If it throws out the old chunk and allocates more space at the end, that would lead to a pretty terrible line-painter explosion of file size.

@takluyver
Member

Note how the last case would lead to a 1024x larger file on master.

I think that is only the case if you put just one element along the unlimited dimension; the more you have, the more the cost gets amortized

Yup, I certainly wouldn't consider it a typical use case to define an unlimited dimension and then add only one entry along that dimension, so wasted space when you do that and accept h5py's guessed chunk shape isn't a big concern.

A 21-dimensional dataset is also unrealistic. It's definitely worth exploring how it behaves in extreme cases like this, but I would focus on making good guesses for up to maybe 5-6 dimensions.

The interesting case for this change is patterns in how you read the data. For the use cases I've seen at EuXFEL, it's pretty common to read a full frame for a single point along the unlimited axis - e.g. ds[123, :, :] if the first axis is unlimited. Having a chunk shape like (1, 256, 256) for that example means that reads in that pattern need a single chunk, and then use it entirely, so they're more efficient. A read like ds[:, 0, 0] would be very inefficient with that chunk, though - it would have to load every chunk and then throw away all but one number.
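(A small illustration of the two access patterns described above, assuming a dataset with an unlimited first axis and a chunk shape of (1, 256, 256); the file and dataset names are placeholders:)

import h5py

with h5py.File("frames.h5", "r") as f:
    ds = f["data"]           # e.g. shape (N, 256, 256), chunks (1, 256, 256)
    frame = ds[123, :, :]    # touches exactly one chunk and uses all of it: fast
    trace = ds[:, 0, 0]      # touches every chunk but keeps one value from each: slow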

I'm still somewhat wary about this. Any guess at the right chunk shape is going to be wrong for someone - maybe it's better to keep it consistent than to try to be smarter. Treating all dimensions equally also reduces the risk that the chunking is catastrophically bad for any particular access pattern. On the other hand, this would guess chunks similar to the shapes our datasets use by design. 🤷

That said, one question I have is what libhdf5 does when you update data in a chunk that is compressed. If it throws out the old chunk and allocates more space at the end, that would lead to a pretty terrible line-painter explosion of file size.

I'm not sure exactly what HDF5 does, but I think this is at least mitigated by chunk caching. Ideally, the chunk can live in the cache while you modify it, and only be written out once you're finished. But that assumes the cache is big enough to hold all the chunks you're modifying.

@dionhaefner
Author

dionhaefner commented Jan 11, 2022

That said, one question I have is what libhdf5 does when you update data in a chunk that is compressed. If it throws out the old chunk and allocates more space at the end, that would lead to a pretty terrible line-painter explosion of file size.

Yes, that seems to be what's happening. Not in file size, but in compute! It looks like the space holding the outdated data is reclaimed and re-compressed, leading to a steady increase in the time to write a slice (while also starting 10x slower than with reasonable chunks). I also see a factor-of-10 difference in write speed when not using compression, but there it doesn't get worse over time.

master:

using chunks: (16, 16, 3, 64)
i: 1, size: 13MB, time: 3.044s
i: 2, size: 22MB, time: 3.538s
i: 3, size: 31MB, time: 3.639s
i: 4, size: 39MB, time: 3.792s
i: 5, size: 48MB, time: 3.974s
i: 6, size: 57MB, time: 4.303s
i: 7, size: 66MB, time: 4.594s
i: 8, size: 74MB, time: 4.682s
i: 9, size: 83MB, time: 4.904s
i: 10, size: 92MB, time: 5.142s
i: 11, size: 100MB, time: 5.559s
i: 12, size: 109MB, time: 5.688s
i: 13, size: 118MB, time: 6.008s
i: 14, size: 126MB, time: 6.282s
i: 15, size: 135MB, time: 6.516s
i: 16, size: 144MB, time: 6.720s
i: 17, size: 152MB, time: 7.305s
i: 18, size: 161MB, time: 7.530s
i: 19, size: 170MB, time: 7.666s
i: 20, size: 179MB, time: 8.122s

This PR:

using chunks: (32, 32, 10, 1)
i: 1, size: 8MB, time: 0.327s
i: 2, size: 17MB, time: 0.322s
i: 3, size: 25MB, time: 0.312s
i: 4, size: 34MB, time: 0.308s
i: 5, size: 42MB, time: 0.308s
i: 6, size: 51MB, time: 0.330s
i: 7, size: 60MB, time: 0.320s
i: 8, size: 68MB, time: 0.324s
i: 9, size: 77MB, time: 0.314s
i: 10, size: 85MB, time: 0.327s
i: 11, size: 94MB, time: 0.324s
i: 12, size: 103MB, time: 0.314s
i: 13, size: 111MB, time: 0.320s
i: 14, size: 120MB, time: 0.316s
i: 15, size: 128MB, time: 0.320s
i: 16, size: 137MB, time: 0.323s
i: 17, size: 146MB, time: 0.314s
i: 18, size: 154MB, time: 0.315s
i: 19, size: 163MB, time: 0.312s
i: 20, size: 171MB, time: 0.317s

Test script:

import os
import tempfile
import time

import h5py

import numpy as np


# Last dimension is unlimited (size 0 here, maxshape None); it is grown one slice at a time.
SHAPE = (250, 250, 40, 0)

KWARGS = {
    "compression": "gzip",
}


def write_test_file(outfile):
    with h5py.File(outfile, "w") as f:
        # chunks=True lets h5py's guess_chunk pick the chunk shape.
        d = f.create_dataset(
            "test",
            shape=SHAPE,
            maxshape=tuple(s or None for s in SHAPE),
            chunks=True,
            **KWARGS
        )
        print(f"using chunks: {d.chunks}")

        f.flush()

        for i in range(1, 100):
            # Grow the unlimited axis by one, write a full slice, and time each step.
            start = time.perf_counter()
            d.resize(i, axis=3)
            d[..., i-1] = np.random.rand(*d.shape[:-1])
            f.flush()
            stop = time.perf_counter()

            print(f"i: {i}, size: {get_filesize_mb(outfile)}MB, time: {stop - start:.3f}s")


def get_filesize_mb(infile):
    return os.stat(infile).st_size // 1024 ** 2


if __name__ == "__main__":
    outfile = tempfile.NamedTemporaryFile(delete=False)
    outfile.close()

    try:
        write_test_file(outfile.name)
    finally:
        os.remove(outfile.name)

@dionhaefner
Author

dionhaefner commented Jan 11, 2022

Yup, I certainly wouldn't consider it a typical use case to define an unlimited dimension and then add only one entry along that dimension, so wasted space when you do that and accept h5py's guessed chunk shape isn't a big concern.

I beg to differ. If you have a simulation that writes output at regular intervals, every file starts out with a single entry along the unlimited (time) axis. Sometimes this is amortized by running the simulation longer; often it's not. If you need more convincing, look at the timings above. A 10x difference in write speed surely matters?

The interesting case for this change is patterns in how you read the data. For the use cases I've seen at EuXFEL, it's pretty common to read a full frame for a single point along the unlimited axis - e.g. ds[123, :, :] if the first axis is unlimited. Having a chunk shape like (1, 256, 256) for that example means that reads in that pattern need a single chunk, and then use it entirely, so they're more efficient. A read like ds[:, 0, 0] would be very inefficient with that chunk, though - it would have to load every chunk and then throw away all but one number.

I agree that this is an issue, so the performance implications should be well documented. If the user anticipates that people will read "time series" from their dataset, IMO they should either supply chunks manually or even re-chunk after writing (given that suboptimal chunks can slow you down by a factor of 10).

@takluyver
Member

If you need more convincing, look at the timings above. A 10x difference in write speed surely matters?

Yup, the timings are a more compelling argument to my mind than the possibility of wasted space. But the demo is showing what I'd expect anyway: performance is better when your reads or writes line up nicely with the chunk shape. You're doing writes in a pattern that fits your assumption, so naturally using a chunk shape based on that assumption gives better performance.

The questions that I'm trying to get at are:

  1. How good is the assumption that reads/writes will usually access a small slice of the unlimited dimensions and a large slice of the limited ones? It holds for your use case and (mostly) for the ones I've seen at work, but are there other realistic scenarios which would be made worse? This is hard to answer, because someone negatively affected by a change like this will probably only tell us once we've merged and released it.
  2. How do we balance 'good for the most likely use case' vs 'not terrible for any use case'? You've shown that one particular access pattern is 10x slower than it could be with the current guess. But maybe something that's 10x slower than ideal for any extreme (e.g. 'frame' access vs 'timeseries' access) is better than something that's ideal for one extreme but 100x or 1000x slower for the other? 🤔
  3. How do we balance 'potential improvement' vs 'least surprise'? If someone is implicitly relying on the current chunking heuristic, how cautious should we be about changing it and potentially breaking what they're doing?

the performance implications should be well documented. If the user anticipates that people will read "time series" from their dataset, IMO they should either supply chunks manually or even re-chunk after writing

Documenting things clearly is good, but no matter how well we do that, a lot of users are not going to read it. Especially for something like this - a user might want compression or resizing and not even realise that that means their data will be broken up into chunks, or that h5py is guessing how best to do that. If we think the change in behaviour requires a warning in the docs to think about when you might want to override it, that could be a red flag that we should leave it alone.

I'm not as negative on this change as this all sounds. 😉 I think it's probably an improvement in most cases that would be affected. But I want to think through this stuff, because I've seen enough cases on various open source projects where something that seemed like an improvement turned out to cause another problem, and it only became clear after a release.

@dionhaefner
Author

dionhaefner commented Jan 11, 2022

I agree with your assessment, except this point:

If we think the change in behaviour requires a warning in the docs to think about when you might want to override it, that could be a red flag that we should leave it alone.

IMO it already requires a warning in the current state, just a different one.

Anyhow, we have these arguments in favour:

  • In the best case, 1024 times faster reads (when reading slices along the unlimited axes, which everyone so far has agreed is the most likely use case)
  • Unconditionally faster writes (up to 1024 times faster in the best case; the same performance in the worst case, i.e. when writing nchunk slices at once)
  • Unconditionally smaller files (under the same condition as above)

And these arguments against:

  • In the worst case, 1024 times slower reads (when reading scalar slices across the unlimited axes)
  • Slightly slower reads in the "mixed" case (a partial read / slice of all dimensions)
  • A change will create friction with users who rely on the current behavior (but will also miraculously speed things up for others)

I'm not going to nudge you more than that, if you think it's not worth it that's OK.

@ajelenak
Contributor

The discussion so far illustrates the typical challenges of coming up with an acceptable "I am feeling lucky" chunk shape algorithm.

May I suggest first addressing the 1024*1024 (i.e. CHUNK_MAX) value in the current code? That value seems to be the default dataset chunk cache size, which is both very old and too small for today's computers. Even if someone changed the cache size, the guessing algorithm would not be aware of it and would still produce smaller chunk shapes.

How about we set the dataset chunk cache to 16 MiB? And modify the code to get the cache size, rather than use a hard-coded value?
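(For what it's worth, the file-level chunk cache can already be enlarged when opening a file. A sketch using the 16 MiB figure suggested above; rdcc_nbytes and rdcc_nslots are existing h5py.File parameters, the file name is a placeholder:)

import h5py

# 16 MiB chunk cache instead of the 1 MiB default; rdcc_nslots should be a
# prime roughly 100x the number of chunks that fit in the cache.
with h5py.File("data.h5", "a", rdcc_nbytes=16 * 1024**2, rdcc_nslots=104729) as f:
    ...  # datasets accessed through this file handle use the 16 MiB chunk cache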

@dionhaefner
Author

How about we set the dataset chunk cache to 16 MiB? And modify the code to get the cache size, rather than use a hard-coded value?

To add one data point to this, I did some quick and dirty benchmarks of this vs. the default and found that the default is ~10% faster (h5netcdf/h5netcdf#127 (comment)). It could be that this would be a better default, but to me this sounds like another can of worms that would require a lot of benchmarking across different systems and setups. Also, it doesn't solve the problem that this PR addresses.

@kmuehlbauer
Contributor

kmuehlbauer commented Jan 19, 2022

Thanks for the great discussion. It's really good to change perspective every once in a while.

There might be another option to resolve the problem we have over at h5netcdf/h5netcdf#127 on the h5py-side.

Would it be possible to create a dataset giving chunks like this?

dset = f.create_dataset("chunked", (1, 1000), maxshape=(None, 1000), chunks=(1, None))

guess_chunk could treat the dimensions which are given with a number as preset and only autochunk the remaining (None) dimensions.

h5py/h5py/_hl/filters.py

Lines 247 to 250 in aa31f03

if (chunks is True) or \
   (chunks is None and any((shuffle, fletcher32, compression, maxshape,
                            scaleoffset is not None))):
    chunks = guess_chunk(shape, maxshape, dtype.itemsize)

This would really help. It would be much better than reimplementing yet another autochunker in h5netcdf, which could diverge over time from its parent (see the comment in guess_chunk: "Undocumented and subject to change without warning").
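(A rough sketch of how such a partial chunk spec could be resolved on top of the existing guess_chunk; the helper is hypothetical, not something h5py provides, and for simplicity it ignores the bytes contributed by the preset dimensions:)

from h5py._hl.filters import guess_chunk

def resolve_partial_chunks(shape, maxshape, chunks, typesize):
    """Fill in the None entries of a user-supplied chunk tuple via guess_chunk."""
    free = [i for i, c in enumerate(chunks) if c is None]
    # Autochunk only the unspecified dimensions...
    guessed = guess_chunk(tuple(shape[i] for i in free),
                          tuple(maxshape[i] for i in free), typesize)
    # ...and keep the preset dimensions exactly as given.
    resolved = list(chunks)
    for i, g in zip(free, guessed):
        resolved[i] = g
    return tuple(resolved)

# e.g. resolve_partial_chunks((1, 1000), (None, 1000), (1, None), 8)
#      -> (1, <guessed value for the second axis>)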

@kmuehlbauer
Contributor

@takluyver

I'm not as negative on this change as this all sounds. 😉 I think it's probably an improvement in most cases that would be affected. But I want to think through this stuff, because I've seen enough cases on various open source projects where something that seemed like an improvement turned out to cause another problem, and it only became clear after a release.

I fully agree. An enhancement along the lines of my suggestion above would avoid this problem while giving downstream users more freedom to choose chunk sizes to their liking. I'm just unsure if it makes sense at all.

@tacaswell
Member

guess_chunk could treat the dimensions which are given with a number as preset and only autochunk the remaining (None) dimensions.

This seems like a good idea to me as well.

@ajelenak
Contributor

ajelenak commented Jan 20, 2022

I think it would be really helpful to define some overall goals for the new guess_chunk. For example:

  1. Should it favor data creators or data analysts? (My vote is for the latter.)
  2. Should it take into account the actual size of the dataset's chunk cache and should this default cache be increased? (Yes, yes.) Maybe we should add dataset chunk cache size as a parameter to create_dataset?

Data analysts would benefit from a more balanced chunk shape in all dimensions, as their read access is more random. A larger dataset chunk cache would enable larger chunk sizes and thus larger shapes. Data creators don't care much about holding more than one chunk in the cache because they are writing data out, but a larger cache would allow keeping even those balanced chunks in memory so that they can be filled one slice at a time along some dimension if desired.

Successfully merging this pull request may close these issues.

Default chunksize with unlimited dimensions leads to huge output files
5 participants