Reduce chunksizes for unlimited dimensions first #2036

Open: wants to merge 1 commit into master

Conversation


@dionhaefner dionhaefner commented Jan 10, 2022

Fixes #2029; see that issue for discussion and background on this change. In a nutshell, this prevents the presence of unlimited dimensions from blowing up the resulting file sizes (in cases where the unlimited dimensions stay small).

Some sanity checks that illustrate how the new method works:

>>> from h5py._hl.filters import guess_chunk
>>> guess_chunk((0, 250, 250, 120), None, 8)
(1, 16, 32, 15)  # master gives (32, 16, 16, 8)
>>> guess_chunk((1, 250, 250, 120), None, 8)
(1, 16, 32, 15)  # master gives (1, 16, 32, 15)
>>> guess_chunk((0,), None, 8)
(1024,)  # master gives (1024,)
>>> guess_chunk((0,) * 10, None, 8)
(2, 2, 2, 2, 2, 2, 2, 2, 2, 2)  # master gives (2, 2, 2, 2, 4, 4, 4, 4, 4, 4)
>>> guess_chunk((0, 1), None, 8)
(1024, 1)  # master gives (1024, 1)
>>> guess_chunk((2,) * 20 + (0,), None, 8)
(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1)
# master gives (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1024)

Note how the last case would lead to a 1024x larger file on master (for uncompressed data): with only one entry written along the unlimited axis, each chunk still allocates 1024 elements along it.

Let me know if you want me to add additional unit tests.
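For reference, here is how the heuristic is triggered end-to-end through the public API (a minimal sketch; the file and dataset names are arbitrary, and the printed chunks are the values this branch is expected to produce):

import h5py

with h5py.File("demo.h5", "w") as f:
    # An unlimited first dimension (maxshape entry None) plus chunks=True
    # makes h5py call guess_chunk to pick the chunk shape automatically.
    d = f.create_dataset(
        "data",
        shape=(0, 250, 250, 120),
        maxshape=(None, 250, 250, 120),
        dtype="f8",
        chunks=True,
    )
    print(d.chunks)  # (1, 16, 32, 15) on this branch, (32, 16, 16, 8) on master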

@codecov

codecov bot commented Jan 10, 2022

Codecov Report

Merging #2036 (2930f91) into master (aa31f03) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #2036      +/-   ##
==========================================
+ Coverage   89.90%   89.91%   +0.01%     
==========================================
  Files          17       17              
  Lines        2307     2310       +3     
==========================================
+ Hits         2074     2077       +3     
  Misses        233      233              
Impacted Files Coverage Δ
h5py/_hl/filters.py 92.96% <100.00%> (+0.10%) ⬆️


The hunk under review in h5py/_hl/filters.py:

    if not np.all(np.isfinite(chunks)):
        raise ValueError("Illegal value in chunk tuple")

    # Determine the optimal chunk size in bytes using a PyTables expression.
    # This is kept as a float.
-   dset_size = np.product(chunks)*typesize
+   dset_size = np.product(chunks[~is_unlimited])*typesize
Member

What happens if you do not exclude the unlimited dimensions from this target? This currently produces much smaller chunks than the master branch does.

Author

I thought it would be nice for the chunks to match in this case:

>>> guess_chunk((0, 250, 250, 120), None, 8)
(1, 16, 32, 15)  # master gives (32, 16, 16, 8)
>>> guess_chunk((1, 250, 250, 120), None, 8)
(1, 16, 32, 15)  # master gives (1, 16, 32, 15)

So this way, for large fixed dimensions, adding an unlimited dimension is the same as adding a fixed dimension of size 1 (in terms of the resulting chunks).

Without this, you get:

>>> guess_chunk((0, 250, 250, 120), None, 8)
(1, 32, 63, 30)
>>> guess_chunk((1, 250, 250, 120), None, 8)
(1, 16, 32, 15)
>>> guess_chunk((0,), None, 8)
(1024,)
>>> guess_chunk((0,) * 10, None, 8)
(2, 2, 2, 2, 4, 4, 4, 4, 4, 4)
>>> guess_chunk((0, 1), None, 8)
(1024, 1)
>>> guess_chunk((2,) * 20 + (0,), None, 8)
(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1)

Not bad either, so if you prefer that, we can change it.
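(For context on the diff above: is_unlimited is a per-dimension boolean mask. A minimal sketch of how such a mask could be derived, not necessarily the exact code in this PR; in guess_chunk a dimension of size 0 stands in for an unlimited dimension and is later replaced by the 1024 starting guess.)

import numpy as np

shape = (0, 250, 250, 120)   # 0 marks the unlimited dimension here
is_unlimited = np.array([s == 0 for s in shape])
# The byte-size target then counts only the fixed dimensions, as in the diff:
# dset_size = np.product(chunks[~is_unlimited]) * typesize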

@tacaswell
Member

Note how the last case would lead to a 1024x larger file on master.

I think that is only the case if you put just one element along the unlimited dimension; the more you have, the more the cost gets amortized (but you probably get weird step-ups when you cross into needing another chunk)?

That said, one question I have is what libhdf5 does when you update data in a chunk that is compressed. If it throws out the old chunk and allocates more space at the end, that would lead to a pretty terrible line-painter explosion of file size.

@takluyver
Member

Note how the last case would lead to a 1024x larger file on master.

I think that is only the case if you put just one element along the unlimited dimension; the more you have, the more the cost gets amortized

Yup, I certainly wouldn't consider it a typical use case to define an unlimited dimension and then add only one entry along that dimension, so wasted space when you do that and accept h5py's guessed chunk shape isn't a big concern.

A 21-dimensional dataset is also unrealistic. It's definitely worth exploring how it behaves in extreme cases like this, but I would focus on making good guesses for up to maybe 5-6 dimensions.

The interesting case for this change is patterns in how you read the data. For the use cases I've seen at EuXFEL, it's pretty common to read a full frame for a single point along the unlimited axis - e.g. ds[123, :, :] if the first axis is unlimited. Having a chunk shape like (1, 256, 256) for that example means that reads in that pattern need a single chunk, and then use it entirely, so they're more efficient. A read like ds[:, 0, 0] would be very inefficient with that chunk, though - it would have to load every chunk and then throw away all but one number.
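(A small illustration of the two access patterns described above, assuming a dataset with an unlimited first axis and a chunk shape of (1, 256, 256); the file and dataset names are placeholders:)

import h5py

with h5py.File("frames.h5", "r") as f:
    ds = f["data"]           # e.g. shape (N, 256, 256), chunks (1, 256, 256)
    frame = ds[123, :, :]    # touches exactly one chunk and uses all of it: fast
    trace = ds[:, 0, 0]      # touches every chunk but keeps one value from each: slow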

I'm still somewhat wary about this. Any guess at the right chunk shape is going to be wrong for someone - maybe it's better to keep it consistent than to try to be smarter. Treating all dimensions equally also reduces the risk that the chunking is catastrophically bad for any particular access pattern. On the other hand, this would guess chunks similar to the shapes our datasets use by design. 🤷

That said, one question I have is what libhdf5 does when you update data in a chunk that is compressed. If it throws out the old chunk and allocates more space at the end, that would lead to a pretty terrible line-painter explosion of file size.

I'm not sure exactly what HDF5 does, but I think this is at least mitigated by chunk caching. Ideally, the chunk can live in the cache while you modify it, and only be written out once you're finished. But that assumes the cache is big enough to hold all the chunks you're modifying.

@dionhaefner
Author

dionhaefner commented Jan 11, 2022

That said, one question I have is what libhdf5 does when you update data in a chunk that is compressed. If it throws out the old chunk and allocates more space at the end, that would lead to a pretty terrible line-painter explosion of file size.

Yes, that seems to be what's happening. Not in file size, but in compute! It looks like the space holding the outdated data is reclaimed and re-compressed, leading to a steady increase in the time to write a slice (while also starting 10x slower than with reasonable chunks). I also see a factor-of-10 difference in write speed when not using compression, but there it doesn't get worse over time.

master:

using chunks: (16, 16, 3, 64)
i: 1, size: 13MB, time: 3.044s
i: 2, size: 22MB, time: 3.538s
i: 3, size: 31MB, time: 3.639s
i: 4, size: 39MB, time: 3.792s
i: 5, size: 48MB, time: 3.974s
i: 6, size: 57MB, time: 4.303s
i: 7, size: 66MB, time: 4.594s
i: 8, size: 74MB, time: 4.682s
i: 9, size: 83MB, time: 4.904s
i: 10, size: 92MB, time: 5.142s
i: 11, size: 100MB, time: 5.559s
i: 12, size: 109MB, time: 5.688s
i: 13, size: 118MB, time: 6.008s
i: 14, size: 126MB, time: 6.282s
i: 15, size: 135MB, time: 6.516s
i: 16, size: 144MB, time: 6.720s
i: 17, size: 152MB, time: 7.305s
i: 18, size: 161MB, time: 7.530s
i: 19, size: 170MB, time: 7.666s
i: 20, size: 179MB, time: 8.122s

This PR:

using chunks: (32, 32, 10, 1)
i: 1, size: 8MB, time: 0.327s
i: 2, size: 17MB, time: 0.322s
i: 3, size: 25MB, time: 0.312s
i: 4, size: 34MB, time: 0.308s
i: 5, size: 42MB, time: 0.308s
i: 6, size: 51MB, time: 0.330s
i: 7, size: 60MB, time: 0.320s
i: 8, size: 68MB, time: 0.324s
i: 9, size: 77MB, time: 0.314s
i: 10, size: 85MB, time: 0.327s
i: 11, size: 94MB, time: 0.324s
i: 12, size: 103MB, time: 0.314s
i: 13, size: 111MB, time: 0.320s
i: 14, size: 120MB, time: 0.316s
i: 15, size: 128MB, time: 0.320s
i: 16, size: 137MB, time: 0.323s
i: 17, size: 146MB, time: 0.314s
i: 18, size: 154MB, time: 0.315s
i: 19, size: 163MB, time: 0.312s
i: 20, size: 171MB, time: 0.317s

Test script:

import os
import tempfile
import time

import h5py

import numpy as np


# Last dimension is unlimited (size 0 here, maxshape None); it is grown one slice at a time.
SHAPE = (250, 250, 40, 0)

KWARGS = {
    "compression": "gzip",
}


def write_test_file(outfile):
    with h5py.File(outfile, "w") as f:
        # chunks=True lets h5py's guess_chunk pick the chunk shape.
        d = f.create_dataset(
            "test",
            shape=SHAPE,
            maxshape=tuple(s or None for s in SHAPE),
            chunks=True,
            **KWARGS
        )
        print(f"using chunks: {d.chunks}")

        f.flush()

        for i in range(1, 100):
            # Grow the unlimited axis by one, write a full slice, and time each step.
            start = time.perf_counter()
            d.resize(i, axis=3)
            d[..., i-1] = np.random.rand(*d.shape[:-1])
            f.flush()
            stop = time.perf_counter()

            print(f"i: {i}, size: {get_filesize_mb(outfile)}MB, time: {stop - start:.3f}s")


def get_filesize_mb(infile):
    return os.stat(infile).st_size // 1024 ** 2


if __name__ == "__main__":
    outfile = tempfile.NamedTemporaryFile(delete=False)
    outfile.close()

    try:
        write_test_file(outfile.name)
    finally:
        os.remove(outfile.name)

@dionhaefner
Author

dionhaefner commented Jan 11, 2022

Yup, I certainly wouldn't consider it a typical use case to define an unlimited dimension and then add only one entry along that dimension, so wasted space when you do that and accept h5py's guessed chunk shape isn't a big concern.

I beg to differ. If you have a simulation that writes output at regular intervals, every file starts out with a single entry along the unlimited (time) axis. Sometimes this is amortized by running the simulation longer; often it's not. If you need more convincing, look at the timings above. A 10x difference in write speed surely matters?

The interesting case for this change is patterns in how you read the data. For the use cases I've seen at EuXFEL, it's pretty common to read a full frame for a single point along the unlimited axis - e.g. ds[123, :, :] if the first axis is unlimited. Having a chunk shape like (1, 256, 256) for that example means that reads in that pattern need a single chunk, and then use it entirely, so they're more efficient. A read like ds[:, 0, 0] would be very inefficient with that chunk, though - it would have to load every chunk and then throw away all but one number.

I agree that this is an issue, so the performance implications should be well documented. If the user anticipates that people will read "time series" from their dataset, IMO they should either supply chunks manually or even re-chunk after writing (given that suboptimal chunks can slow you down by a factor of 10).

@takluyver
Member

If you need more convincing, look at the timings above. A 10x difference in write speed surely matters?

Yup, the timings are a more compelling argument to my mind than the possibility of wasted space. But the demo is showing what I'd expect anyway: performance is better when your reads or writes line up nicely with the chunk shape. You're doing writes in a pattern that fits your assumption, so naturally using a chunk shape based on that assumption gives better performance.

The questions that I'm trying to get at are:

  1. How good is the assumption that reads/writes will usually access a small slice of the unlimited dimensions and a large slice of the limited ones? It holds for your use case and (mostly) for the ones I've seen at work, but are there other realistic scenarios which would be made worse? This is hard to answer, because someone negatively affected by a change like this will probably only tell us once we've merged and released it.
  2. How do we balance 'good for the most likely use case' vs 'not terrible for any use case'? You've shown that one particular access pattern is 10x slower than it could be with the current guess. But maybe something that's 10x slower than ideal for any extreme (e.g. 'frame' access vs 'timeseries' access) is better than something that's ideal for one extreme but 100x or 1000x slower for the other? 🤔
  3. How do we balance 'potential improvement' vs 'least surprise'? If someone is implicitly relying on the current chunking heuristic, how cautious should we be about changing it and potentially breaking what they're doing?

the performance implications should be well documented. If the user anticipates that people will read "time series" from their dataset, IMO they should either supply chunks manually or even re-chunk after writing

Documenting things clearly is good, but no matter how well we do that, a lot of users are not going to read it. Especially for something like this - a user might want compression or resizing and not even realise that that means their data will be broken up into chunks, or that h5py is guessing how best to do that. If we think the change in behaviour requires a warning in the docs to think about when you might want to override it, that could be a red flag that we should leave it alone.

I'm not as negative on this change as this all sounds. 😉 I think it's probably an improvement in most cases that would be affected. But I want to think through this stuff, because I've seen enough cases on various open source projects where something that seemed like an improvement turned out to cause another problem, and it only became clear after a release.

@dionhaefner
Author

dionhaefner commented Jan 11, 2022

I agree with your assessment, except this point:

If we think the change in behaviour requires a warning in the docs to think about when you might want to override it, that could be a red flag that we should leave it alone.

IMO it already requires a warning in the current state, just a different one.

Anyhow, we have these arguments in favour:

  • In the best case, 1024 times faster reads (when reading slices along the unlimited axes, which everyone so far has agreed is the most likely use case)
  • Unconditionally faster writes (up to 1024 times faster in the best case; the same performance in the worst case, i.e. when writing nchunk slices at once)
  • Unconditionally smaller files (under the same condition as above)

And these arguments against:

  • In the worst case, 1024 times slower reads (when reading scalar slices across the unlimited axes)
  • Slightly slower reads in the "mixed" case (a partial read / slice of all dimensions)
  • A change will create friction with users who rely on the current behavior (but will also miraculously speed things up for others)

I'm not going to nudge you more than that, if you think it's not worth it that's OK.

@ajelenak
Contributor

The discussion so far illustrates the typical challenges of coming up with an acceptable "I am feeling lucky" chunk shape algorithm.

May I suggest first addressing the 1024*1024 (i.e. CHUNK_MAX) value in the current code? That value seems to be the default dataset chunk cache size, which is both very old and too small for today's computers. Even if someone changed the cache size, the guessing algorithm would not be aware of it and would still produce smaller chunk shapes.

How about we set the dataset chunk cache to 16 MiB? And modify the code to get the cache size, rather than use a hard-coded value?
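(For what it's worth, the file-level chunk cache can already be enlarged when opening a file. A sketch using the 16 MiB figure suggested above; rdcc_nbytes and rdcc_nslots are existing h5py.File parameters, the file name is a placeholder:)

import h5py

# 16 MiB chunk cache instead of the 1 MiB default; rdcc_nslots should be a
# prime roughly 100x the number of chunks that fit in the cache.
with h5py.File("data.h5", "a", rdcc_nbytes=16 * 1024**2, rdcc_nslots=104729) as f:
    ...  # datasets accessed through this file handle use the 16 MiB chunk cache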

@dionhaefner
Author

How about we set the dataset chunk cache to 16 MiB? And modify the code to get the cache size, rather than use a hard-coded value?

To add one data point to this, I did some quick and dirty benchmarks of this vs. the default and found that the default is ~10% faster (h5netcdf/h5netcdf#127 (comment)). It could be that this would be a better default, but to me this sounds like another can of worms that would require a lot of benchmarking across different systems and setups. Also, it doesn't solve the problem that this PR addresses.

@kmuehlbauer
Contributor

kmuehlbauer commented Jan 19, 2022

Thanks for the great discussion. It's really good to change perspective every once in a while.

There might be another option to resolve the problem we have over at h5netcdf/h5netcdf#127 on the h5py-side.

Would it be possible to create a dataset giving chunks like this?

dset = f.create_dataset("chunked", (1, 1000), maxshape=(None, 1000), chunks=(1, None))

guess_chunk could treat the dimensions which are given with a number as preset and only autochunk the remaining (None) dimensions.

h5py/h5py/_hl/filters.py

Lines 247 to 250 in aa31f03

if (chunks is True) or \
   (chunks is None and any((shuffle, fletcher32, compression, maxshape,
                            scaleoffset is not None))):
    chunks = guess_chunk(shape, maxshape, dtype.itemsize)

This would really help. It would be much better than reimplementing yet another autochunker in h5netcdf, which could diverge over time from its parent (see the comment in guess_chunk: "Undocumented and subject to change without warning").
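(A rough sketch of how such a partial chunk spec could be resolved on top of the existing guess_chunk; the helper is hypothetical, not something h5py provides, and for simplicity it ignores the bytes contributed by the preset dimensions:)

from h5py._hl.filters import guess_chunk

def resolve_partial_chunks(shape, maxshape, chunks, typesize):
    """Fill in the None entries of a user-supplied chunk tuple via guess_chunk."""
    free = [i for i, c in enumerate(chunks) if c is None]
    # Autochunk only the unspecified dimensions...
    guessed = guess_chunk(tuple(shape[i] for i in free),
                          tuple(maxshape[i] for i in free), typesize)
    # ...and keep the preset dimensions exactly as given.
    resolved = list(chunks)
    for i, g in zip(free, guessed):
        resolved[i] = g
    return tuple(resolved)

# e.g. resolve_partial_chunks((1, 1000), (None, 1000), (1, None), 8)
#      -> (1, <guessed value for the second axis>)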

@kmuehlbauer
Contributor

@takluyver

I'm not as negative on this change as this all sounds. 😉 I think it's probably an improvement in most cases that would be affected. But I want to think through this stuff, because I've seen enough cases on various open source projects where something that seemed like an improvement turned out to cause another problem, and it only became clear after a release.

I fully agree. An enhancement along the lines of my suggestion above would avoid this problem while giving downstream users more freedom to choose chunk sizes to their liking. I'm just unsure if it makes sense at all.

@tacaswell
Member

guess_chunk could treat the dimensions which are given with a number as preset and only autochunk the remaining (None) dimensions.

This seems like a good idea to me as well.

@ajelenak
Contributor

ajelenak commented Jan 20, 2022

I think it would be really helpful to define some overall goals for the new guess_chunk. For example:

  1. Should it favor data creators or data analysts? (My vote is for the latter.)
  2. Should it take into account the actual size of the dataset's chunk cache and should this default cache be increased? (Yes, yes.) Maybe we should add dataset chunk cache size as a parameter to create_dataset?

Data analysts would benefit from a more balanced chunk shape in all dimensions, as their read access is more random. A larger dataset chunk cache would enable larger chunk sizes and thus larger shapes. Data creators don't care much about holding more than one chunk in the cache because they are writing data out, but a larger cache would allow keeping even those balanced chunks in memory so that they can be filled one slice at a time along some dimension if desired.

Successfully merging this pull request may close these issues.

Default chunksize with unlimited dimensions leads to huge output files
5 participants