Reduce chunksizes for unlimited dimensions first #2036
base: master
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #2036      +/-   ##
==========================================
+ Coverage   89.90%   89.91%    +0.01%
==========================================
  Files          17       17
  Lines        2307     2310        +3
==========================================
+ Hits         2074     2077        +3
  Misses        233      233
```
```diff
 if not np.all(np.isfinite(chunks)):
     raise ValueError("Illegal value in chunk tuple")

 # Determine the optimal chunk size in bytes using a PyTables expression.
 # This is kept as a float.
-dset_size = np.product(chunks)*typesize
+dset_size = np.product(chunks[~is_unlimited])*typesize
```
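For context, a minimal sketch of what the changed line computes - assuming, as the diff suggests, that `is_unlimited` is a boolean mask built from `maxshape` (entries of `None` marking unlimited dimensions); the concrete shapes here are illustrative:

```python
import numpy as np

shape = (0, 250, 250, 120)            # 0 stands in for an unlimited dimension
maxshape = (None, 250, 250, 120)
typesize = 8

# As in guess_chunk, unlimited (size-0) dimensions start from a guess of 1024.
chunks = np.array([x if x != 0 else 1024 for x in shape], dtype='=f8')
is_unlimited = np.array([m is None for m in maxshape])

# Before: every dimension drives the size that triggers chunk reduction.
old_size = np.prod(chunks) * typesize                  # ~6.1e10 bytes
# After: unlimited dimensions are excluded and no longer inflate the target.
new_size = np.prod(chunks[~is_unlimited]) * typesize   # ~6.0e7 bytes
```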
What happens if you do not exclude the unlimited dimensions from this target? This currently produces much smaller chunks than the master branch does.
I thought it would be nice for the chunks to match in this case:
```python
>>> guess_chunk((0, 250, 250, 120), None, 8)
(1, 16, 32, 15)  # master gives (32, 16, 16, 8)
>>> guess_chunk((1, 250, 250, 120), None, 8)
(1, 16, 32, 15)  # master gives (1, 16, 32, 15)
```
So this way, for large fixed dimensions, adding an unlimited dimension is the same as adding a fixed dimension of size 1 (in terms of the resulting chunks).

Without this, you get:
```python
>>> guess_chunk((0, 250, 250, 120), None, 8)
(1, 32, 63, 30)
>>> guess_chunk((1, 250, 250, 120), None, 8)
(1, 16, 32, 15)
>>> guess_chunk((0,), None, 8)
(1024,)
>>> guess_chunk((0,) * 10, None, 8)
(2, 2, 2, 2, 4, 4, 4, 4, 4, 4)
>>> guess_chunk((0, 1), None, 8)
(1024, 1)
>>> guess_chunk((2,) * 20 + (0,), None, 8)
(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1)
```
Not bad either, so if you prefer that, we can change it.
I think that is only the case if you put only one element along the unlimited dimension; the more you have, the more the cost gets amortized (though you probably get odd step-ups when you cross into needing another chunk). That said, one question I have is what libhdf5 does when you update data in a compressed chunk. If it throws out the old chunk and allocates more space at the end, that would lead to a pretty terrible line-painter explosion of file size.
Yup, I certainly wouldn't consider it a typical use case to define an unlimited dimension and then add only one entry along it, so the wasted space if you do that and accept h5py's guess for the chunk shape isn't a big concern. A 21-dimensional dataset is also unrealistic. It's definitely worth exploring how this behaves in extreme cases, but I would focus on making good guesses for up to maybe 5-6 dimensions.

The interesting case for this change is the pattern in which you read the data. For the use cases I've seen at EuXFEL, it's pretty common to read a full frame for a single point along the unlimited axis.

I'm still somewhat wary about this. Any guess at the right chunk shape is going to be wrong for someone - maybe it's better to keep it consistent than to try to be smarter. Treating all dimensions equally also reduces the risk that the chunking is catastrophically bad for any particular access pattern. On the other hand, this would guess chunks similar to the shapes our datasets use by design. 🤷
I'm not sure exactly what HDF5 does, but I think this is at least mitigated by chunk caching. Ideally, the chunk can live in the cache while you modify it, and only be written out once you're finished. But that assumes the cache is big enough to hold all the chunks you're modifying.
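As a concrete illustration of that caveat: h5py exposes HDF5's raw data chunk cache via the `rdcc_*` keyword arguments of `h5py.File`, so a sketch of enlarging it looks like the following (the sizes here are illustrative, not recommendations):

```python
import h5py

# Request a 64 MiB raw data chunk cache per dataset instead of HDF5's
# 1 MiB default, so more chunks can stay cached while being modified.
# rdcc_nslots should be a prime, roughly 100x the number of chunks the
# cache can hold, to limit hash collisions; rdcc_w0 tunes eviction.
f = h5py.File("data.h5", "a",
              rdcc_nbytes=64 * 1024 ** 2,
              rdcc_nslots=12007,
              rdcc_w0=0.75)
```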
```
using chunks: (16, 16, 3, 64)
i: 1, size: 13MB, time: 3.044s
i: 2, size: 22MB, time: 3.538s
i: 3, size: 31MB, time: 3.639s
i: 4, size: 39MB, time: 3.792s
i: 5, size: 48MB, time: 3.974s
i: 6, size: 57MB, time: 4.303s
i: 7, size: 66MB, time: 4.594s
i: 8, size: 74MB, time: 4.682s
i: 9, size: 83MB, time: 4.904s
i: 10, size: 92MB, time: 5.142s
i: 11, size: 100MB, time: 5.559s
i: 12, size: 109MB, time: 5.688s
i: 13, size: 118MB, time: 6.008s
i: 14, size: 126MB, time: 6.282s
i: 15, size: 135MB, time: 6.516s
i: 16, size: 144MB, time: 6.720s
i: 17, size: 152MB, time: 7.305s
i: 18, size: 161MB, time: 7.530s
i: 19, size: 170MB, time: 7.666s
i: 20, size: 179MB, time: 8.122s
```

This PR:
Test script:

```python
import os
import tempfile
import time

import h5py
import numpy as np

SHAPE = (250, 250, 40, 0)
KWARGS = {
    "compression": "gzip",
}


def write_test_file(outfile):
    with h5py.File(outfile, "w") as f:
        d = f.create_dataset(
            "test",
            shape=SHAPE,
            maxshape=tuple(s or None for s in SHAPE),
            chunks=True,
            **KWARGS
        )
        print(f"using chunks: {d.chunks}")
        f.flush()

        for i in range(1, 100):
            start = time.perf_counter()
            d.resize(i, axis=3)
            d[..., i-1] = np.random.rand(*d.shape[:-1])
            f.flush()
            stop = time.perf_counter()
            print(f"i: {i}, size: {get_filesize_mb(outfile)}MB, time: {stop - start:.3f}s")


def get_filesize_mb(infile):
    return os.stat(infile).st_size // 1024 ** 2


if __name__ == "__main__":
    outfile = tempfile.NamedTemporaryFile(delete=False)
    outfile.close()
    try:
        write_test_file(outfile.name)
    finally:
        os.remove(outfile.name)
```
I beg to differ. If you have a simulation that outputs at regular intervals, every file starts out with a single entry along the unlimited (time) axis. Often this is amortized by running the simulation longer; often it's not. If you need more convincing, look at the timings above. A 10x difference in writing surely matters?
I agree that this is an issue, so the performance implications should be well documented. If the user anticipates that people will read "time series" from their dataset, IMO they should either supply chunks manually or even re-chunk after writing (given that suboptimal chunks can slow you down by a factor of 10).
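A minimal sketch of the "re-chunk after writing" route - copying into a new dataset with a hand-picked chunk shape; the file names, dataset name, and chunk shape are illustrative, and the one-shot copy assumes the data fits in memory:

```python
import h5py

with h5py.File("written.h5", "r") as src, h5py.File("rechunked.h5", "w") as dst:
    old = src["test"]
    new = dst.create_dataset(
        "test",
        shape=old.shape,
        dtype=old.dtype,
        chunks=(16, 16, 3, 1),   # e.g. favouring frame-wise reads
        compression="gzip",
    )
    new[...] = old[...]          # for bigger data, copy chunk by chunk instead
```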
Yup, the timings are a more compelling argument to my mind than the possibility of wasted space. But the demo is showing what I'd expect anyway: performance is better when your reads or writes line up nicely with the chunk shape. You're doing writes in a pattern that fits your assumption, so naturally using a chunk shape based on that assumption gives better performance. The questions that I'm trying to get at are:
Documenting things clearly is good, but no matter how well we do that, a lot of users are not going to read it. Especially for something like this - a user might want compression or resizing and not even realise that that means their data will be broken up into chunks, or that h5py is guessing how best to do that. If we think the change in behaviour requires a warning in the docs about when you might want to override it, that could be a red flag that we should leave it alone.

I'm not as negative on this change as all this sounds. 😉 I think it's probably an improvement in most cases that would be affected. But I want to think through this stuff, because I've seen enough cases on various open source projects where something that seemed like an improvement turned out to cause another problem, and it only became clear after a release.
I agree with your assessment, except this point:
IMO it already requires a warning in the current state, just a different one. Anyhow, we have these arguments pro:
And these arguments con:
I'm not going to nudge you more than that; if you think it's not worth it, that's OK.
The discussion so far illustrates the typical challenges of coming up with an acceptable "I am feeling lucky" chunk shape algorithm. May I suggest first addressing the chunk cache? How about we set the dataset chunk cache to 16 MiB, and modify the code to get the cache size rather than use a hard-coded value?
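For reference, a sketch of both halves of that suggestion through h5py - requesting a 16 MiB cache when opening the file, then reading the effective value back via the low-level file access property list rather than hard-coding it:

```python
import h5py

with h5py.File("data.h5", "a", rdcc_nbytes=16 * 1024 ** 2) as f:
    # get_cache() mirrors H5Pget_cache and returns
    # (mdc_nelmts, rdcc_nslots, rdcc_nbytes, rdcc_w0).
    _, nslots, nbytes, w0 = f.id.get_access_plist().get_cache()
    print(f"chunk cache: {nbytes / 1024 ** 2:.0f} MiB, {nslots} slots, w0 = {w0}")
```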
To add one data point to this, I did some quick and dirty benchmarks of this vs. the default and found that the default is ~10% faster (h5netcdf/h5netcdf#127 (comment)). It could be that this would be a better default, but to me this sounds like another can of worms that would require a lot of benchmarking across different systems and setups. Also, it doesn't solve the problem that this PR addresses.
Thanks for the great discussion. It's really good to change perspective every once in a while. There might be another option to resolve the problem we have over at h5netcdf/h5netcdf#127 on the h5py side. Would it be possible to create a dataset giving chunks like this?

```python
dset = f.create_dataset("chunked", (1, 1000), maxshape=(None, 1000), chunks=(1, None))
```
Lines 247 to 250 in aa31f03
This would really help. Much better than reimplementing yet another autochunker in h5netcdf.
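Until an interface like that exists, a downstream workaround along the same lines might call h5py's internal helper directly and then override the unlimited axis by hand - a sketch only, since `guess_chunk` is a private API that may move between releases:

```python
import h5py
from h5py._hl.filters import guess_chunk  # private helper, not a stable API

shape, maxshape = (1, 1000), (None, 1000)
guessed = guess_chunk(shape, maxshape, 8)  # 8 = itemsize, e.g. float64
chunks = (1,) + guessed[1:]                # pin the unlimited axis to 1

with h5py.File("data.h5", "w") as f:
    dset = f.create_dataset("chunked", shape, maxshape=maxshape, chunks=chunks)
```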
I fully agree. An enhancement along the lines of my suggestion above would prevent this problem while giving downstream users more freedom to choose chunk sizes to their liking. I'm just unsure if it makes sense at all.
This seems like a good idea to me as well.
I think it would be really helpful to define some overall goals for the new guess_chunk. For example:
Data analysts would benefit from a chunk shape that is more balanced across all dimensions, as their read access is more random; a larger dataset chunk cache would enable larger chunk sizes and thus shapes. Data creators don't care much for having more than one chunk in the cache because they are writing data out, so a larger cache would allow keeping even those balanced chunks in the cache and filling them "by one" along some dimension if desired.
Fixes #2029. See that issue for discussion and background on this change. In a nutshell, this prevents the presence of unlimited dimensions from blowing up the resulting file size (in cases where the unlimited dimensions stay small).
Some sanity checks that illustrate how the new method works:
Note how the last case would lead to a 1024x larger file on master.
Let me know if you want me to add additional unit tests.