Add dataset.points accessor #1793
base: master
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #1793      +/-   ##
==========================================
+ Coverage   90.06%   90.19%   +0.12%
==========================================
  Files          17       17
  Lines        2306     2367      +61
==========================================
+ Hits         2077     2135      +58
- Misses        229      232       +3
```
Continue to review full report at Codecov.
Should we post this on the HDF5 discourse to get opinions (this looks fine to me, but I don't have an immediate use case for it)?
That's a good idea. I'm in a similar position - I wrote this just because it seemed neat and the low-level functionality for it was already there, but I also don't actually have a use for it myself.
I made a forum post here: https://forum.hdfgroup.org/t/read-write-specific-coordiantes-in-multi-dimensional-dataset/9137 I suggest that if no one expresses any interest in a few more months, we close this.
Hi @takluyver, this is exactly what I am looking for. Use case:
That does sound like a use case for this. 🎉 I've just resolved the merge conflicts. Can you give this branch a try? Either build from source, or the CI on Azure Pipelines will produce pre-built Linux/Mac/Windows wheels you can download in a few minutes.

In particular, it would be good to see how this compares in speed & memory use against reading a larger sub-set of the mask (as you describe) and reading individual points one by one. Obviously there's no magic - within HDF5, it's still going to have to read and decompress all the chunks that your selected points touch.
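For reference, the two baselines mentioned above could be sketched roughly like this (the helper names are hypothetical and this is not code from this branch):

```python
import numpy as np

def read_one_by_one(ds, coords):
    # Baseline 1: one HDF5 read per point - simple, but very slow for many points.
    return np.array([ds[tuple(c)] for c in coords])

def read_bounding_box(ds, coords):
    # Baseline 2: read the bounding box covering all points in one go, then
    # fancy-index it in memory. Memory use scales with the box size, not with
    # the number of points.
    coords = np.asarray(coords)
    lower = coords.min(axis=0)
    upper = coords.max(axis=0) + 1
    box = ds[tuple(slice(int(lo), int(hi)) for lo, hi in zip(lower, upper))]
    return box[tuple((coords - lower).T)]
```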
I finished my tests with the pre-built Linux Python 3.9 wheel (nice feature of your Azure pipeline).

- Scenario 1: equally distributed points on a subset which covers 1/288 of the entire dataset and is fully covered by 9 chunks.
- Scenario 2: 10 clusters of points on a subset which covers 1/288 of the entire dataset and is stored in 9 chunks, with clusters of 1/144 of the subset size.
- Scenario 3: 10 clusters of points on a subset which covers 1/32 of the entire dataset and is stored in 81 chunks, with clusters of 1/1296 of the subset size.
Note that I left out the single-point extraction, because it already took about 7 seconds for 1000 points. Its memory usage was comparable with the accessor, which I think is not a big surprise.

Scenario 1 shows that for a small number of points the accessor is much faster and more memory-efficient, which I think comes mostly from not having to create the subset. However, when the size of the points array is comparable to the size of the mask, numpy's fancy indexing becomes faster. Interesting to me is the fact that the accessor gets slower for a large number of points, which might be due to the HDF5-internal point-to-chunk matching algorithm (I guess it either sorts the points and then opens the relevant chunks, which would be an N log N penalty, or, after opening a chunk, it checks which other points fall into that chunk, which would become an N*(touched chunks) penalty - but this is pure speculation).

In Scenario 2 we see a similar result to the first scenario, which I guess is related to the small number of chunks. The chunk size is motivated by the subset creation, so this is not too surprising to me. The fact that not all chunks had to be opened, but the time for the accessor still increases, underlines my speculation from the first scenario.

In Scenario 3 the subset becomes larger while the number of clusters stays small, so we can expect the number of touched chunks to be much smaller than the number of chunks required for the subset. Here we clearly see the advantage of the points accessor. I would expect similar results in Scenario 2 for smaller chunk sizes (currently 2000x2000).

Conclusion: the points accessor introduced in this PR opens the possibility of interpolating points on data which does not fit into memory.
The accessor only supports indexes within the data shape bounds; the commonly used negative indexes are not supported. However, I think this is fine and should remain the user's responsibility. It should just be noted in the documentation.
Nice, thanks for the really detailed investigation. It looks like it's most valuable when you're selecting a small number of points from across a large part of the data, which I guess makes sense. I agree that we should ensure that out-of-bounds coordinates give an IndexError.
OK, now it should raise IndexError on out-of-bounds indexes. For now, I've done this with a tweak to the code that translates errors from HDF5 into Python exceptions, rather than doing our own bounds check. That's slightly more efficient because we're not doing the same check twice, but it does depend on HDF5 using its own error codes consistently. There's a chance it could turn some errors into IndexError that shouldn't be (HDF5 uses the same BADRANGE error number for unrecognised version numbers in stored data, for instance), but the whole error translation is kind of a guessing game anyway. 🤷 I'm on the fence about that decision, so @tacaswell @aragilar if you think we'd be better off duplicating the check so we can be sure to raise an IndexError, I'm happy to make it work that way.
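For context, the "do our own bounds check" alternative mentioned above could be a plain NumPy pre-check along these lines (a sketch only, not code from this branch):

```python
import numpy as np

def check_point_bounds(points, shape):
    # Reject any coordinate outside [0, shape) in any dimension; negative
    # indexes are rejected rather than wrapped, as discussed above.
    points = np.asarray(points)
    if points.ndim != 2 or points.shape[1] != len(shape):
        raise ValueError(f"expected an (npoints, {len(shape)}) coordinate array")
    bounds = np.asarray(shape)
    out_of_range = (points < 0) | (points >= bounds)
    if out_of_range.any():
        bad = points[out_of_range.any(axis=1)][0]
        raise IndexError(f"point {tuple(bad)} is out of bounds for shape {shape}")
```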
I'm in favor of doing the translation and risking the exception classes being a bit off.
I have the impression that something very inefficient happens under the hood when using the point selector. For comparison, here is a manual implementation that groups the points by chunk and reads each touched chunk once:

```python
import numpy as np

def read_points(ds, points, output=None):
    # Read `points` (an (npoints, ndim) integer array) from dataset `ds` by
    # grouping them per chunk and reading each touched chunk with read_direct.
    # Assumes the dataset shape is an exact multiple of the chunk shape.
    length = len(points)
    if output is None:
        output = np.empty(length, dtype=ds.dtype)
    # group by chunks
    chunks_shape = tuple(s // c for s, c in zip(ds.shape, ds.chunks))
    chunk_labels = np.ravel_multi_index(
        tuple(points[:, d] // ds.chunks[d] for d in range(ds.ndim)),
        chunks_shape
    )
    sorter = np.argsort(chunk_labels)
    labels, splitter = np.unique(chunk_labels[sorter], return_index=True)
    splitter = splitter.tolist()
    splitter.append(length)
    # load chunks
    tmp = np.empty(ds.chunks, dtype=ds.dtype)
    for label, smin, smax in zip(labels, splitter[:-1], splitter[1:]):
        chunk_idx = np.unravel_index(label, chunks_shape)
        lower = np.array([ds.chunks[d] * chunk_idx[d] for d in range(ds.ndim)])
        selection = tuple(
            slice(lower[d], lower[d] + ds.chunks[d])
            for d in range(ds.ndim)
        )
        ds.read_direct(tmp, source_sel=selection)
        index = sorter[smin:smax]
        tmp_idx = points[index] - lower
        output[index] = tmp[tuple(tmp_idx.T)]
    return output
```
Do you have timings to show how big the difference is? In principle HDF5 should be able to do something very similar internally, but unfortunately it wouldn't be the only case where a Python solution can be faster than letting HDF5 do something. Your code would only work for chunked datasets, so it would still need the HDF5 mechanism for other types of data storage.
I felt the penalty when I started working with entire orbits, where my previous approach of loading the relevant part using min/max indexes did not work out anymore, because it would basically load the entire dataset. So I went back to the point selector, but I had the impression that it is quite slow. Therefore, I came up with the above solution, which has a predictable memory consumption. To have something reproducible to share with you, I did some timings with an artificial dataset. The observed trend is the same for my real-world data.

```python
import timeit
import pathlib
import numpy as np
import h5py
import h5py._hl.selections as sel


def read_points_chunked(ds, points, output=None):
    ...  # see above


def read_points_selector(ds, points, output=None):
    length = len(points)
    if output is None:
        output = np.empty(length, dtype=ds.dtype)
    ps = sel.PointSelection(ds.shape, ds.id.get_space(), points)
    ds.read_direct(output, source_sel=ps)
    return output


# create file
chunks = (1024, 1024)
nj_chunks = 16
ni_chunks = 32
path = pathlib.Path('test.h5')
if not path.is_file():
    with h5py.File(path, mode='w') as h5:
        ds = h5.create_dataset(
            'data', (nj_chunks*chunks[0], ni_chunks*chunks[1]),
            dtype=np.uint16, chunks=chunks,
            compression="gzip", compression_opts=9
        )
        for j in range(nj_chunks):
            for i in range(ni_chunks):
                jmin = j*chunks[0]
                jmax = jmin + chunks[0]
                imin = i*chunks[1]
                imax = imin + chunks[1]
                ds[jmin:jmax, imin:imax] = np.ravel_multi_index(
                    (j, i), (nj_chunks, ni_chunks)
                )

# create points
N = 50000
jj = np.random.randint(8*chunks[0], 11*chunks[0], N)
ii = np.random.randint(17*chunks[1], 20*chunks[1], N)
points = np.column_stack((jj, ii))

# time it
code_chunked = """
with h5py.File(path, mode='r') as h5:
    ds = h5['data']
    output = read_points_chunked(ds, points)
"""
timer = timeit.Timer(code_chunked, globals=globals())
number, _ = timer.autorange()
raw_timings = timer.repeat(5, number)
best_chunked = min(raw_timings) / number

code_selector = """
with h5py.File(path, mode='r') as h5:
    ds = h5['data']
    output = read_points_selector(ds, points)
"""
timer = timeit.Timer(code_selector, globals=globals())
number, _ = timer.autorange()
raw_timings = timer.repeat(5, number)
best_selector = min(raw_timings) / number

# print best results
print(f"Chunked:  {best_chunked:.3f} seconds")
print(f"Selector: {best_selector:.3f} seconds")
```

Here I get the following result:
This provides a simple, high-level way to read/write data at a list of coordinates within a dataset:
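A minimal usage sketch (the exact indexing spelling here is illustrative - I'm assuming the accessor is indexed with the coordinate array described below):

```python
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    ds = f.create_dataset("data", shape=(100, 100), dtype="f8")

    # Coordinates as an (npoints, ndim) array, or anything asarray() accepts,
    # e.g. a list of (row, column) tuples.
    coords = np.array([[0, 0], [10, 20], [99, 99]])

    # Assumed write syntax: one value per coordinate.
    ds.points[coords] = [1.0, 2.0, 3.0]

    # Assumed read syntax: returns a 1-D array of the selected values.
    print(ds.points[coords])  # -> [1. 2. 3.]
```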
The points are specified as an (npoints, ndim) array, or something which `asarray()` will convert to that, like a list of tuples. The core functionality for this (`PointSelection`) was already there, but not really exposed in the public API (except by creating a boolean array to select points).

Closes #1602
At the moment, this probably only handles the very simplest cases. I guess I'm hoping for some feedback before thinking about all the corner cases.