Add dataset.points accessor #1793
base: master
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #1793      +/-   ##
==========================================
+ Coverage   90.06%   90.19%   +0.12%
==========================================
  Files          17       17
  Lines        2306     2367      +61
==========================================
+ Hits         2077     2135      +58
- Misses        229      232       +3
```
Continue to review full report at Codecov.
Should we post this on the HDF5 discourse to get opinions (this looks fine to me, but I don't have an immediate use case for it)?
That's a good idea. I'm in a similar position - I wrote this just because it seemed neat and the low-level functionality for it was already there, but I also don't actually have a use for it myself.
I made a forum post here: https://forum.hdfgroup.org/t/read-write-specific-coordiantes-in-multi-dimensional-dataset/9137 I suggest that if no one expresses any interest in a few more months, we close this.
Hi @takluyver, this is exactly what I am looking for. Use case:
That does sound like a use case for this. 🎉 I've just resolved the merge conflicts. Can you give this branch a try? Either build from source, or the CI on Azure Pipelines will produce pre-built Linux/Mac/Windows wheels you can download in a few minutes.

In particular, it would be good to see how this compares in speed & memory use against reading a larger sub-set of the mask (as you describe) and reading individual points one by one. Obviously there's no magic - within HDF5, it's still going to have to read and decompress all the chunks that your selected points touch.
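For reference, the two baselines mentioned above could be sketched roughly like this (the helper names are hypothetical and this is not code from this branch):

```python
import numpy as np

def read_one_by_one(ds, coords):
    # Baseline 1: one HDF5 read per point - simple, but very slow for many points.
    return np.array([ds[tuple(c)] for c in coords])

def read_bounding_box(ds, coords):
    # Baseline 2: read the bounding box covering all points in one go, then
    # fancy-index it in memory. Memory use scales with the box size, not with
    # the number of points.
    coords = np.asarray(coords)
    lower = coords.min(axis=0)
    upper = coords.max(axis=0) + 1
    box = ds[tuple(slice(int(lo), int(hi)) for lo, hi in zip(lower, upper))]
    return box[tuple((coords - lower).T)]
```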
I finished my tests with the pre-built Linux Python 3.9 wheel (nice feature of your Azure pipeline).

- Scenario 1: equally distributed points on a subset which covers 1/288 of the entire dataset and is fully covered by 9 chunks.
- Scenario 2: 10 clusters of points on a subset which covers 1/288 of the entire dataset and is stored in 9 chunks, with clusters of 1/144 of the subset size.
- Scenario 3: 10 clusters of points on a subset which covers 1/32 of the entire dataset and is stored in 81 chunks, with clusters of 1/1296 of the subset size.
Note that I left out the single-point extraction, because it already took about 7 seconds for 1000 points. Its memory usage was comparable with the accessor, which I think is not a big surprise.

Scenario 1 shows that for a small number of points the accessor is much faster and more memory-efficient, which I think comes mostly from not having to create the subset. However, when the size of the points array is comparable to the size of the mask, numpy's fancy indexing becomes faster. Interesting to me is the fact that the accessor gets slower for a large number of points, which might be due to the HDF5-internal point-to-chunk matching algorithm (I guess it either sorts the points and then opens the relevant chunks, which would be an N log N penalty, or, after opening a chunk, it checks which other points fall into that chunk, which would become an N*(touched chunks) penalty - but this is pure speculation).

In Scenario 2 we see a similar result to the first scenario, which I guess is related to the small number of chunks. The chunk size is motivated by the subset creation, so this is not too surprising to me. The fact that not all chunks had to be opened, but the time for the accessor still increases, underlines my speculation from the first scenario.

In Scenario 3 the subset becomes larger while the number of clusters stays small, so we can expect the number of touched chunks to be much smaller than the number of chunks required for the subset. Here we clearly see the advantage of the points accessor. I would expect similar results in Scenario 2 for smaller chunk sizes (currently 2000x2000).

Conclusion: the points accessor introduced in this PR opens the possibility of interpolating points on data which does not fit into memory.
The accessor only supports indexes within the data shape bounds; the commonly used negative indexes are not supported. However, I think this is fine and should remain the user's responsibility. It should just be noted in the documentation.
Nice, thanks for the really detailed investigation. It looks like it's most valuable when you're selecting a small number of points from across a large part of the data, which I guess makes sense. I agree that we should ensure that out-of-bounds coordinates give an IndexError.
OK, now it should raise IndexError on out-of-bounds indexes. For now, I've done this with a tweak to the code that translates errors from HDF5 into Python exceptions, rather than doing our own bounds check. That's slightly more efficient because we're not doing the same check twice, but it does depend on HDF5 using its own error codes consistently. There's a chance it could turn some errors into IndexError that shouldn't be (HDF5 uses the same BADRANGE error number for unrecognised version numbers in stored data, for instance), but the whole error translation is kind of a guessing game anyway. 🤷 I'm on the fence about that decision, so @tacaswell @aragilar if you think we'd be better off duplicating the check so we can be sure to raise an IndexError, I'm happy to make it work that way.
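For context, the "do our own bounds check" alternative mentioned above could be a plain NumPy pre-check along these lines (a sketch only, not code from this branch):

```python
import numpy as np

def check_point_bounds(points, shape):
    # Reject any coordinate outside [0, shape) in any dimension; negative
    # indexes are rejected rather than wrapped, as discussed above.
    points = np.asarray(points)
    if points.ndim != 2 or points.shape[1] != len(shape):
        raise ValueError(f"expected an (npoints, {len(shape)}) coordinate array")
    bounds = np.asarray(shape)
    out_of_range = (points < 0) | (points >= bounds)
    if out_of_range.any():
        bad = points[out_of_range.any(axis=1)][0]
        raise IndexError(f"point {tuple(bad)} is out of bounds for shape {shape}")
```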
I'm in favor of doing the translation and risking the exception classes being a bit off.
I have the impression that something very inefficient happens under the hood when using the point selector. For comparison, here is a manual implementation that groups the points by chunk and reads each touched chunk once:

```python
import numpy as np

def read_points(ds, points, output=None):
    # Read `points` (an (npoints, ndim) integer array) from dataset `ds` by
    # grouping them per chunk and reading each touched chunk with read_direct.
    # Assumes the dataset shape is an exact multiple of the chunk shape.
    length = len(points)
    if output is None:
        output = np.empty(length, dtype=ds.dtype)
    # group by chunks
    chunks_shape = tuple(s // c for s, c in zip(ds.shape, ds.chunks))
    chunk_labels = np.ravel_multi_index(
        tuple(points[:, d] // ds.chunks[d] for d in range(ds.ndim)),
        chunks_shape
    )
    sorter = np.argsort(chunk_labels)
    labels, splitter = np.unique(chunk_labels[sorter], return_index=True)
    splitter = splitter.tolist()
    splitter.append(length)
    # load chunks
    tmp = np.empty(ds.chunks, dtype=ds.dtype)
    for label, smin, smax in zip(labels, splitter[:-1], splitter[1:]):
        chunk_idx = np.unravel_index(label, chunks_shape)
        lower = np.array([ds.chunks[d] * chunk_idx[d] for d in range(ds.ndim)])
        selection = tuple(
            slice(lower[d], lower[d] + ds.chunks[d])
            for d in range(ds.ndim)
        )
        ds.read_direct(tmp, source_sel=selection)
        index = sorter[smin:smax]
        tmp_idx = points[index] - lower
        output[index] = tmp[tuple(tmp_idx.T)]
    return output
```
Do you have timings to show how big the difference is? In principle HDF5 should be able to do something very similar internally, but unfortunately it wouldn't be the only case where a Python solution can be faster than letting HDF5 do something. Your code would only work for chunked datasets, so it would still need the HDF5 mechanism for other types of data storage.
I felt the penalty when I started working with entire orbits, where my previous approach of loading the relevant part using min/max indexes did not work out anymore, because it would basically load the entire dataset. So I went back to the point selector, but I had the impression that it is quite slow. Therefore, I came up with the above solution, which has a predictable memory consumption. To have something reproducible to share with you, I did some timings with an artificial dataset. The observed trend is the same for my real-world data.

```python
import timeit
import pathlib
import numpy as np
import h5py
import h5py._hl.selections as sel


def read_points_chunked(ds, points, output=None):
    ...  # see above


def read_points_selector(ds, points, output=None):
    length = len(points)
    if output is None:
        output = np.empty(length, dtype=ds.dtype)
    ps = sel.PointSelection(ds.shape, ds.id.get_space(), points)
    ds.read_direct(output, source_sel=ps)
    return output


# create file
chunks = (1024, 1024)
nj_chunks = 16
ni_chunks = 32
path = pathlib.Path('test.h5')
if not path.is_file():
    with h5py.File(path, mode='w') as h5:
        ds = h5.create_dataset(
            'data', (nj_chunks*chunks[0], ni_chunks*chunks[1]),
            dtype=np.uint16, chunks=chunks,
            compression="gzip", compression_opts=9
        )
        for j in range(nj_chunks):
            for i in range(ni_chunks):
                jmin = j*chunks[0]
                jmax = jmin + chunks[0]
                imin = i*chunks[1]
                imax = imin + chunks[1]
                ds[jmin:jmax, imin:imax] = np.ravel_multi_index(
                    (j, i), (nj_chunks, ni_chunks)
                )

# create points
N = 50000
jj = np.random.randint(8*chunks[0], 11*chunks[0], N)
ii = np.random.randint(17*chunks[1], 20*chunks[1], N)
points = np.column_stack((jj, ii))

# time it
code_chunked = """
with h5py.File(path, mode='r') as h5:
    ds = h5['data']
    output = read_points_chunked(ds, points)
"""
timer = timeit.Timer(code_chunked, globals=globals())
number, _ = timer.autorange()
raw_timings = timer.repeat(5, number)
best_chunked = min(raw_timings) / number

code_selector = """
with h5py.File(path, mode='r') as h5:
    ds = h5['data']
    output = read_points_selector(ds, points)
"""
timer = timeit.Timer(code_selector, globals=globals())
number, _ = timer.autorange()
raw_timings = timer.repeat(5, number)
best_selector = min(raw_timings) / number

# print best results
print(f"Chunked:  {best_chunked:.3f} seconds")
print(f"Selector: {best_selector:.3f} seconds")
```

Here I get the following result:
This provides a simple, high-level way to read/write data at a list of coordinates within a dataset:
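A minimal usage sketch (the exact indexing spelling here is illustrative - I'm assuming the accessor is indexed with the coordinate array described below):

```python
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    ds = f.create_dataset("data", shape=(100, 100), dtype="f8")

    # Coordinates as an (npoints, ndim) array, or anything asarray() accepts,
    # e.g. a list of (row, column) tuples.
    coords = np.array([[0, 0], [10, 20], [99, 99]])

    # Assumed write syntax: one value per coordinate.
    ds.points[coords] = [1.0, 2.0, 3.0]

    # Assumed read syntax: returns a 1-D array of the selected values.
    print(ds.points[coords])  # -> [1. 2. 3.]
```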
The points are specified as an (npoints, ndim) array, or something which `asarray()` will convert to that, like a list of tuples. The core functionality for this (`PointSelection`) was already there, but not really exposed in the public API (except by creating a boolean array to select points).

Closes #1602
At the moment, this probably only handles the very simplest cases. I guess I'm hoping for some feedback before thinking about all the corner cases.