
Add dataset.points accessor #1793

Open

takluyver wants to merge 4 commits into master

Conversation

takluyver
Member

This provides a simple, high-level way to read/write data at a list of coordinates within a dataset:

arr = ds.points[[(0, 0), (1, 0), (2, 2)]]  # Read

ds.points[[(0, 0), (1, 0), (2, 2)]] = [1, 2, 3]  # Write

The points are specified as an (npoints, ndim) array, or something which asarray() will convert to that, like a list of tuples. The core functionality for this (PointSelection) was already there, but not really exposed in the public API (except by creating a boolean array to select points).
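
For comparison, the existing boolean-mask route to the same functionality looks roughly like this (a sketch with a small throwaway dataset):

import numpy as np
import h5py

with h5py.File('example.h5', 'w') as f:
    ds = f.create_dataset('data', (4, 4), dtype='i4')
    ds[...] = np.arange(16).reshape(4, 4)

    # Existing workaround: a boolean mask the size of the whole dataset
    mask = np.zeros(ds.shape, dtype=bool)
    mask[0, 0] = mask[1, 0] = mask[2, 2] = True
    old_way = ds[mask]        # array([ 0,  4, 10])

    # With this PR: pass the coordinates directly
    new_way = ds.points[[(0, 0), (1, 0), (2, 2)]]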

Closes #1602

At the moment, this probably only handles the very simplest cases. I guess I'm hoping for some feedback before thinking about all the corner cases.

@codecov

codecov bot commented Jan 15, 2021

Codecov Report

Merging #1793 (d266559) into master (d570af6) will increase coverage by 0.12%.
The diff coverage is 94.44%.

@@            Coverage Diff             @@
##           master    #1793      +/-   ##
==========================================
+ Coverage   90.06%   90.19%   +0.12%     
==========================================
  Files          17       17              
  Lines        2306     2367      +61     
==========================================
+ Hits         2077     2135      +58     
- Misses        229      232       +3     
Impacted Files Coverage Δ
h5py/_hl/dataset.py 93.02% <94.44%> (-0.12%) ⬇️
h5py/_hl/base.py 95.91% <0.00%> (-0.27%) ⬇️
h5py/_hl/group.py 96.94% <0.00%> (+0.15%) ⬆️
h5py/_hl/files.py 90.38% <0.00%> (+0.22%) ⬆️
h5py/_hl/selections.py 88.00% <0.00%> (+0.50%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d570af6...d266559.

@aragilar
Member

aragilar commented Nov 2, 2021

Should we post this on the HDF5 discourse to get opinions (this looks fine to me, but I don't have an immediate use-case for it)?

@takluyver
Member Author

That's a good idea. I'm in a similar position - I wrote this just because it seemed neat and the low-level functionality for it was already there, but I don't actually have a use for it myself.

@takluyver
Member Author

I made a forum post here: https://forum.hdfgroup.org/t/read-write-specific-coordiantes-in-multi-dimensional-dataset/9137

I suggest that if no-one expresses any interest in a few more months, we close this.

@carloshorn

Hi @takluyver, this is exactly what I am looking for.

Use Case:
I have a land-sea mask (a 2D bool array) on an equally spaced latitude/longitude grid at high resolution, which is therefore stored in compressed 2D chunks.
I want to determine whether a collection of latitude-longitude points lie on land or water using nearest-neighbour interpolation, which for an equally spaced grid is basically just an integer division by the grid spacing to get the array index in each coordinate.
To evaluate the points, I currently have to load a sub-set of the mask into memory (the points are not spread across the entire globe, so I can slice the min:max index in each dimension) and do the fancy indexing there.
This feature would make loading a sub-set obsolete.
The alternative of reading one point at a time is not an option, because it would be far too slow.
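
To make this concrete, the lookup would be roughly the following (a sketch with made-up grid parameters; land_sea is a placeholder name for the mask dataset):

import numpy as np
import h5py

def points_on_land(filename, lats, lons, spacing=0.01, lat0=-90.0, lon0=-180.0):
    # Nearest-neighbour lookup: integer division by the (equal) grid
    # spacing maps each latitude/longitude to its array index
    rows = np.floor((np.asarray(lats) - lat0) / spacing).astype(np.intp)
    cols = np.floor((np.asarray(lons) - lon0) / spacing).astype(np.intp)
    coords = np.column_stack((rows, cols))  # (npoints, 2)
    with h5py.File(filename, 'r') as f:
        # read only the selected points, no min:max sub-set needed
        return f['land_sea'].points[coords]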

@takluyver
Member Author

That does sound like a use case for this. 🎉

I've just resolved the merge conflicts. Can you give this branch a try? Either build from source, or the CI on Azure Pipelines will produce pre-built Linux/Mac/Windows wheels you can download in a few minutes.

In particular, it would be good to see how this compares in speed & memory use against reading a larger sub-set of the mask (as you describe) and reading individual points one by one. Obviously there's no magic - within HDF5, it's still going to have to read and decompress all the chunks that your selected points touch.

@carloshorn

I finished my tests with the pre-built Linux Python 3.9 wheel (a nice feature of your Azure pipeline).

Scenario-1: uniformly distributed points on a subset covering 1/288 of the entire dataset, fully covered by 9 chunks.

N-points   time subset [ms]   time accessor [ms]   memory subset [MiB]   memory accessor [MiB]
1e3        79.9               66.7                 124.66                90.60
1e4        81.9               72.2                 126.00                91.73
1e5        81.2               125                  140.63                106.41
1e6        143                898                  242.53                241.53

Scenario-2: 10 clusters of points on a subset covering 1/288 of the entire dataset, stored in 9 chunks, with each cluster covering 1/144 of the subset.

N-points   time subset [ms]   time accessor [ms]   memory subset [MiB]   memory accessor [MiB]
1e3        71.4               60.3                 114.58                114.62
1e4        74.0               50.4                 114.66                114.66
1e5        84.3               101                  130.65                130.66
1e6        132                471                  206.71                208.71

Scenario-3: 10 clusters of points on a subset covering 1/32 of the entire dataset, stored in 81 chunks, with each cluster covering 1/1296 of the subset.

N-points   time subset [ms]   time accessor [ms]   memory subset [MiB]   memory accessor [MiB]
1e3        723                82.4                 356.08                93.15
1e4        664                126                  297.68                93.53
1e5        671                134                  320.45                101.70
1e6        945                487                  449.56                202.76

Note that I left out single-point extraction, because it already took about 7 seconds for 1000 points. Its memory usage was comparable to the accessor, which I think is not a big surprise.

Scenario-1 shows that for a small number of points, the accessor is much faster and more memory-efficient, which I think comes mostly from not having to create the subset. However, when the size of the points array becomes comparable to the size of the mask, numpy's fancy indexing becomes faster. What I find interesting is that the accessor gets slower for a large number of points, which might be due to the HDF5-internal point-to-chunk matching algorithm (I guess HDF5 either sorts the points and then opens the relevant chunks, which would be an N log N penalty, or, after opening a chunk, checks which other points fall into it, which would become an N*(touched chunks) penalty, but this is pure speculation).

In Scenario-2 we see a result similar to the first scenario, which I guess is related to the small number of chunks. The chunk size was chosen with subset creation in mind, so this is not too surprising to me. The fact that the accessor still gets slower even though not all chunks had to be opened supports my speculation from the first scenario.

In Scenario-3 the subset becomes larger while the number of clusters stays small. We can expect that the number of touched chunks is much smaller than the number of chunks required for the subset. Here, we clearly see the advantage of the points accessor. I would expect similar results in scenario-2 for smaller chunk sizes (currently 2000x2000).


Conclusion: the points accessor introduced in this PR makes it possible to interpolate points on data which does not fit into memory.
For my use case, the extraction of land-sea masks for landmarks, I would also note that this feature allows a different algorithm design. Instead of extracting the mask for each landmark individually, I could directly create the masks for all landmarks seen by the satellite on an orbit at once (an inversion of the loop, similar to vectorization), but I have not yet tried whether this gives an improvement.

@carloshorn

> At the moment, this probably only handles the very simplest cases. I guess I'm hoping for some feedback before thinking about all the corner cases.

The accessor only supports indexes within the bounds of the dataset shape. The commonly used negative indexes are not supported. However, I think this is fine and should remain the user's responsibility; it just needs to be noted in the documentation.
Still, the raised error type could be changed from OSError to IndexError.
If other people want to use this feature like me for interpolation, they will have to worry about handling boundary values anyway, so they will always be able to prepare the points in a way that fulfils this index condition.
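
For illustration, preparing the points could be as simple as (a sketch):

import numpy as np

def check_points(points, shape):
    # Reject out-of-bounds coordinates, including negative indexes,
    # before handing them to the accessor
    points = np.asarray(points)
    if (points < 0).any() or (points >= np.asarray(shape)).any():
        raise IndexError("point coordinates out of bounds")
    return points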

@takluyver
Member Author

Nice, thanks for the really detailed investigation. It looks like it's most valuable when you're selecting a small number of points from across a large part of the data, which I guess makes sense.

I agree that we should ensure that out-of-bounds coordinates give an IndexError.

@takluyver
Member Author

OK, now it should raise IndexError on out-of-bounds indexes.
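
Roughly, the behaviour to expect now (with ds being any 2-D dataset):

ds.points[[(0, 0)]]            # works as before
ds.points[[(ds.shape[0], 0)]]  # now raises IndexError rather than OSError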

For now, I've done this with a tweak to the code that translates errors from HDF5 into Python exceptions, rather than doing our own bounds check. That's slightly more efficient because we're not doing the same check twice, but it does depend on HDF5 using its own error codes consistently. There's a chance it could turn some errors into IndexError that shouldn't be (HDF5 uses the same BADRANGE error number for unrecognised version numbers in stored data, for instance), but the whole error translation is kind of a guessing game anyway. 🤷

I'm on the fence about that decision, so @tacaswell @aragilar if you think we'd be better off duplicating the check so we can be sure to raise an IndexError, I'm happy to make it work that way.

@tacaswell
Member

I'm in favor of doing the translation and risking the exception classes being a bit off.

@carloshorn

I have the impression that something very inefficient happens under the hood when using the point selector.
Here, I would like to share my solution for accessing points, which could be used as a getter.

def read_points(ds, points, output=None):
    # points: an (npoints, ndim) integer array; assumes the dataset shape
    # is an exact multiple of the chunk shape
    length = len(points)
    if output is None:
        output = np.empty(length, dtype=ds.dtype)
    # group by chunks: label each point with the flat index of its chunk
    chunks_shape = tuple(s//c for s, c in zip(ds.shape, ds.chunks))
    chunk_labels = np.ravel_multi_index(
        tuple(points[:, d]//ds.chunks[d] for d in range(ds.ndim)),
        chunks_shape
    )
    # sort the points by chunk so that each chunk is read only once
    sorter = np.argsort(chunk_labels)
    labels, splitter = np.unique(chunk_labels[sorter], return_index=True)
    splitter = splitter.tolist()
    splitter.append(length)
    # load each touched chunk and pick out the points that fall into it
    tmp = np.empty(ds.chunks, dtype=ds.dtype)
    for label, smin, smax in zip(labels, splitter[:-1], splitter[1:]):
        chunk_idx = np.unravel_index(label, chunks_shape)
        lower = np.array([ds.chunks[d]*chunk_idx[d] for d in range(ds.ndim)])
        selection = tuple(
            slice(lower[d], lower[d] + ds.chunks[d])
            for d in range(ds.ndim)
        )
        ds.read_direct(tmp, source_sel=selection)
        index = sorter[smin:smax]
        tmp_idx = points[index] - lower
        output[index] = tmp[tuple(tmp_idx.T)]
    return output
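
For completeness, usage would look something like this (the file and dataset names are just placeholders):

import numpy as np
import h5py

points = np.array([(0, 0), (1024, 2048), (4095, 4095)])  # example coordinates
with h5py.File('test.h5', mode='r') as h5:
    values = read_points(h5['data'], points)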

@takluyver
Member Author

Do you have timings to show how big the difference is? In principle HDF5 should be able to do something very similar internally, but unfortunately it wouldn't be the only case where a Python solution can be faster than letting HDF5 do something.

Your code would only work for chunked datasets, so it would still need the HDF5 mechanism for other types of data storage.
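
Roughly, something like this dispatch would be needed (a sketch reusing your read_points from above; nothing in the PR does this):

import numpy as np

def read_points_any(ds, points):
    points = np.asarray(points)
    if ds.chunks is None:
        # Contiguous/compact storage: fall back to the point selection
        # (the .points accessor added in this PR)
        return ds.points[points]
    # Chunked storage: group the points by chunk and read each chunk once
    return read_points(ds, points)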

@carloshorn

I felt the penalty when I started working with entire orbits, where my previous approach of loading the relevant part using the min:max index no longer worked, because it would basically load the entire dataset. So I went back to the point selector, but had the impression that it is quite slow. Therefore, I came up with the above solution, which has a predictable memory consumption.

In order to have something reproducible to share with you, I did some timings with an artificial data set. The observed trend is the same for my real-world data.

import timeit
import pathlib
import numpy as np
import h5py
import h5py._hl.selections as sel

def read_points_chunked(ds, points, output=None):
    ...  # see above

def read_points_selector(ds, points, output=None):
    length = len(points)
    if output is None:
        output = np.empty(length, dtype=ds.dtype)
    ps = sel.PointSelection(ds.shape, ds.id.get_space(), points)
    ds.read_direct(output, source_sel=ps)
    return output

# create file
chunks = (1024, 1024)
nj_chunks = 16
ni_chunks = 32

path = pathlib.Path('test.h5')
if not path.is_file():
    with h5py.File(path, mode='w') as h5:
        ds = h5.create_dataset(
            'data', (nj_chunks*chunks[0], ni_chunks*chunks[1]), 
            dtype=np.uint16, chunks=chunks, 
            compression="gzip", compression_opts=9
        )
        for j in range(nj_chunks):
            for i in range(ni_chunks):
                jmin = j*chunks[0]
                jmax = jmin + chunks[0]
                imin = i*chunks[1]
                imax = imin + chunks[1]
                ds[jmin:jmax, imin:imax] = np.ravel_multi_index(
                    (j, i), (nj_chunks, ni_chunks)
                )
                
# create points
N = 50000
jj = np.random.randint(8*chunks[0], 11*chunks[0], N)
ii = np.random.randint(17*chunks[1], 20*chunks[1], N)
points = np.column_stack((jj, ii))

# time it
code_chunked = """
with h5py.File(path, mode='r') as h5:
    ds = h5['data']
    output = read_points_chunked(ds, points)
"""
timer = timeit.Timer(code_chunked, globals=globals())
number, _ = timer.autorange()
raw_timings = timer.repeat(5, number)
best_chunked = min(raw_timings) / number

code_selector = """
with h5py.File(path, mode='r') as h5:
    ds = h5['data']
    output = read_points_selector(ds, points)
"""
timer = timeit.Timer(code_selector, globals=globals())
number, _ = timer.autorange()
raw_timings = timer.repeat(5, number)
best_selector = min(raw_timings) / number

# print best results
print(f"Chunked: {best_chunked:.3f} seconds")
print(f"Selector: {best_selector:.3f} seconds")

Here I get the following result:

Chunked: 0.030 seconds
Selector: 1.445 seconds
