
Add H5DOappend functionality to h5py #1890

Open
alexporter8013 opened this issue May 16, 2021 · 9 comments · May be fixed by #1891
@alexporter8013
Contributor

I would like to have the functionality of the [H5DOappend](https://support.hdfgroup.org/HDF5/doc/HL/RM_HDF5Optimized.html#H5DOappend) function added to the h5py library. Usage would be similar to H5DOwrite_chunk and H5DOread_chunk, such that the calls don't enter Python land and the function only appears in h5d.pyx.

This should not be a breaking change, and the C API page mentions it is only available for HDF5 >= 1.10.0.
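
For illustration, usage through the low-level API might look roughly like this (a sketch only: the exact `DatasetID.append` signature is whatever the linked PR settles on, and the `(data, axis, extension)` argument order here simply mirrors the H5DOappend C parameters):

    import numpy as np
    import h5py

    with h5py.File("log.h5", "w") as f:
        # H5DOappend requires a chunked, extendable dataset
        ds = f.create_dataset("data", shape=(0,), maxshape=(None,),
                              chunks=(1024,), dtype="f8")
        buf = np.arange(100, dtype="f8")
        ds.id.append(buf, 0, len(buf))  # proposed low-level call (hypothetical)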

alexporter8013 added a commit to alexporter8013/h5py that referenced this issue May 17, 2021
@takluyver
Member

The H5DO functions are part of the 'high-level library', which unfortunately doesn't have the same nice error reporting that the core HDF5 library has (#1887 is an example of the unhelpful errors from wrapping it). Given that what this does is pretty simple in Python, I wonder if it makes more sense to reimplement it - or even just leave it up to the caller - to avoid confusing error messages.

E.g. for axis 0 (which I imagine is the most common use case), it looks something like this:

    prev_len = ds.shape[0]
    ds.resize(prev_len + len(arr), axis=0)  # grow the dataset along axis 0
    ds[prev_len:] = arr                     # write the new rows into the extension
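
For reference, a version generalized to an arbitrary axis might look like this (a sketch; `append_along` is a made-up helper name, and it assumes `arr` has the same number of dimensions as `ds`):

    import numpy as np

    def append_along(ds, arr, axis=0):
        """Append arr to the resizable dataset ds along the given axis (sketch)."""
        prev_len = ds.shape[axis]
        ds.resize(prev_len + arr.shape[axis], axis=axis)
        sel = [slice(None)] * len(ds.shape)  # select the newly added region
        sel[axis] = slice(prev_len, None)
        ds[tuple(sel)] = arr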

@alexporter8013
Contributor Author

I ran into that kind of error during development, though I think it has more to do with Cython than with HDF5 - but that's a separate conversation.

As far as adding the append function, that's why I didn't expose it in Python land. It's a "use at your own risk" function much like the chunk functions that are only exposed in Cython.

The code you show is indeed pretty much what I'm already doing; the reason for exposing append is mainly to make my code more performant. Going through h5py for those operations puts on a bunch of training wheels, so to speak, to catch different kinds of errors for people who may not know what they're doing. I propose exposing the function in Cython just as an added tool for people, such as me, who think they know what they're doing.

@takluyver
Member

Have you actually measured the speed difference using this wrapped append function rather than doing the resize & write from Python? I would have guessed there wasn't a big performance difference, although I guess there might be significant overhead if you're writing many small bits of data.

To be clear, the method you've added is not only exposed in Cython - it's implemented in Cython and exposed to Python code - like the Python test you wrote. 😉 You've put it in the low-level API, which is more 'manual control', but still part of h5py's public API and (roughly) held to the same quality standards.

Normally, HDF5 returns a status indicating that an error occurred, and stores a string with some detail about what went wrong, which we turn into an exception (actually, it basically stores a stacktrace, but we ignore most of it). The high-level library just indicates that an error occurred, without giving us any message about what that was.

@alexporter8013
Contributor Author

alexporter8013 commented May 19, 2021

I hadn't run benchmarks before, but I just did and it actually speeds things up quite a bit! I will share some of the results below:

My append function:

    def append(self, data: np.ndarray):
        # Valid append conditions:
        # 1. append a single value to a 1-d dset
        # 2. append a 1-d array to a 1-d dset
        # 3. append a 1-d array of matching width to a 2-d dset
        # 4. append a 2-d array to a 2-d dset
        dlen = len(data)
        self.dset.resize(self.index + dlen, axis=0)
        # self.dset[self.index:self.index + dlen, ...] = data   # direct assignment
        dest_slice = np.s_[self.index:self.index + dlen, ...]
        self.dset.write_direct(data, dest_sel=dest_slice)       # write_direct
        # self.dset.id.append(data, 0, dlen)                    # Cython append
        self.index += dlen
        return dlen

As you can see, I use comment switches (is there anything like a simple IFDEF in Python?) to toggle between the different methods; a runtime alternative is sketched below. I run a benchmark that simulates logging (repeated appends) for the application I wrote this code for, and I can enable profiling in the benchmark to help find performance gains. Below are the cumulative times of my append function for three different scenarios: direct assignment on the dataset object, using write_direct, and using the Cython append.
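
There's no preprocessor in Python; the closest equivalent to the comment switches would be selecting the strategy at runtime, e.g. (a sketch; the `mode` parameter and its values are invented for illustration):

    def append(self, data: np.ndarray, mode: str = "write_direct") -> int:
        dlen = len(data)
        self.dset.resize(self.index + dlen, axis=0)
        dest_slice = np.s_[self.index:self.index + dlen, ...]
        if mode == "assign":
            self.dset[dest_slice] = data          # direct assignment
        elif mode == "write_direct":
            self.dset.write_direct(data, dest_sel=dest_slice)
        else:                                     # "append": proposed low-level wrapper
            self.dset.id.append(data, 0, dlen)
        self.index += dlen
        return dlen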

Results

Direct Assignment:

$ poetry run bench-logging 1000 10 100 --local --profile # this is 1000 iterations with 10 datasets of width 100
Starting benchmark
Benchmark complete
Total time 4.04703688621521
Per iteration 0.00404703688621521
Per logging call 0.000404703688621521
1.977 MB/s

...

         1600063 function calls in 4.035 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)

...

    20000    0.172    0.000    3.683    0.000 /home/<redacted>/element_dataset.py:63(append)

Using write_direct:

$ poetry run bench-logging 1000 10 100 --local --profile # same comment as above
Starting benchmark
Benchmark complete
Total time 3.612107515335083
Per iteration 0.003612107515335083
Per logging call 0.0003612107515335083
2.215 MB/s

...

         1180243 function calls (1180223 primitive calls) in 3.600 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)

...

    20000    0.159    0.000    3.243    0.000 /home/<redacted>/element_dataset.py:63(append)

Using append:

$ poetry run bench-logging 1000 10 100 --local --profile
Starting benchmark
Benchmark complete
Total time 1.685542106628418
Per iteration 0.001685542106628418
Per logging call 0.0001685542106628418
4.746 MB/s

...

         230063 function calls in 1.672 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    20000    1.338    0.000    1.346    0.000 /home/<redacted>/element_dataset.py:63(append)

As you can see, using append from h5d.pyx is significantly faster (about 2x) than either of the other two methods.

@alexporter8013
Contributor Author

@takluyver, are there any additional questions that need to be answered to continue progress on this?

alexporter8013 linked a pull request (#1891) Jun 14, 2021 that will close this issue
@tacaswell
Member

I would be interested in seeing the benchmarks without the profiling turned on. Given the added overhead, it is not clear to me whether this is actually measuring a speed-up in working with HDF5 or just measuring the number of Python function calls made.

I also wonder what effect chunk size has on these benchmarks. For the best performance you should try to align your writes with chunk boundaries (and have reasonably sized chunks). It might also be worth buffering the appends into a pre-allocated numpy array until you have a full chunk and then writing that out all at once, as sketched below.
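
A sketch of that buffering idea, assuming a chunked, resizable dataset and appends of one row at a time (the class and method names are invented):

    import numpy as np

    class ChunkBuffer:
        """Collect rows in a pre-allocated array; flush one full chunk at a time."""
        def __init__(self, ds):
            self.ds = ds
            self.chunk_len = ds.chunks[0]  # rows per chunk along axis 0
            self.buf = np.empty((self.chunk_len,) + ds.shape[1:], dtype=ds.dtype)
            self.fill = 0                  # how many rows of buf are used

        def append(self, row):
            self.buf[self.fill] = row
            self.fill += 1
            if self.fill == self.chunk_len:
                self.flush()  # write a whole chunk in one aligned operation

        def flush(self):
            if self.fill:
                end = self.ds.shape[0]
                self.ds.resize(end + self.fill, axis=0)
                self.ds[end:] = self.buf[:self.fill]
                self.fill = 0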

On the general topic of benchmarking, https://matthewrocklin.com/blog/work/2017/03/09/biased-benchmarks is a very good read about why developing unbiased and reliable benchmarks is hard.


Do we have any other parts of the HDF5 "high-level" API exposed (in our "low-level" API)? If not, maybe this should go in a new module as a free function? And if we go down the route of adding some "high-level" functions, should we add them all?

@takluyver
Member

We expose the 'dimension scale' parts already as h5py.h5ds (and also in the h5py high-level API).

@alexporter8013
Contributor Author

For the benchmarks, all parameters were exactly the same for each of the 3 runs, including chunk size and compression. The only change between runs was which of the lines shown in the append member function were active. My benchmarks try to emulate the actual use case, which is a data-logging library developed internally. As a note, the other 2 H5DO functions are also exposed in h5py.

@takluyver
Member

The other two H5DO functions got moved into the main HDF5 library in a later release (H5Dwrite_chunk & H5Dread_chunk), so I wouldn't really count them as part of the high-level library.

I think @tacaswell was asking because the standard profiling tools add overhead to every Python function call executed. .resize() and .write_direct() are both implemented in Python, so the profiling might add more overhead to these than to calling your added .append() method.
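
One way to check would be a plain timing run with the profiler off, e.g. (a sketch; `logger` and `sample` stand in for whatever the benchmark actually uses):

    import timeit

    n = 20000  # match the ncalls from the profiled runs
    t = timeit.timeit(lambda: logger.append(sample), number=n)
    print(f"{t / n * 1e6:.1f} µs per append")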
