
Add H5DOappend functionality to h5py #1890

Open
alexporter8013 opened this issue May 16, 2021 · 9 comments · May be fixed by #1891
@alexporter8013
Contributor

I would like to have the functionality of the [H5DOappend](https://support.hdfgroup.org/HDF5/doc/HL/RM_HDF5Optimized.html#H5DOappend) function added to the h5py library. Usage would be similar to H5DOwrite_chunk and H5DOread_chunk, such that the calls don't enter Python land and the function only appears in h5d.pyx.

This should not be a breaking change, and the C API page mentions it is only available for HDF5 >= 1.10.0.
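
For illustration, usage through the low-level API might look roughly like this (a sketch only: the exact `DatasetID.append` signature is whatever the linked PR settles on, and the `(data, axis, extension)` argument order here simply mirrors the H5DOappend C parameters):

    import numpy as np
    import h5py

    with h5py.File("log.h5", "w") as f:
        # H5DOappend requires a chunked, extendable dataset
        ds = f.create_dataset("data", shape=(0,), maxshape=(None,),
                              chunks=(1024,), dtype="f8")
        buf = np.arange(100, dtype="f8")
        ds.id.append(buf, 0, len(buf))  # proposed low-level call (hypothetical)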

alexporter8013 added a commit to alexporter8013/h5py that referenced this issue May 17, 2021
@takluyver
Member

The H5DO functions are part of the 'high-level library', which unfortunately doesn't have the same nice error reporting that the core HDF5 library has (#1887 is an example of the unhelpful errors from wrapping it). Given that what this does is pretty simple in Python, I wonder if it makes more sense to reimplement it - or even just leave it up to the caller - to avoid confusing error messages.

E.g. for axis 0 (which I imagine is the most common use case), it looks something like this:

    prev_len = ds.shape[0]
    ds.resize(prev_len + len(arr), axis=0)  # grow the dataset along axis 0
    ds[prev_len:] = arr                     # write the new rows into the extension
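
For reference, a version generalized to an arbitrary axis might look like this (a sketch; `append_along` is a made-up helper name, and it assumes `arr` has the same number of dimensions as `ds`):

    import numpy as np

    def append_along(ds, arr, axis=0):
        """Append arr to the resizable dataset ds along the given axis (sketch)."""
        prev_len = ds.shape[axis]
        ds.resize(prev_len + arr.shape[axis], axis=axis)
        sel = [slice(None)] * len(ds.shape)  # select the newly added region
        sel[axis] = slice(prev_len, None)
        ds[tuple(sel)] = arr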

@alexporter8013
Contributor Author

I ran into that kind of error during development, though I think it has more to do with Cython than with HDF5 - but that's a separate conversation.

As far as adding the append function, that's why I didn't expose it in Python land. It's a "use at your own risk" function much like the chunk functions that are only exposed in Cython.

The code you show is indeed pretty much what I'm already doing; the reason for exposing append is mainly to make my code more performant. Going through h5py for those operations puts on a bunch of training wheels, so to speak, to catch different kinds of errors for people who may not know what they're doing. I propose exposing the function in Cython just as an added tool for people, such as me, who think they know what they're doing.

@takluyver
Member

Have you actually measured the speed difference using this wrapped append function rather than doing the resize & write from Python? I would have guessed there wasn't a big performance difference, although I guess there might be significant overhead if you're writing many small bits of data.

To be clear, the method you've added is not only exposed in Cython - it's implemented in Cython and exposed to Python code - like the Python test you wrote. 😉 You've put it in the low-level API, which is more 'manual control', but still part of h5py's public API and (roughly) held to the same quality standards.

Normally, HDF5 returns a status indicating that an error occurred, and stores a string with some detail about what went wrong, which we turn into an exception (actually, it basically stores a stacktrace, but we ignore most of it). The high-level library just indicates that an error occurred, without giving us any message about what that was.

@alexporter8013
Contributor Author

alexporter8013 commented May 19, 2021

I hadn't run benchmarks before, but I just did and it actually speeds things up quite a bit! I will share some of the results below:

My append function:

    def append(self, data: np.ndarray):
        # Valid append conditions:
        # 1. append a single value to a 1-d dset
        # 2. append a 1-d array to a 1-d dset
        # 3. append a 1-d array of matching width to a 2-d dset
        # 4. append a 2-d array to a 2-d dset
        dlen = len(data)
        self.dset.resize(self.index + dlen, axis=0)
        # self.dset[self.index:self.index + dlen, ...] = data   # direct assignment
        dest_slice = np.s_[self.index:self.index + dlen, ...]
        self.dset.write_direct(data, dest_sel=dest_slice)       # write_direct
        # self.dset.id.append(data, 0, dlen)                    # Cython append
        self.index += dlen
        return dlen

As you can see, I use comment switches (is there anything like a simple IFDEF in Python?) to toggle between the different methods; a runtime alternative is sketched below. I run a benchmark that simulates logging (repeated appends) for the application I wrote this code for, and I can enable profiling in the benchmark to help find performance gains. Below are the cumulative times of my append function for three different scenarios: direct assignment on the dataset object, using write_direct, and using the Cython append.
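
There's no preprocessor in Python; the closest equivalent to the comment switches would be selecting the strategy at runtime, e.g. (a sketch; the `mode` parameter and its values are invented for illustration):

    def append(self, data: np.ndarray, mode: str = "write_direct") -> int:
        dlen = len(data)
        self.dset.resize(self.index + dlen, axis=0)
        dest_slice = np.s_[self.index:self.index + dlen, ...]
        if mode == "assign":
            self.dset[dest_slice] = data          # direct assignment
        elif mode == "write_direct":
            self.dset.write_direct(data, dest_sel=dest_slice)
        else:                                     # "append": proposed low-level wrapper
            self.dset.id.append(data, 0, dlen)
        self.index += dlen
        return dlen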

Results

Direct Assignment:

$ poetry run bench-logging 1000 10 100 --local --profile # this is 1000 iterations with 10 datasets of width 100
Starting benchmark
Benchmark complete
Total time 4.04703688621521
Per iteration 0.00404703688621521
Per logging call 0.000404703688621521
1.977 MB/s

...

         1600063 function calls in 4.035 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)

...

    20000    0.172    0.000    3.683    0.000 /home/<redacted>/element_dataset.py:63(append)

Using write_direct:

$ poetry run bench-logging 1000 10 100 --local --profile # same comment as above
Starting benchmark
Benchmark complete
Total time 3.612107515335083
Per iteration 0.003612107515335083
Per logging call 0.0003612107515335083
2.215 MB/s

...

         1180243 function calls (1180223 primitive calls) in 3.600 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)

...

    20000    0.159    0.000    3.243    0.000 /home/<redacted>/element_dataset.py:63(append)

Using append:

$ poetry run bench-logging 1000 10 100 --local --profile
Starting benchmark
Benchmark complete
Total time 1.685542106628418
Per iteration 0.001685542106628418
Per logging call 0.0001685542106628418
4.746 MB/s

...

         230063 function calls in 1.672 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    20000    1.338    0.000    1.346    0.000 /home/<redacted>/element_dataset.py:63(append)

As you can see, using append from h5d.pyx is significantly faster (about 2x) than either of the other two methods.

@alexporter8013
Contributor Author

@takluyver, are there any additional questions that need to be answered to continue progress on this?

alexporter8013 linked a pull request (#1891) Jun 14, 2021 that will close this issue
@tacaswell
Member

I would be interested in seeing the benchmarks without the profiling turned on. Given the added overhead, it is not clear to me whether this is actually measuring a speed-up in working with HDF5 or just measuring the number of Python function calls made.

I also wonder what effect chunk size has on these benchmarks. For the best performance you should try to align your writes with chunk boundaries (and have reasonably sized chunks). It might also be worth buffering the appends into a pre-allocated numpy array until you have a full chunk and then writing that out all at once, as sketched below.
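
A sketch of that buffering idea, assuming a chunked, resizable dataset and appends of one row at a time (the class and method names are invented):

    import numpy as np

    class ChunkBuffer:
        """Collect rows in a pre-allocated array; flush one full chunk at a time."""
        def __init__(self, ds):
            self.ds = ds
            self.chunk_len = ds.chunks[0]  # rows per chunk along axis 0
            self.buf = np.empty((self.chunk_len,) + ds.shape[1:], dtype=ds.dtype)
            self.fill = 0                  # how many rows of buf are used

        def append(self, row):
            self.buf[self.fill] = row
            self.fill += 1
            if self.fill == self.chunk_len:
                self.flush()  # write a whole chunk in one aligned operation

        def flush(self):
            if self.fill:
                end = self.ds.shape[0]
                self.ds.resize(end + self.fill, axis=0)
                self.ds[end:] = self.buf[:self.fill]
                self.fill = 0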

On the general topic of benchmarking, https://matthewrocklin.com/blog/work/2017/03/09/biased-benchmarks is a very good read about why developing unbiased and reliable benchmarks is hard.


Do we have any other parts of the HDF5 "high-level" API exposed (in our "low-level" API)? If not, maybe this should go in a new module as a free function? And if we go down the route of adding some "high-level" functions, should we add them all?

@takluyver
Member

We expose the 'dimension scale' parts already as h5py.h5ds (and also in the h5py high-level API).

@alexporter8013
Contributor Author

For the benchmarks, all parameters were exactly the same for each of the 3 runs, including chunk size and compression. The only change between runs was which of the lines shown in the append member function were active. My benchmarks try to emulate the actual use case, which is a data-logging library developed internally. As a note, the other 2 H5DO functions are also exposed in h5py.

@takluyver
Member

The other two H5DO functions got moved into the main HDF5 library in a later release (H5Dwrite_chunk & H5Dread_chunk), so I wouldn't really count them as part of the high-level library.

I think @tacaswell was asking because the standard profiling tools add overhead to every Python function call executed. .resize() and .write_direct() are both implemented in Python, so the profiling might add more overhead to these than to calling your added .append() method.
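
One way to check would be a plain timing run with the profiler off, e.g. (a sketch; `logger` and `sample` stand in for whatever the benchmark actually uses):

    import timeit

    n = 20000  # match the ncalls from the profiled runs
    t = timeit.timeit(lambda: logger.append(sample), number=n)
    print(f"{t / n * 1e6:.1f} µs per append")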
