Add H5DOappend functionality to h5py #1890
Comments
E.g. for axis 0 (which I imagine is the most common use case), it looks something like this:

```python
prev_len = ds.shape[0]
ds.resize(prev_len + len(arr), axis=0)
ds[prev_len:] = arr
```
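The resize-and-assign pattern above is easy to wrap in a helper. A minimal sketch (the `append_rows` function and the `FakeDataset` stand-in are my own, not from h5py; a real h5py dataset would need `maxshape=(None, ...)` to be resizable):

```python
import numpy as np

def append_rows(ds, arr):
    # The resize-and-assign pattern: grow the dataset along axis 0,
    # then write the new rows into the freshly added tail.
    prev_len = ds.shape[0]
    ds.resize(prev_len + len(arr), axis=0)
    ds[prev_len:] = arr
    return ds.shape[0]

# Duck-typed stand-in for a resizable h5py dataset, so the pattern can
# be demonstrated without creating an HDF5 file (illustrative only).
class FakeDataset:
    def __init__(self, width):
        self.data = np.empty((0, width))

    @property
    def shape(self):
        return self.data.shape

    def resize(self, new_len, axis=0):
        grown = np.empty((new_len,) + self.data.shape[1:])
        grown[:len(self.data)] = self.data
        self.data = grown

    def __setitem__(self, key, value):
        self.data[key] = value

ds = FakeDataset(width=3)
append_rows(ds, np.ones((4, 3)))
new_len = append_rows(ds, np.zeros((2, 3)))  # dataset is now 6 rows long
```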
I ran into that kind of error in my development, though I think that has more to do with Cython than HDF5, but that's a separate conversation. As far as adding the append function, that's why I didn't expose it in Python land. It's a "use at your own risk" function, much like the chunk functions that are only exposed in Cython. The code you show is indeed pretty much what I'm already doing; the reason for exposing append is mainly to make my code more performant. Going through h5py for those operations puts on a bunch of training wheels, so to speak, to catch different kinds of errors for people who may not know what they're doing. I propose exposing the function in Cython just as an added tool for people, such as me, who think they know what they're doing.
Have you actually measured the speed difference using this wrapped append function rather than doing the resize & write from Python? I would have guessed there wasn't a big performance difference, although I guess there might be significant overhead if you're writing many small bits of data.

To be clear, the method you've added is not only exposed in Cython - it's implemented in Cython and exposed to Python code - like the Python test you wrote. 😉 You've put it in the low-level API, which is more 'manual control', but still part of h5py's public API and (roughly) held to the same quality standards.

Normally, HDF5 returns a status indicating that an error occurred, and stores a string with some detail about what went wrong, which we turn into an exception (actually, it basically stores a stacktrace, but we ignore most of it). The high-level library just indicates that an error occurred, without giving us any message about what that was.
I hadn't run benchmarks before, but I just did, and it actually speeds things up quite a bit! I'll share some of the results below.

My append function:

```python
def append(self, data: np.ndarray):
    # valid append conditions:
    # 1. append single value to 1-d dset
    # 2. append 1-d array to 1-d dset
    # 3. append 1-d array of width to 2-d dset
    # 4. append 2-d array to 2-d dset
    dlen = len(data)
    self.dset.resize(self.index + dlen, axis=0)
    # self.dset[self.index:self.index + dlen, ...] = data
    dest_slice = np.s_[self.index:self.index + dlen, ...]
    self.dset.write_direct(data, dest_sel=dest_slice)
    # self.dset.id.append(data, 0, dlen)
    self.index += dlen
    return dlen
```

As you can see, I have comment switches (is there anything like a simple IFDEF in Python?) to select among the different methods. I ran a benchmark that simulates logging (doing repeated appends) for the application I wrote this code for, and I can enable profiling with the benchmark to aid in performance work. Below are the cumulative times of my append function for 3 different scenarios: direct assignment on the dataset object, using write_direct, and using the Cython append.

Results

Direct assignment:
Using write_direct:
Using append:
As you can see, using the append from Cython is the fastest of the three.
@takluyver are there any additional questions that need to be answered to continue progress on this?
I would be interested in seeing the benchmarking without the profiling turned on. Given the added overhead, it is not clear to me whether this is actually measuring a speed-up in working with HDF5 or measuring the number of Python function calls made.

I also wonder what effect chunk size has on these benchmarks? For the best performance you should try to align your writes with chunk boundaries (and have reasonably sized chunks). It might also be worth benchmarking buffering the appends (into a pre-allocated numpy array) until you have a full chunk and then writing that down all at once.

On the general topic of benchmarking, https://matthewrocklin.com/blog/work/2017/03/09/biased-benchmarks is a very good read about why developing unbiased and reliable benchmarks is hard.

Do we have any other parts of the hdf5 "high-level" API exposed (in our "low-level" API)? If not, maybe this should go in a new module as a free function? If we go down the route of adding some "high-level" functions, should we go down the path of adding them all?
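The buffer-until-a-full-chunk idea could look something like this. A sketch only: the `BufferedAppender` class and its `flush` callback are made-up names, and in real use the callback would do the resize + write on an HDF5 dataset; here it is a plain function parameter so the batching logic can be shown without a file:

```python
import numpy as np

class BufferedAppender:
    # Accumulate rows in a pre-allocated buffer sized to one chunk, and
    # hand full chunks to a flush callback in one shot, so each write
    # can align with a chunk boundary instead of straddling chunks.
    def __init__(self, chunk_rows, width, flush):
        self.buf = np.empty((chunk_rows, width))
        self.fill = 0
        self.flush = flush

    def append(self, rows):
        rows = np.atleast_2d(rows)
        i = 0
        while i < len(rows):
            n = min(len(self.buf) - self.fill, len(rows) - i)
            self.buf[self.fill:self.fill + n] = rows[i:i + n]
            self.fill += n
            i += n
            if self.fill == len(self.buf):   # buffer holds a full chunk
                self.flush(self.buf.copy())
                self.fill = 0

    def close(self):
        if self.fill:                        # flush the final partial chunk
            self.flush(self.buf[:self.fill].copy())
            self.fill = 0

blocks = []
appender = BufferedAppender(chunk_rows=2, width=3, flush=blocks.append)
appender.append(np.ones((5, 3)))  # emits two full chunks, buffers 1 row
appender.close()                  # emits the remaining partial chunk
```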
We expose the 'dimension scale' parts already.
For the benchmarks, all parameters were the exact same for each of the 3 runs, including chunk size and compression. The only thing I changed was the lines shown in the append function above.
The other two H5DO functions got moved into the main HDF5 library in a later release (H5Dwrite_chunk & H5Dread_chunk), so I wouldn't really count them as part of the high-level library. I think @tacaswell was asking because the standard profiling tools add an overhead to every Python function call executed.
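That profiler-overhead effect is easy to demonstrate by timing the same call-heavy function with and without cProfile attached. A sketch (the `inc`/`work` names are made up, standing in for the repeated-append benchmark body); the profiled run is typically noticeably slower even though the work is identical, because cProfile intercepts every function call:

```python
import cProfile
import timeit

def inc(x):
    return x + 1

def work():
    # Call-heavy stand-in for a loop of many small appends: cProfile's
    # overhead scales with the call count, not with the real I/O cost.
    total = 0
    for _ in range(50_000):
        total = inc(total)
    return total

# Wall-clock timing without instrumentation
plain = min(timeit.repeat(work, number=1, repeat=5))

# The same work run under the profiler
prof = cProfile.Profile()
profiled = min(timeit.repeat(lambda: prof.runcall(work), number=1, repeat=5))
```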
I would like to have the functionality of the [H5DOappend](https://support.hdfgroup.org/HDF5/doc/HL/RM_HDF5Optimized.html#H5DOappend) function added to the h5py library. Usage would be similar to H5DOwrite_chunk and H5DOread_chunk, such that the calls don't enter Python land and the function only appears in `h5d.pyx`.
This should not be a breaking change, and the C API page mentions it requires HDF5 >= 1.10.0.