Add api to h5py.File for reading and writing userblock. #2359
Comments
I don't think HDF5 itself has APIs for this - do you know otherwise? Reading the userblock should be straightforward enough, with the caveat that it's probably only for 'regular' HDF5 files, not those spread over multiple files or accessed on a remote server. Writing it is harder. Your implementation details imply that the file contents are constructed in memory and only written when you close the file, but that's not how HDF5 works - data is written to the file as you go, and that's fundamental to working with data that can be larger than the available RAM. So you either need to know the size of the userblock before you create the file (and use The 'userblock' concept honestly seems like a weird choice on the part of the HDF5 developers. Presumably there was some reason for it at the time, but the apparent lack of API functions to access it makes me suspect that it's never been all that important. 🤷
There is no exposed API that I know of. I agree the semantics of writing are weird, given the lack of direct support and the fact that the size must be known when creating the file for the first time. Could there be a middle ground where I submit a PR for reading and have that be part of h5py, as opposed to having my own implementation maintained separately?
Yeah, I think we'd take a PR to read it. And it would also be easy enough to overwrite a userblock that's already allocated, if that's useful to you. It's just adding & resizing it that's a pain.
I'm a bit skeptical, if you have a We already expose the API to set the block size at creation time and ask about its size at read time seems simple. One complication I see is how to implement this when the file is not local on the disk (either using object stores or the I got a bit curious about how the userblock works and it looks like libhdf5 just starts checking the bytes at power-of-2 offsets from the start until it finds the magic numbers it is looking for (https://github.com/HDFGroup/hdf5/blob/695efa94dfcd62c5ef42d03a7f1425c4105819df/src/H5FDint.c#L136-L201 and https://github.com/HDFGroup/hdf5/blob/695efa94dfcd62c5ef42d03a7f1425c4105819df/src/H5Fprivate.h#L294-L296), which does leave the very low chance of a pathological userblock faking out the library...
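The offset scan described in this comment can be sketched in a few lines of plain Python. This is only an illustration, not libhdf5's actual code: the function name `find_superblock_offset` is hypothetical, and it assumes a single local file. It checks offset 0 and then 512, 1024, 2048, ... for the 8-byte HDF5 signature; the offset where the signature turns up is the userblock size.

```python
import os

# 8-byte signature that marks the start of the HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def find_superblock_offset(path):
    """Return the offset of the HDF5 signature (the userblock size).

    Candidate offsets are 0, then 512, 1024, 2048, ...
    Returns None if no candidate offset holds the signature.
    """
    file_size = os.path.getsize(path)
    with open(path, "rb") as fh:
        offset = 0
        while offset + len(HDF5_SIGNATURE) <= file_size:
            fh.seek(offset)
            if fh.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return offset
            # Move to the next candidate: 0 -> 512 -> 1024 -> 2048 -> ...
            offset = 512 if offset == 0 else offset * 2
    return None
```

Because only the candidate offsets are inspected, a userblock that happens to contain the signature bytes at one of those offsets would be misread as the start of the HDF5 data, which is the pathological case mentioned above.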
Yeah, that's what I was thinking - only support this for the simple case of a single, local file accessed the normal way (where HDF5 manages the file descriptor). And we just read the whole userblock in one go - I don't think we need to complicate it with ways to read part of the userblock. We should check if
It seems that libhdf5 patches up the create plist on read:

from pathlib import Path
import subprocess

import h5py

# text that will become the user block
with open('/tmp/comment', 'w') as fout:
    fout.write('Hello World\n')

# plain HDF5 file without a user block
with h5py.File('/tmp/inp.h5', 'w') as fout:
    fout['a'] = range(5)

# add user block with h5jam
subprocess.run(['h5jam', '-i', 'inp.h5', '-u', 'comment', '-o', 'target.h5'],
               cwd='/tmp')

# strip the user block
with open('/tmp/target.h5', 'rb') as fin:
    fin.read(512)
    with open('/tmp/target2.h5', 'wb') as fout:
        fout.write(fin.read())

# add (empty) userblock "manually"
with open('/tmp/inp.h5', 'rb') as fin:
    with open('/tmp/target3.h5', 'wb') as fout:
        fout.write(b' ' * (2 ** 15))
        fout.write(fin.read())

# report the userblock size h5py sees for each file
for f in ['inp.h5', 'target.h5', 'target2.h5', 'target3.h5']:
    with h5py.File(Path('/tmp/') / f, 'r') as fin:
        print(fin, fin.userblock_size)

Maybe it makes sense to have

def h5jam(input: str | Path | h5py.File,
          userblock: bytes | BytesIO | file | str | Path,
          out: str | Path | file,
          clobber: bool = False):
    ...

but we probably need some checking to make sure we are not modifying a file we already have open for reading/writing? We already support setting the userblock size at file creation time (looks like Andrew added it in 2011).
I am skeptical too about handling the user block from h5py. The user block is for cases where some information about the file, or some data from the file, is needed and libhdf5 is either not available or not the best choice. Why not write to the user block after the file has been closed (libhdf5 is done with the file)? Accessing the user block is super easy from practically any programming language. Just read the first n bytes (n >= reasonable expectation of the user block size, say, 10MB) from the file with whatever is applicable for the storage system (local, remote, object store, plain HTTP server, etc.) and process that content.
The triviality is part of the reason I am interested in adding the PR here. I have a few tools at my job where this is useful and instead of having to maintain my own patch, I could add it here for myself and others. Additionally, I would already be opening up the hdf5 with
Why do you need to know the exact user block size prior to reading? Do you intend to make the user block differ in size between files? As long as you read enough bytes to cover the entire block (even with some extra bytes), that's fine for processing the block's content.
For my use case, yes, the userblock size is variable. The process that produces the hdf5 file will calculate the necessary size of the userblock, set it, and then write to the front of the file after libhdf5 is done. I mean sure, I can do this with a function that just scans the file until I hit the sequence of bytes that signifies a valid hdf5 file, but that makes the implementation slightly more difficult than reading the first 2^N bytes. Probably slower too.
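The workflow in this comment (compute the size, reserve it, write to the front once libhdf5 is done) might look roughly like the sketch below. The helper names `userblock_size_for` and `write_userblock` are hypothetical and not part of h5py. The sketch relies on the HDF5 rule that a non-zero userblock size must be a power of two of at least 512 bytes, and it assumes the file is local and already closed.

```python
def userblock_size_for(payload_len):
    """Smallest valid userblock size (power of two, >= 512) that holds payload_len bytes."""
    size = 512
    while size < payload_len:
        size *= 2
    return size


def write_userblock(path, payload, userblock_size):
    """Overwrite the reserved userblock of a closed file with payload.

    The payload is zero-padded to userblock_size so stale bytes are cleared.
    Raises ValueError instead of spilling into the HDF5 data.
    """
    if len(payload) > userblock_size:
        raise ValueError("payload does not fit in the userblock")
    # "r+b" updates the file in place without truncating it.
    with open(path, "r+b") as fh:
        fh.write(payload.ljust(userblock_size, b"\x00"))
```

With h5py, the space would be reserved at creation time via `h5py.File(path, 'w', userblock_size=userblock_size_for(len(payload)))`, and `write_userblock` would be called only after that file object has been closed.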
libhdf5 has to figure out where the HDF5 content begins so it will do the
Is there something more to discuss here, or should this be closed?
I was actually going to take a crack at this, per what @tacaswell defined. I didn't see a general consensus against it.
I have read this Google Groups conversation from 2011. I would like to add an API to h5py.File that assists in reading and writing the userblock.

Proposal

Add a property to h5py.File that gets/sets a bytesarray that is appropriate to the userblock.

Use and implementation details

- Getting userblock: obtain the size of the userblock for the currently open HDF5 file and return a bytesarray read from that number of bytes at the start of the HDF5 file, which is the userblock.
- Setting userblock with a bytesarray: store the bytesarray (if in writing mode; otherwise do nothing, maybe emit a warning if in read mode).
- Write the bytesarray to the start of the file. Emit a warning if the stored bytesarray is longer than the specified userblock and the excess has been truncated.

We have begun utilizing the userblock at my work and I would like to add this contribution to h5py. I would like the userblock so that we have fast access to identifying data about the HDF5 files that we have generated. I would prefer not to store this information as metadata because, given the contents of what we would store in the userblock, we may or may not use the HDF5 file in the current process. Storing it as attributes can take longer to access than we would like.
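The setter behaviour described in the proposal (warn and truncate when the data is longer than the reserved userblock) could be sketched as follows. `fit_to_userblock` is a hypothetical helper, not an existing h5py function, and the zero padding of short data is an assumption that the proposal does not specify.

```python
import warnings


def fit_to_userblock(data, userblock_size):
    """Return exactly userblock_size bytes built from data.

    Longer data is truncated with a warning; shorter data is zero-padded
    (the padding byte is an assumption, not part of the proposal).
    """
    if len(data) > userblock_size:
        warnings.warn(
            f"userblock data is {len(data)} bytes but only {userblock_size} "
            "bytes are reserved; the excess has been truncated",
            stacklevel=2,
        )
        data = data[:userblock_size]
    return bytes(data).ljust(userblock_size, b"\x00")
```

The returned bytes could then be written at offset 0 of the file once libhdf5 has closed it, as discussed in the comments.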