Add api to h5py.File for reading and writing userblock. #2359

Open
Delengowski opened this issue Dec 22, 2023 · 13 comments

Comments

@Delengowski
Contributor

I have read this Google Groups conversation from 2011. I would like to add an API to h5py.File that assists in reading and writing the userblock.

Proposal

Add a property to h5py.File that gets/sets a bytearray holding the contents of the userblock.

Use and implementation details

  1. On access of userblock, obtain the size of the userblock for the currently open HDF5 file and return a bytearray read from that many bytes at the start of the file.
  2. On setting userblock with a bytearray, store the bytearray if the file is open for writing; otherwise do nothing (maybe emit a warning in read mode).
  3. At close of the file, in write mode, write out the HDF5 file as normal, then re-open it and write the stored bytearray to the start of the file. Emit a warning if the stored bytearray is longer than the specified userblock and the excess has been truncated.
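As a rough illustration of the truncation rule in step 3, here is a minimal sketch. The helper name `fit_to_userblock` is hypothetical, and padding the remainder with NUL bytes is an assumption; the proposal itself only specifies the warning and the truncation.

```python
import warnings


def fit_to_userblock(data: bytes, block_size: int) -> bytes:
    """Return exactly block_size bytes: data truncated or NUL-padded.

    Hypothetical helper, not part of h5py.
    """
    if len(data) > block_size:
        warnings.warn(
            f"userblock data is {len(data)} bytes but the block holds "
            f"{block_size}; the excess has been truncated",
            stacklevel=2,
        )
        data = data[:block_size]
    # Assumption: pad with NUL bytes so that stale content left over from
    # an earlier, longer write does not survive.
    return bytes(data).ljust(block_size, b"\0")
```

The result always has the length of the userblock, so it could be written to the start of the file in one call once libhdf5 has closed it.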

We have begun using the userblock at my work, and I would like to contribute this to h5py. We want the userblock so that we have fast access to identifying data about the HDF5 files we generate. I would prefer not to store this information as metadata because, depending on what the userblock contains, we may or may not go on to use the HDF5 file in the current process. Reading it from attributes can take longer than we would like.

@takluyver
Member

I don't think HDF5 itself has APIs for this - do you know otherwise?

Reading the userblock should be straightforward enough, with the caveat that it's probably only for 'regular' HDF5 files, not those spread over multiple files or accessed on a remote server.

Writing it is harder. Your implementation details imply that the file contents are constructed in memory and only written when you close the file, but that's not how HDF5 works - data is written to the file as you go, and that's fundamental to working with data that can be larger than the available RAM. So you either need to know the size of the userblock before you create the file (and use H5Pset_userblock), or else copy the entire file to add the userblock (this seems to be what the h5jam command does).

The 'userblock' concept honestly seems like a weird choice on the part of the HDF5 developers. Presumably there was some reason for it at the time, but the apparent lack of API functions to access it makes me suspect that it's never been all that important. 🤷

@Delengowski
Contributor Author

There is no exposed API that I know of.

I agree the semantics of writing are weird, given the lack of direct support and the fact that the size must be known when the file is first created.

Could there be a middle ground where I submit a PR for reading and have that be part of h5py as opposed to have my own implementation maintained separately?

@takluyver
Member

Yeah, I think we'd take a PR to read it. And it would also be easy enough to overwrite a userblock that's already allocated, if that's useful to you. It's just adding & resizing it that's a pain.
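To make the "overwrite a userblock that's already allocated" case concrete, here is a minimal sketch using only the standard library. The function name `overwrite_userblock` is hypothetical, and the sketch assumes a single local file that no HDF5 handle currently has open.

```python
def overwrite_userblock(path: str, block_size: int, data: bytes) -> None:
    """Write data into the first block_size bytes of an existing file.

    Hypothetical helper, not part of h5py. block_size must be the size of
    the userblock that was allocated when the file was created.
    """
    if len(data) > block_size:
        raise ValueError(
            f"{len(data)} bytes do not fit in a {block_size}-byte userblock"
        )
    # "r+b" updates the file in place; it neither truncates nor resizes it,
    # so the HDF5 content after the userblock is left untouched.
    with open(path, "r+b") as fh:
        fh.write(data.ljust(block_size, b"\0"))
```

The `block_size` argument would come from the file itself, for example from `File.userblock_size`, read before the file is closed.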

@tacaswell
Member

I'm a bit skeptical: if you have an h5py.File object, then we have already done all of the HDF5-related work to open the file. I am also a bit worried whenever we go beyond directly wrapping libhdf5 + our core analogy of "dict-of-(dict-of)-numpy arrays with attributes".

We already expose the API to set the block size at creation time, and asking about its size at read time seems simple. One complication I see is how to implement this when the file is not local on disk (either using object stores or the fileobj driver we ship). That said, I agree that adding API to File to get the (full) userblock back as bytes makes sense if we accept "fail hard if it is not trivial".

I got a bit curious about how the userblock works, and it looks like libhdf5 just checks the bytes at power-of-2 offsets from the start of the file until it finds the magic number it is looking for (https://github.com/HDFGroup/hdf5/blob/695efa94dfcd62c5ef42d03a7f1425c4105819df/src/H5FDint.c#L136-L201 and https://github.com/HDFGroup/hdf5/blob/695efa94dfcd62c5ef42d03a7f1425c4105819df/src/H5Fprivate.h#L294-L296). That does leave a very small chance of a pathological userblock faking out the library...
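The search described above can be sketched in pure Python. This is an illustration of the idea only, not a port of the libhdf5 code; the function name `find_superblock_offset` is hypothetical, and the sketch checks offset 0 first and then 512, 1024, 2048, and so on.

```python
import os

# The 8-byte HDF5 format signature ("magic number").
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def find_superblock_offset(path: str) -> int:
    """Return the first offset in 0, 512, 1024, ... holding the signature."""
    with open(path, "rb") as fh:
        file_size = fh.seek(0, os.SEEK_END)
        offset = 0
        while offset + len(HDF5_SIGNATURE) <= file_size:
            fh.seek(offset)
            if fh.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return offset
            # After offset 0, candidates double: 512, 1024, 2048, ...
            offset = 512 if offset == 0 else offset * 2
    raise ValueError(f"no HDF5 signature found in {path!r}")
```

In this sketch the pathological case is easy to reproduce: a 1024-byte userblock that happens to contain the signature at byte 512 makes the search stop at 512 instead of 1024.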

@takluyver
Member

takluyver commented Dec 27, 2023

> One complication I see is how to implement this when the file is not local on disk (either using object stores or the fileobj driver we ship). That said, I think I agree adding API to File to get the (full) userblock back as bytes makes sense to me if we accept "fail hard if it is not trivial".

Yeah, that's what I was thinking - only support this for the simple case of a single, local file accessed the normal way (where HDF5 manages the file descriptor). And we just read the whole userblock in one go - I don't think we need to complicate it with ways to read part of the userblock.

We should check if file.id.get_create_plist().get_userblock() gives you the size of the userblock as it is when you open the file, or whether there's a stored value from when the HDF5 file was created. I.e. does it know about a userblock added with h5jam, which doesn't modify the HDF5 data?
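For the simple case discussed here (a single local file, whole userblock read in one go), a minimal sketch could look like the following. The function name `read_userblock` is hypothetical; the caller is assumed to supply the size, for example from `file.id.get_create_plist().get_userblock()`.

```python
def read_userblock(path: str, block_size: int) -> bytes:
    """Read the whole userblock of a local file in a single call.

    Hypothetical helper, not part of h5py.
    """
    if block_size == 0:
        # No userblock: the HDF5 content starts at byte 0.
        return b""
    with open(path, "rb") as fh:
        data = fh.read(block_size)
    if len(data) != block_size:
        # Fail hard rather than return a partial block.
        raise ValueError(
            f"expected {block_size} bytes, file only has {len(data)}"
        )
    return data
```

Raising on a short read follows the "fail hard if it is not trivial" approach; there is deliberately no option to read only part of the block.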

@tacaswell
Member

It seems that libhdf5 patches up the create plist on read:

from pathlib import Path
import subprocess
import h5py

with open('/tmp/comment', 'w') as fout:
    fout.write('Hello World\n')

with h5py.File('/tmp/inp.h5', 'w') as fout:
    fout['a'] = range(5)

# add user block with h5jam
subprocess.run(['h5jam', '-i', 'inp.h5', '-u', 'comment', '-o', 'target.h5'],
               cwd='/tmp')

# strip the user block
with open('/tmp/target.h5', 'rb') as fin:
    fin.read(512)
    with open('/tmp/target2.h5', 'wb') as fout:
        fout.write(fin.read())

# add (empty) userblock "manually"
with open('/tmp/inp.h5', 'rb') as fin:
    with open('/tmp/target3.h5', 'wb') as fout:
        fout.write(b' ' * (2 ** 15))
        fout.write(fin.read())


for f in ['inp.h5', 'target.h5', 'target2.h5', 'target3.h5']:
    with h5py.File(Path('/tmp/') / f, 'r') as fin:
        print(fin, fin.userblock_size)

which prints:

<HDF5 file "inp.h5" (mode r)> 0
<HDF5 file "target.h5" (mode r)> 512
<HDF5 file "target2.h5" (mode r)> 0
<HDF5 file "target3.h5" (mode r)> 32768

Maybe it makes sense to have File.get_userblock for reading and add top-level functions (h5py.jam and h5py.unjam) for writing? I can also see a case for File.write_userblock(bytes) that lets you clobber the existing userblock data up to the size of the existing userblock, but that might also be better folded into h5py.jam to let it have a signature like:

def h5jam(input: str | Path | h5py.File,
          userblock: bytes | BytesIO | file | str | Path,  # `file`: an open file object
          out: str | Path | file,
          clobber: bool = False):
    ...

but probably need some checking to make sure we are not modifying a file we already have open for reading/writing?
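For reference, the copy-based approach can be sketched with the standard library alone. This is a simplified stand-in for the signature above, not a proposed implementation: the name `jam` and its three-argument form are hypothetical, the userblock is taken as bytes only, and the input file is assumed to have no userblock yet.

```python
import shutil


def jam(input_path: str, userblock: bytes, out_path: str) -> int:
    """Copy input_path to out_path behind a padded userblock.

    Returns the size of the userblock that was written.
    Hypothetical helper; assumes input_path has no userblock of its own.
    """
    # Userblock sizes are powers of two, starting at 512 bytes.
    block_size = 512
    while block_size < len(userblock):
        block_size *= 2
    # "xb" refuses to overwrite an existing file, so out_path can never be
    # the input file or any other file that already exists.
    with open(input_path, "rb") as src, open(out_path, "xb") as dst:
        dst.write(userblock.ljust(block_size, b"\0"))
        shutil.copyfileobj(src, dst)
    return block_size
```

Always writing to a fresh output path sidesteps part of the "file already open" concern, at the cost of copying the whole file.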

We already support setting the userblock size at file creation time (looks like Andrew added it in 2011).

@ajelenak
Contributor

I am skeptical too about handling the user block from h5py. The user block is for cases where some information about the file, or some data from the file, is needed and libhdf5 is either not available or not the best choice. Why not write to the user block after the file has been closed (libhdf5 is done with the file)?

Accessing the user block is super easy from practically any programming language. Just read the first n bytes (n >= reasonable expectation of the user block size, say, 10MB) from the file with whatever is applicable for the storage system (local, remote, object store, plain HTTP server, etc.) and process that content.

@Delengowski
Contributor Author

> Accessing the user block is super easy from practically any programming language. Just read the first n bytes (n >= reasonable expectation of the user block size, say, 10MB) from the file with whatever is applicable for the storage system (local, remote, object store, plain HTTP server, etc.) and process that content.

The triviality is part of the reason I am interested in adding the PR here. I have a few tools at my job where this is useful, and instead of maintaining my own patch, I could add it here for myself and others. Additionally, I would already be opening the HDF5 file and calling file.id.get_create_plist().get_userblock() to get the size I need to read.

@ajelenak
Contributor

Why do you need to know the exact user block size prior to reading? Do you intend to make user block differ in size between files?

As long as you read enough bytes to have the entire block (even with some extra bytes) -- that's fine to process the block's content.

@Delengowski
Contributor Author

For my use case, yes the userblock size is variable. The process that produces the hdf5 file will calculate the necessary size of the userblock, set it, and then write to the front of the file after libhdf5 is done.
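The size calculation mentioned here might look like the sketch below. HDF5 only accepts a userblock size of 0 or a power of two of at least 512 bytes; the function name `userblock_size_for` is hypothetical.

```python
def userblock_size_for(payload_len: int) -> int:
    """Smallest valid userblock size that can hold payload_len bytes.

    Valid sizes are 0 (no userblock) or a power of two >= 512.
    """
    if payload_len < 0:
        raise ValueError("payload_len must be non-negative")
    if payload_len == 0:
        return 0
    size = 512
    while size < payload_len:
        size *= 2
    return size
```

The result is the value that would be passed as `userblock_size` when the file is created, before any data is written.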

I mean, sure, I can do this with a function that just scans the file until I hit the sequence of bytes that signifies a valid HDF5 file, but that makes the implementation slightly more difficult than reading the first 2^N bytes. Probably slower too.

@ajelenak
Contributor

libhdf5 has to figure out where the HDF5 content begins so it will do the $2^N$ ($N \geq 9$) bytes sampling on file open in order to give you the user block size later. Since user block content is entirely arbitrary, you could store the actual file's block size in its first eight bytes.
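A minimal sketch of the size-prefix idea, using only the standard library. The layout is an assumption made for illustration (an unsigned 64-bit little-endian integer followed by the payload and NUL padding), and the names `build_userblock` and `read_prefixed_userblock` are hypothetical.

```python
import struct

# Assumed layout: the first eight bytes hold the total userblock size as an
# unsigned 64-bit little-endian integer.
PREFIX = struct.Struct("<Q")


def build_userblock(payload: bytes) -> bytes:
    """Return a size-prefixed userblock padded to a valid HDF5 size."""
    needed = PREFIX.size + len(payload)
    size = 512
    while size < needed:
        size *= 2
    return (PREFIX.pack(size) + payload).ljust(size, b"\0")


def read_prefixed_userblock(path: str) -> bytes:
    """Read the payload (including padding) without involving libhdf5."""
    with open(path, "rb") as fh:
        (size,) = PREFIX.unpack(fh.read(PREFIX.size))
        return fh.read(size - PREFIX.size)
```

With this layout a reader needs two small reads and no knowledge of the HDF5 format, which fits the use case of inspecting files that may never be opened with libhdf5.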

@ajelenak
Contributor

Is there more to discuss here, or should this be closed?

@Delengowski
Contributor Author

I was actually going to take a crack at this, per what @tacaswell outlined.

I didn't see a general consensus against it.
