Poor performance streaming object references #2395

Open
bjhardcastle opened this issue Mar 19, 2024 · 5 comments

@bjhardcastle
When using fsspec to stream hdf5 files with object references, object de-referencing seems to read more data than is necessary:

import time
import fsspec
import h5py
import psutil


LARGE_HDF5_URL = 'https://dandiarchive.s3.amazonaws.com/blobs/f78/fe2/f78fe2a6-3dc9-4c12-a288-fbf31ce6fc1c'
SMALL_HDF5_URL = 'https://dandiarchive.s3.amazonaws.com/blobs/56c/31a/56c31a1f-a6fb-4b73-ab7d-98fb5ef9a553' 

url = SMALL_HDF5_URL  # for quicker testing, use the small file

fsspec.get_filesystem_class("https").clear_instance_cache()
filesystem = fsspec.filesystem("https")
byte_stream = filesystem.open(path=url, mode="rb", cache_type="first")
nwb = h5py.File(name=byte_stream)
    
# this is an instance of <HDF5 object reference>:
object_reference = nwb['units/electrode_group'][0]

# the location that `object_reference` points to (which currently can't be
# determined from the opaque object reference in python)
url_to_actual_location = {
    LARGE_HDF5_URL: '/general/extracellular_ephys/17216703352 1-281',
    SMALL_HDF5_URL: '/general/extracellular_ephys/18005110031 1-281',
}

def get_time_and_memory():
    m0 = psutil.Process().memory_info().rss
    t0 = time.time()
    yield
    t1 = time.time()
    m1 = psutil.Process().memory_info().rss
    yield f"{t1 - t0:.2f} s, {(m1 - m0) / 1024**2:.2f} MB"

# 1. accessing the location directly and reading metadata is fast:
tm = get_time_and_memory()
next(tm)
_ = nwb[url_to_actual_location[url]].name
print(f"1. Got referenced object data directly: {next(tm)}")

# 2. when using the object reference, a lazy accessor seems to be returned initially
# (which is fast):
tm = get_time_and_memory()
next(tm)
lazy_object_data = nwb[object_reference]
print(f"2. Got lazy object reference: {next(tm)}")

# 3''. de-reference the lazy object to get its location, then
# 3'. use the location directly:
tm = get_time_and_memory()
next(tm)
loc = h5py.h5r.get_name(object_reference, nwb.id)
print(f"3''. Got de-referenced location: {next(tm)}")

tm = get_time_and_memory()
next(tm)
reference_path = nwb[loc].name
print(f"3'. Got de-referenced object data: {next(tm)}")

# 3. when the same component is accessed, it is much slower than in 1. - suggests
#    more data than necessary is being read
tm = get_time_and_memory()
next(tm)
reference_path = lazy_object_data.name
print(f"3. Got referenced object data: {next(tm)}")
assert reference_path == url_to_actual_location[url]      

# 4. subsequent access of a different component is fast - supporting the idea that 
#    more data than necessary is being read (and cached) in 3. 
second_object_reference = nwb['units/electrode_group'][-1]
tm = get_time_and_memory()
next(tm)
second_reference_path = nwb[second_object_reference].name
print(f"4. Got second de-referenced object data: {next(tm)}")
assert second_reference_path != url_to_actual_location[url]

output:

1. Got referenced object data directly: 0.99 s, 0.01 MB
2. Got lazy object reference: 0.00 s, 0.02 MB
3''. Got de-referenced location: 15.27 s, 0.32 MB
3'. Got de-referenced object data: 0.00 s, 0.00 MB
3. Got referenced object data: 0.00 s, 0.00 MB
4. Got second referenced object data: 0.39 s, 0.00 MB

This becomes a problem for the large file URL. It would be preferable to get the location that the object points to and use it directly rather than de-reference, but it seems impossible to get the location without reading the entirety of the de-referenced data.

I thought get_name() might help:

h5py/h5py/h5r.pyx, lines 132 to 147 at d051d24:

def get_name(Reference ref not None, ObjectID loc not None):
    """(Reference ref, ObjectID loc) => STRING name

    Determine the name of the object pointed to by this reference.
    """
    cdef ssize_t namesize = 0
    cdef char* namebuf = NULL

    namesize = H5Rget_name(loc.id, <H5R_type_t>ref.typecode, &ref.ref, NULL, 0)
    if namesize > 0:
        namebuf = <char*>malloc(namesize+1)
        try:
            namesize = H5Rget_name(loc.id, <H5R_type_t>ref.typecode, &ref.ref, namebuf, namesize+1)
            return namebuf
        finally:
            free(namebuf)

but it seems to read the same amount of data.

I'm curious why H5Rget_name() (which apparently returns the length of the name) is used instead of H5Rget_name_string() (https://docs.hdfgroup.org/hdf5/v1_14/group___j_h5_r.html#ga48c4d6cb9e011af084d3c8088b121ac5), but I can't see its source code.
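
In the meantime, a possible stop-gap (only a sketch; it reuses the `nwb` handle and reference dataset from the snippet above, and `reference_paths` is just a placeholder name) is to pay the expensive metadata read once up front by resolving every reference in the dataset to its path, then using the paths directly:

# rough workaround sketch: resolve all object references once, so the slow
# metadata read seen in step 3'' happens a single time, then use paths directly
reference_paths = {}
for ref in nwb['units/electrode_group']:
    path = h5py.h5r.get_name(ref, nwb.id).decode()  # bytes -> str
    if path not in reference_paths:
        reference_paths[path] = nwb[path]

Based on observation 4 above, everything after the first de-reference should hit the cached data, so only the first iteration of the loop should be slow.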


  • Operating System: Win10
  • Python version: 3.11.3
  • Where Python was acquired: system Python
  • h5py version: 3.10.0
  • HDF5 version: 1.14.2
@ajelenak
Contributor

I think what you reported here is a known issue for any HDF5 file created with default libhdf5 settings and then copied into an object store. Can you report the output of the h5stat -S myfile.h5 command here? The important line starts with File metadata:.

@bjhardcastle
Author

Here's the output from h5stat -S:

Filename: small.hdf5
File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
File space page size: 4096 bytes
Summary of file space information:
  File metadata: 134232 bytes
  Raw data: 132887502 bytes
  Amount/Percent of tracked free space: 0 bytes/0.0%
  Unaccounted space: 753104 bytes
Total space: 133774838 bytes

@ajelenak
Contributor

I see two options:

  1. Use libhdf5-1.14.3 with its ros3 driver. This combination is available from the Conda Forge package repository. In this case, libhdf5 will cache the file's first 16 MiB on file open, so the entire internal metadata may well be picked up. This does not require any modification of the original file.
  2. Using the h5repack CLI tool from libhdf5-1.14.3, create a paged-aggregation version of the original file with an 8 MiB file page:
    h5repack -S PAGE -G $(expr 8 \* 1024 \* 1024) small.hdf5 co_small.hdf5

    This method increases the output file size, typically by a few percent.

For now, I suggest using libhdf5's ros3 driver to avoid a mismatch with fsspec's default request block size. Below are h5py open statements that should hopefully show improved performance:

For # 1 above:

h5py.File(SMALL_HDF5_URL, mode='r', driver='ros3')

For # 2:

h5py.File(SMALL_HDF5_URL, mode='r', driver='ros3', page_buf_size=67_108_864)

page_buf_size will set up a 64 MiB cache of file pages, enough for up to eight 8 MiB file pages.

I assumed access to the HDF5 file does not require S3 authentication. If it does, the above calls will require additional keyword arguments.
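
For example (a sketch only, with a hypothetical URL standing in for the repacked file and placeholder credentials; the ros3 keyword arguments are passed as bytes):

import h5py

# hypothetical location of the h5repack'ed, page-aggregated copy of the file
CO_SMALL_HDF5_URL = "https://example-bucket.s3.amazonaws.com/co_small.hdf5"

nwb = h5py.File(
    CO_SMALL_HDF5_URL,
    mode="r",
    driver="ros3",
    page_buf_size=67_108_864,      # 64 MiB page buffer = 8 pages of 8 MiB
    aws_region=b"us-east-2",       # placeholder credentials; only needed if
    secret_id=b"<access key id>",  # the file is not publicly readable
    secret_key=b"<secret key>",
)

# object references should now resolve from cached metadata pages
print(nwb[nwb['units/electrode_group'][0]].name)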

@bjhardcastle
Author

Thanks a lot for your advice!

I'm trying to follow # 1, but I'm struggling to get h5py 3.10.0 working with libhdf5-1.14.3 on Ubuntu (it fails on import, but I'm probably doing something wrong):

>>> python -c "import h5py; print(h5py.version.info)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/pypy3.9/site-packages/h5py/__init__.py", line 25, in <module>
    from . import _errors
ImportError: cannot import name '_errors' from partially initialized module 'h5py' (most likely due to a circular import) (/opt/conda/lib/pypy3.9/site-packages/h5py/__init__.py)

Will update when I can actually test the suggestions.

@ajelenak
Contributor

@bjhardcastle Any updates?
