Poor performance streaming object references #2395

Open
bjhardcastle opened this issue Mar 19, 2024 · 5 comments

@bjhardcastle
When using fsspec to stream hdf5 files with object references, object de-referencing seems to read more data than is necessary:

import time
import fsspec
import h5py
import psutil


LARGE_HDF5_URL = 'https://dandiarchive.s3.amazonaws.com/blobs/f78/fe2/f78fe2a6-3dc9-4c12-a288-fbf31ce6fc1c'
SMALL_HDF5_URL = 'https://dandiarchive.s3.amazonaws.com/blobs/56c/31a/56c31a1f-a6fb-4b73-ab7d-98fb5ef9a553' 

url = SMALL_HDF5_URL  # for quicker testing, use the small file

fsspec.get_filesystem_class("https").clear_instance_cache()
filesystem = fsspec.filesystem("https")
byte_stream = filesystem.open(path=url, mode="rb", cache_type="first")
nwb = h5py.File(name=byte_stream)
    
# this is an instance of <HDF5 object reference>:
object_reference = nwb['units/electrode_group'][0]

# the location that `object_reference` points to (which currently can't be
# determined from the opaque object reference in python)
url_to_actual_location = {
    LARGE_HDF5_URL: '/general/extracellular_ephys/17216703352 1-281',
    SMALL_HDF5_URL: '/general/extracellular_ephys/18005110031 1-281',
}

def get_time_and_memory():
    m0 = psutil.Process().memory_info().rss
    t0 = time.time()
    yield
    t1 = time.time()
    m1 = psutil.Process().memory_info().rss
    yield f"{t1 - t0:.2f} s, {(m1 - m0) / 1024**2:.2f} MB"

# 1. accessing the location directly and reading metadata is fast:
tm = get_time_and_memory()
next(tm)
_ = nwb[url_to_actual_location[url]].name
print(f"1. Got referenced object data directly: {next(tm)}")

# 2. when using the object reference, a lazy accessor seems to be returned initially
# (which is fast):
tm = get_time_and_memory()
next(tm)
lazy_object_data = nwb[object_reference]
print(f"2. Got lazy object reference: {next(tm)}")

# 3''. de-reference the lazy object to get its location, then
# 3'. use the location directly:
tm = get_time_and_memory()
next(tm)
loc = h5py.h5r.get_name(object_reference, nwb.id)
print(f"3''. Got de-referenced location: {next(tm)}")

tm = get_time_and_memory()
next(tm)
reference_path = nwb[loc].name
print(f"3'. Got de-referenced object data: {next(tm)}")

# 3. when the same component is accessed, it is much slower than in 1. - suggests
#    more data than necessary is being read
tm = get_time_and_memory()
next(tm)
reference_path = lazy_object_data.name
print(f"3. Got referenced object data: {next(tm)}")
assert reference_path == url_to_actual_location[url]      

# 4. subsequent access of a different component is fast - supporting the idea that 
#    more data than necessary is being read (and cached) in 3. 
second_object_reference = nwb['units/electrode_group'][-1]
tm = get_time_and_memory()
next(tm)
second_reference_path = nwb[second_object_reference].name
print(f"4. Got second de-referenced object data: {next(tm)}")
assert second_reference_path != url_to_actual_location[url]

output:

1. Got referenced object data directly: 0.99 s, 0.01 MB
2. Got lazy object reference: 0.00 s, 0.02 MB
3''. Got de-referenced location: 15.27 s, 0.32 MB
3'. Got de-referenced object data: 0.00 s, 0.00 MB
3. Got referenced object data: 0.00 s, 0.00 MB
4. Got second referenced object data: 0.39 s, 0.00 MB

This becomes a problem for the large file URL. It would be preferable to get the location that the object points to and use it directly rather than de-reference, but it seems impossible to get the location without reading the entirety of the de-referenced data.

I thought get_name() might help:

h5py/h5py/h5r.pyx, lines 132 to 147 at d051d24:

def get_name(Reference ref not None, ObjectID loc not None):
    """(Reference ref, ObjectID loc) => STRING name

    Determine the name of the object pointed to by this reference.
    """
    cdef ssize_t namesize = 0
    cdef char* namebuf = NULL

    namesize = H5Rget_name(loc.id, <H5R_type_t>ref.typecode, &ref.ref, NULL, 0)
    if namesize > 0:
        namebuf = <char*>malloc(namesize+1)
        try:
            namesize = H5Rget_name(loc.id, <H5R_type_t>ref.typecode, &ref.ref, namebuf, namesize+1)
            return namebuf
        finally:
            free(namebuf)

but it seems to read the same amount of data.

I'm curious why H5Rget_name() (which apparently returns the length of the name) is used instead of H5Rget_name_string() (https://docs.hdfgroup.org/hdf5/v1_14/group___j_h5_r.html#ga48c4d6cb9e011af084d3c8088b121ac5), but I can't see its source code.
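
In the meantime, a possible stop-gap (only a sketch; it reuses the `nwb` handle and reference dataset from the snippet above, and `reference_paths` is just a placeholder name) is to pay the expensive metadata read once up front by resolving every reference in the dataset to its path, then using the paths directly:

# rough workaround sketch: resolve all object references once, so the slow
# metadata read seen in step 3'' happens a single time, then use paths directly
reference_paths = {}
for ref in nwb['units/electrode_group']:
    path = h5py.h5r.get_name(ref, nwb.id).decode()  # bytes -> str
    if path not in reference_paths:
        reference_paths[path] = nwb[path]

Based on observation 4 above, everything after the first de-reference should hit the cached data, so only the first iteration of the loop should be slow.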


  • Operating System: Win10
  • Python version: 3.11.3
  • Where Python was acquired: system Python
  • h5py version: 3.10.0
  • HDF5 version: 1.14.2
@ajelenak
Contributor

I think what you reported here is a known issue for any HDF5 file created with default libhdf5 settings and then copied into an object store. Can you report the output of the h5stat -S myfile.h5 command here? The important line starts with File metadata:.

@bjhardcastle
Author

Here's the output from h5stat -S:

Filename: small.hdf5
File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
File space page size: 4096 bytes
Summary of file space information:
  File metadata: 134232 bytes
  Raw data: 132887502 bytes
  Amount/Percent of tracked free space: 0 bytes/0.0%
  Unaccounted space: 753104 bytes
Total space: 133774838 bytes

@ajelenak
Contributor

I see two options:

  1. Use libhdf5-1.14.3 with its ros3 driver. This combination is available from the Conda Forge package repository. In this case, libhdf5 will cache the file's first 16 MiB on file open, so the entire internal metadata may well be picked up. This does not require any modification of the original file.
  2. Using the h5repack CLI tool from libhdf5-1.14.3, create a paged-aggregation version of the original file with an 8 MiB file page:
    h5repack -S PAGE -G $(expr 8 \* 1024 \* 1024) small.hdf5 co_small.hdf5

    This method increases the output file size, typically by a few percent.

For now, I suggest using libhdf5's ros3 driver to avoid a mismatch with fsspec's default request block size. Below are h5py open statements that should hopefully show improved performance:

For # 1 above:

h5py.File(SMALL_HDF5_URL, mode='r', driver='ros3')

For # 2:

h5py.File(SMALL_HDF5_URL, mode='r', driver='ros3', page_buf_size=67_108_864)

page_buf_size will set up a 64 MiB cache of file pages, enough for up to eight 8 MiB file pages.

I assumed access to the HDF5 file does not require S3 authentication. If it does, the above calls will require additional keyword arguments.
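
For example (a sketch only, with a hypothetical URL standing in for the repacked file and placeholder credentials; the ros3 keyword arguments are passed as bytes):

import h5py

# hypothetical location of the h5repack'ed, page-aggregated copy of the file
CO_SMALL_HDF5_URL = "https://example-bucket.s3.amazonaws.com/co_small.hdf5"

nwb = h5py.File(
    CO_SMALL_HDF5_URL,
    mode="r",
    driver="ros3",
    page_buf_size=67_108_864,      # 64 MiB page buffer = 8 pages of 8 MiB
    aws_region=b"us-east-2",       # placeholder credentials; only needed if
    secret_id=b"<access key id>",  # the file is not publicly readable
    secret_key=b"<secret key>",
)

# object references should now resolve from cached metadata pages
print(nwb[nwb['units/electrode_group'][0]].name)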

@bjhardcastle
Author

Thanks a lot for your advice!

I'm trying to follow # 1, but I'm struggling to get h5py 3.10.0 working with libhdf5-1.14.3 on Ubuntu (it fails on import, but I'm probably doing something wrong):

>>> python -c "import h5py; print(h5py.version.info)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/pypy3.9/site-packages/h5py/__init__.py", line 25, in <module>
    from . import _errors
ImportError: cannot import name '_errors' from partially initialized module 'h5py' (most likely due to a circular import) (/opt/conda/lib/pypy3.9/site-packages/h5py/__init__.py)

Will update when I can actually test the suggestions.

@ajelenak
Contributor

@bjhardcastle Any updates?
