
Dataset.compression is None when using hdf5plugin compressors #2161

Open · ivirshup opened this issue Oct 11, 2022 · 9 comments · May be fixed by #2180

@ivirshup

Description

When an HDF5 dataset is written using one of the compression filters from hdf5plugin, the dataset's compression attribute is None.

This is a feature request to change that.

Example

import h5py, numpy as np
import hdf5plugin

f = h5py.File("tmp.h5", "w")
# Write a dataset compressed with the LZ4 filter registered by hdf5plugin
dset = f.create_dataset("X", data=np.random.randn(50, 100), **hdf5plugin.LZ4())

assert dset.compression is not None
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 assert dset.compression is not None

But h5ls is able to see the name of the compression filter, so ideally h5py should too:

f.close()
!h5ls -v tmp.h5
Opened "tmp.h5" with sec2 driver.
X                        Dataset {50/50, 100/100}
    Location:  1:800
    Links:     1
    Chunks:    {25, 50} 10000 bytes
    Storage:   40000 logical bytes, 40064 allocated bytes, 99.84% utilization
    Filter-0:  HDF5 lz4 filter; see http://www.hdfgroup.org/services/contributions.html-32004 OPT {0}
    Type:      native double

Details

This is because the possible values for Dataset.compression are hardcoded here:

h5py/h5py/_hl/dataset.py

Lines 556 to 563 in 1487a54

@property
@with_phil
def compression(self):
    """Compression strategy (or None)"""
    for x in ('gzip','lzf','szip'):
        if x in self._filters:
            return x
    return None

It looks like there is an API for getting information about a filter: H5Zget_filter_info.

But I'm not sure how you would be able to tell whether a filter was a "compressor".
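For reference, h5py already exposes this low-level API. A minimal sketch of querying it, using LZ4's registered filter ID (32004, taken from the h5ls output above); note that this reports encode/decode availability, not whether the filter is a compressor:

import h5py.h5z as h5z
import hdf5plugin  # registers the plugin filters with libhdf5

LZ4_ID = 32004  # registered HDF5 filter ID for lz4 (see h5ls output above)

# filter_avail() says whether the filter can be used at all;
# get_filter_info() returns a bitmask saying whether encoding and/or
# decoding are available. Neither reveals whether the filter compresses.
if h5z.filter_avail(LZ4_ID):
    info = h5z.get_filter_info(LZ4_ID)
    can_encode = bool(info & h5z.FILTER_CONFIG_ENCODE_ENABLED)
    can_decode = bool(info & h5z.FILTER_CONFIG_DECODE_ENABLED)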

The complicated solution would be to let hdf5plugin register its compressors with h5py. Hopefully there is a more elegant solution.

Version info

Summary of the h5py configuration

h5py 3.7.0
HDF5 1.12.2
Python 3.9.12 (main, Mar 26 2022, 15:52:10)
[Clang 13.0.0 (clang-1300.0.29.30)]
sys.platform darwin
sys.maxsize 9223372036854775807
numpy 1.22.4
cython (built with) 0.29.30
numpy (built against) 1.19.3
HDF5 (built against) 1.12.2

@t20100
Contributor

t20100 commented Oct 12, 2022

The complicated solution would be to let hdf5plugin register its compressors with h5py.

And this would not solve the issue in all cases, because hdf5plugin is not the only way to install compression filters for use with h5py. For instance, on Linux, h5py (and therefore libhdf5) installed from system packages will load compression filters from a system folder (and/or from a folder set through an environment variable).
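For reference, a minimal sketch of that environment-variable mechanism (the plugin directory shown is hypothetical, and the variable must be set before libhdf5 initialises, i.e. before importing h5py):

import os

# HDF5_PLUGIN_PATH tells libhdf5 where to look for dynamically loaded
# filter plugins; set it before importing h5py. The path is hypothetical.
os.environ["HDF5_PLUGIN_PATH"] = "/opt/hdf5/plugins"

import h5py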

@takluyver
Member

What does dset.id.get_create_plist().get_filter(0) give you? There is a 'name' field which this should return; that might also be what h5ls is looking at.

One simple answer could be that if there's any filter ID that we don't recognise, we return 'unknown'. Not very specific, but better than effectively saying 'no compression'. It's also not clear what we'd do if there are two or more compression filters in the pipeline, though I guess that's unlikely.

Our 'gzip' name is also kind of wrong - HDF5 calls it deflate, and what's actually stored is zlib output - deflate with some different wrapper from gzip. But that's part of the API now, so we can't easily change it.

@rayosborn

I was looking into this same problem, so I'm pleased I don't have to create another issue. To answer @takluyver, your suggestion does indeed return information on the compression filter:

>>> f = h5py.File('test.h5', 'w')
>>> f.create_dataset('data', data=numpy.arange(100), **hdf5plugin.Blosc())
>>> f.close()
>>> f = h5py.File('test.h5', 'r')
>>> f['data'].id.get_create_plist().get_filter(0)
(32001, 1, (2, 2, 8, 800, 5, 1, 1), b'blosc')
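(For reference, the fields of the tuple returned by get_filter() are, in order, the filter ID, the flags, the filter's cd_values, and its registered name, so the b'blosc' entry here is the name h5py could surface.)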

I second @ivirshup's call for this to be fixed because it confused me for a couple of days. I had assumed that there was a bug in hdf5plugin and that no compression was being applied. Of course, f['data'].compression_opts should also return the relevant values.

@ajelenak
Contributor

What can be improved is the list of known (registered) compression filters, so that not just the built-ins are recognized. It will also have to support multiple compression filters, since such cases occur in NASA satellite data.

@vasole
Contributor

vasole commented Nov 21, 2022

One simple answer could be that if there's any filter ID that we don't recognise, we return 'unknown'. Not very specific, but better than effectively saying 'no compression'.

That small modification would already prevent confusion.

@ajelenak
Contributor

Not every filter is for data compression, so if a filter is unknown, why should it be reported by Dataset.compression? Is it an unknown compression method, or just an unknown filter?

@rayosborn

There is another twist to this story, discussed in the mention above. In dealing with a NeXpy issue, a user provided a file that contained compressed data, which NeXpy was able to decompress, presumably because of hdf5plugin. However, h5py doesn't seem to recognize the filter number, even though hdf5plugin had been imported and the decoding was successful.

>>> merged_data.nxfile['processed/result/data'].id.get_create_plist().get_filter(0)
ValueError: Filter number is invalid (filter number is invalid)

I should explain that merged_data.nxfile is an h5py.File object.

@takluyver
Member

My thinking on this is to make compression return 'unknown' if there's anything h5py doesn't recognise, and add a new property like dset.filter_names to retrieve the names from the filter info.

(The 'name' for LZ4 is 'HDF5 lz4 filter; see http://www.hdfgroup.org/services/contributions.html', but HDF5 calls the field name, so...)
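A minimal sketch of what that proposal might look like (this is not h5py's actual implementation; the filter-ID mapping and the filter_names helper are illustrative assumptions):

# Hypothetical sketch of the proposal, not h5py's code.
# HDF5 filter IDs for the filters h5py recognises natively.
_KNOWN = {1: 'gzip', 4: 'szip', 32000: 'lzf'}

def compression(dset):
    plist = dset.id.get_create_plist()
    nfilters = plist.get_nfilters()
    for i in range(nfilters):
        filter_id = plist.get_filter(i)[0]
        if filter_id in _KNOWN:
            return _KNOWN[filter_id]
    # Any unrecognised filter in the pipeline: report 'unknown'
    # rather than implying no compression at all.
    return 'unknown' if nfilters else None

def filter_names(dset):
    # get_filter() returns (filter_id, flags, cd_values, name)
    plist = dset.id.get_create_plist()
    return [plist.get_filter(i)[3].decode() for i in range(plist.get_nfilters())]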

Not every filter is for data compression so if a filter is unknown why it should be reported from Dataset.compression?

In practice, I think every registered filter not built into HDF5 itself is either doing some kind of compression, or preparing data so that a later compression filter will be more effective - and you can argue that e.g. bitgroom+zlib together is a different compression algorithm to zlib alone. So if there's a filter we don't recognise, chances are good that it's compression.

And of course 'unknown' can also mean 'unknown whether compression is in use'. 😉

However, h5py doesn't seem to recognize the filter number,...

I think this is a separate thing, but can you check ...get_create_plist().get_nfilters()? This "filter number is invalid" is the same error I get when trying to check beyond the end of the filter list, so it sounds like you're looking at a dataset that's not actually using filters.

@rayosborn
Copy link

@takluyver, you are right about no filter being applied.

>>> merged_data.nxfile['processed/result/data'].id.get_create_plist().get_nfilters()
0

The previous invalid filter number ValueError is what you get whenever a dataset has no applied filters. Ideally, h5py would issue a less misleading error message after checking get_nfilters() first, but I guess this is such a low-level function that it is not really part of the public API. In any case, I was able to read values from the dataset in another conda environment even without hdf5plugin installed, so the bug reported in the NeXpy issue is unrelated, even though importing hdf5plugin apparently fixed it.
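A minimal sketch of a defensive helper along those lines (hypothetical, not part of h5py):

def list_filters(dset):
    # Check get_nfilters() first so an empty pipeline returns []
    # instead of raising "Filter number is invalid".
    plist = dset.id.get_create_plist()
    return [plist.get_filter(i) for i in range(plist.get_nfilters())]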

@takluyver linked a pull request (#2180) on Nov 25, 2022 that will close this issue.