
Dataset.compression is None when using hdf5plugin compressors #2161

Open · ivirshup opened this issue Oct 11, 2022 · 9 comments · May be fixed by #2180

@ivirshup

Description

When an HDF5 dataset is written using one of the compression filters from hdf5plugin, the dataset's compression attribute is None.

This is a feature request to change that.

Example

import h5py, numpy as np
import hdf5plugin

f = h5py.File("tmp.h5", "w")
# Write a dataset compressed with the LZ4 filter registered by hdf5plugin
dset = f.create_dataset("X", data=np.random.randn(50, 100), **hdf5plugin.LZ4())

assert dset.compression is not None
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 assert dset.compression is not None

But h5ls is able to see the name of the compression filter, so ideally h5py should too:

f.close()
!h5ls -v tmp.h5
Opened "tmp.h5" with sec2 driver.
X                        Dataset {50/50, 100/100}
    Location:  1:800
    Links:     1
    Chunks:    {25, 50} 10000 bytes
    Storage:   40000 logical bytes, 40064 allocated bytes, 99.84% utilization
    Filter-0:  HDF5 lz4 filter; see http://www.hdfgroup.org/services/contributions.html-32004 OPT {0}
    Type:      native double

Details

This is because the possible values for Dataset.compression are hardcoded here:

h5py/h5py/_hl/dataset.py

Lines 556 to 563 in 1487a54

@property
@with_phil
def compression(self):
    """Compression strategy (or None)"""
    for x in ('gzip','lzf','szip'):
        if x in self._filters:
            return x
    return None

It looks like there is an API for getting information about a filter: H5Zget_filter_info.

But I'm not sure how you would be able to tell whether a filter was a "compressor".
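For reference, h5py already exposes this low-level API. A minimal sketch of querying it, using LZ4's registered filter ID (32004, taken from the h5ls output above); note that this reports encode/decode availability, not whether the filter is a compressor:

import h5py.h5z as h5z
import hdf5plugin  # registers the plugin filters with libhdf5

LZ4_ID = 32004  # registered HDF5 filter ID for lz4 (see h5ls output above)

# filter_avail() says whether the filter can be used at all;
# get_filter_info() returns a bitmask saying whether encoding and/or
# decoding are available. Neither reveals whether the filter compresses.
if h5z.filter_avail(LZ4_ID):
    info = h5z.get_filter_info(LZ4_ID)
    can_encode = bool(info & h5z.FILTER_CONFIG_ENCODE_ENABLED)
    can_decode = bool(info & h5z.FILTER_CONFIG_DECODE_ENABLED)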

The complicated solution would be to let hdf5plugin register its compressors with h5py. Hopefully there is a more elegant solution.

Version info

Summary of the h5py configuration

h5py 3.7.0
HDF5 1.12.2
Python 3.9.12 (main, Mar 26 2022, 15:52:10)
[Clang 13.0.0 (clang-1300.0.29.30)]
sys.platform darwin
sys.maxsize 9223372036854775807
numpy 1.22.4
cython (built with) 0.29.30
numpy (built against) 1.19.3
HDF5 (built against) 1.12.2

@t20100
Contributor

t20100 commented Oct 12, 2022

The complicated solution would be to let hdf5plugin register its compressors with h5py.

And this would not solve the issue in all cases, because hdf5plugin is not the only way to install compression filters for use with h5py. For instance, on Linux, h5py (and therefore libhdf5) installed from system packages will load compression filters from a system folder (and/or from a folder set through an environment variable).
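For reference, a minimal sketch of that environment-variable mechanism (the plugin directory shown is hypothetical, and the variable must be set before libhdf5 initialises, i.e. before importing h5py):

import os

# HDF5_PLUGIN_PATH tells libhdf5 where to look for dynamically loaded
# filter plugins; set it before importing h5py. The path is hypothetical.
os.environ["HDF5_PLUGIN_PATH"] = "/opt/hdf5/plugins"

import h5py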

@takluyver
Member

What does dset.id.get_create_plist().get_filter(0) give you? There is a 'name' field which this should return; that might also be what h5ls is looking at.

One simple answer could be that if there's any filter ID that we don't recognise, we return 'unknown'. Not very specific, but better than effectively saying 'no compression'. It's also not clear what we'd do if there are two or more compression filters in the pipeline, though I guess that's unlikely.

Our 'gzip' name is also kind of wrong - HDF5 calls it deflate, and what's actually stored is zlib output - deflate with some different wrapper from gzip. But that's part of the API now, so we can't easily change it.

@rayosborn

I was looking into this same problem, so I'm pleased I don't have to create another issue. To answer @takluyver, your suggestion does indeed return information on the compression filter:

>>> f = h5py.File('test.h5', 'w')
>>> f.create_dataset('data', data=numpy.arange(100), **hdf5plugin.Blosc())
>>> f.close()
>>> f = h5py.File('test.h5', 'r')
>>> f['data'].id.get_create_plist().get_filter(0)
(32001, 1, (2, 2, 8, 800, 5, 1, 1), b'blosc')
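(For reference, the fields of the tuple returned by get_filter() are, in order, the filter ID, the flags, the filter's cd_values, and its registered name, so the b'blosc' entry here is the name h5py could surface.)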

I second @ivirshup's call for this to be fixed because it confused me for a couple of days. I had assumed that there was a bug in hdf5plugin and that no compression was being applied. Of course, f['data'].compression_opts should also return the relevant values.

@ajelenak
Contributor

What can be improved is the list of known (registered) compression filters, so that not just the built-ins are recognized. It will also have to support multiple compression filters, since such cases occur in NASA satellite data.

@vasole
Contributor

vasole commented Nov 21, 2022

One simple answer could be that if there's any filter ID that we don't recognise, we return 'unknown'. Not very specific, but better than effectively saying 'no compression'.

That small modification would already prevent confusion.

@ajelenak
Contributor

Not every filter is for data compression, so if a filter is unknown, why should it be reported by Dataset.compression? Is it an unknown compression method, or just an unknown filter?

@rayosborn

There is another twist to this story, discussed in the mention above. In dealing with a NeXpy issue, a user provided a file that contained compressed data, which NeXpy was able to decompress, presumably because of hdf5plugin. However, h5py doesn't seem to recognize the filter number, even though hdf5plugin had been imported and the decoding was successful.

>>> merged_data.nxfile['processed/result/data'].id.get_create_plist().get_filter(0)
ValueError: Filter number is invalid (filter number is invalid)

I should explain that merged_data.nxfile is an h5py.File object.

@takluyver
Member

My thinking on this is to make compression return 'unknown' if there's anything h5py doesn't recognise, and add a new property like dset.filter_names to retrieve the names from the filter info.

(The 'name' for LZ4 is 'HDF5 lz4 filter; see http://www.hdfgroup.org/services/contributions.html', but HDF5 calls the field name, so...)
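A minimal sketch of what that proposal might look like (this is not h5py's actual implementation; the filter-ID mapping and the filter_names helper are illustrative assumptions):

# Hypothetical sketch of the proposal, not h5py's code.
# HDF5 filter IDs for the filters h5py recognises natively.
_KNOWN = {1: 'gzip', 4: 'szip', 32000: 'lzf'}

def compression(dset):
    plist = dset.id.get_create_plist()
    nfilters = plist.get_nfilters()
    for i in range(nfilters):
        filter_id = plist.get_filter(i)[0]
        if filter_id in _KNOWN:
            return _KNOWN[filter_id]
    # Any unrecognised filter in the pipeline: report 'unknown'
    # rather than implying no compression at all.
    return 'unknown' if nfilters else None

def filter_names(dset):
    # get_filter() returns (filter_id, flags, cd_values, name)
    plist = dset.id.get_create_plist()
    return [plist.get_filter(i)[3].decode() for i in range(plist.get_nfilters())]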

Not every filter is for data compression so if a filter is unknown why it should be reported from Dataset.compression?

In practice, I think every registered filter not built into HDF5 itself is either doing some kind of compression, or preparing data so that a later compression filter will be more effective - and you can argue that e.g. bitgroom+zlib together is a different compression algorithm to zlib alone. So if there's a filter we don't recognise, chances are good that it's compression.

And of course 'unknown' can also mean 'unknown whether compression is in use'. 😉

However, h5py doesn't seem to recognize the filter number,...

I think this is a separate thing, but can you check ...get_create_plist().get_nfilters()? This "filter number is invalid" is the same error I get when trying to check beyond the end of the filter list, so it sounds like you're looking at a dataset that's not actually using filters.

@rayosborn
Copy link

@takluyver, you are right about no filter being applied.

>>> merged_data.nxfile['processed/result/data'].id.get_create_plist().get_nfilters()
0

The previous invalid filter number ValueError is what you get whenever a dataset has no applied filters. Ideally, h5py would issue a less misleading error message after checking get_nfilters() first, but I guess this is such a low-level function that it is not really part of the public API. In any case, I was able to read values from the dataset in another conda environment even without hdf5plugin installed, so the bug reported in the NeXpy issue is unrelated, even though importing hdf5plugin apparently fixed it.
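A minimal sketch of a defensive helper along those lines (hypothetical, not part of h5py):

def list_filters(dset):
    # Check get_nfilters() first so an empty pipeline returns []
    # instead of raising "Filter number is invalid".
    plist = dset.id.get_create_plist()
    return [plist.get_filter(i) for i in range(plist.get_nfilters())]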

@takluyver linked a pull request (#2180) on Nov 25, 2022 that will close this issue.