
Better access to compression/filter information #2180

Open: takluyver wants to merge 5 commits into master
Conversation

takluyver (Member)

This follows from the discussion in #2161. The changes are:

  • dset.compression will return 'unknown' rather than None if a plugin filter is used and none of the three built-in compression filters (gzip, lzf, szip) are included.
    • Technically the unknown filter doesn't have to be a compression filter, but it probably will be.
    • What should we do if plugin filters are used along with one that we recognise (like gzip)? In master, and in this PR at present, we still name the filter we recognise.
  • New properties dset.filter_ids and dset.filter_names give you more information on all the filters in use, compression or otherwise (see the sketch below).
    • If you need to get the filter options or flags, you still need to go to the low-level API: dset.id.get_create_plist().get_filter(i)
  • group.create_dataset_like() can now copy a filter pipeline even if it doesn't recognise the filters in it.
    • I'm still working out how this interacts with keyword arguments.

Closes #2161
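
A minimal sketch of the proposed properties in use (the printed values are illustrative assumptions; the actual return types and name strings may differ):

    import h5py
    import numpy as np

    with h5py.File("example.h5", "w") as f:
        dset = f.create_dataset(
            "x", data=np.arange(100), compression="gzip", shuffle=True
        )
        print(dset.compression)   # 'gzip' - recognised built-in filters keep their name
        print(dset.filter_names)  # e.g. ('shuffle', 'deflate')
        print(dset.filter_ids)    # e.g. (2, 1) - the registered HDF5 filter IDs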

@takluyver takluyver marked this pull request as ready for review November 26, 2022 12:20
@takluyver (Member, Author)

Well, it now looks like I've reimplemented part of HDF5's property list API in Python, to manage filters. That feels kind of silly, but I couldn't see a better way to allow create_dataset_like() to do things like 'replace gzip compression with Blosc' or 'leave the compression alone, but add a shuffle filter before the compression step', while also allowing for arbitrary filters.
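
For illustration, the kind of call this is meant to support might look like the sketch below. It assumes hdf5plugin is installed to register the Blosc filter, and the keyword handling shown is exactly the part still being worked out:

    import h5py
    import hdf5plugin  # assumption: installed; registers the Blosc plugin filter

    with h5py.File("example.h5", "a") as f:
        src = f["x"]  # an existing gzip-compressed dataset
        # Copy shape, dtype and chunking from src, but swap in Blosc compression:
        f.create_dataset_like("x_blosc", src, **hdf5plugin.Blosc())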

codecov bot commented Nov 26, 2022

Codecov Report

Base: 89.74% // Head: 89.66% // Decreases project coverage by 0.08% ⚠️

Coverage data is based on head (0973cec) compared to base (9fe1d9d).
Patch coverage: 92.24% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2180      +/-   ##
==========================================
- Coverage   89.74%   89.66%   -0.09%     
==========================================
  Files          17       17              
  Lines        2390     2476      +86     
==========================================
+ Hits         2145     2220      +75     
- Misses        245      256      +11     
Impacted Files Coverage Δ
h5py/_hl/filters.py 91.79% <90.42%> (-1.07%) ⬇️
h5py/_hl/dataset.py 92.90% <100.00%> (-0.34%) ⬇️
h5py/_hl/group.py 96.98% <100.00%> (+0.02%) ⬆️


@ajelenak (Contributor)

@takluyver thanks for this PR. Why should Dataset.compression only return something for those three compression filters? And are lzf and szip really built in, when libhdf5 only ships with gzip (and even that is a build configuration option)? How about expanding this to all registered compression filters? Another issue: I am aware of HDF5 datasets out there with more than one compression filter. I don't think that was intended, but those data exist now.

Another suggestion: would something like Dataset.filters work instead of filter_ids and filter_names? I feel like having more than one filter property is unnecessary. The filters property could return the complete information for each filter: id, name, settings. Is there a performance hit from collecting the full filter information for a dataset immediately?

@takluyver (Member, Author)

> Why should Dataset.compression only return something for those three compression filters?

As you've noted, reporting compression as a single string doesn't really work, because multiple filters can be in use - either by mistake, or because filters like shuffle form part of the compression step even though they don't compress data themselves.

So the compromise I picked here is to add dset.filter_names for a more complete view of the filter pipeline, but to leave dset.compression working as before for the three names you can pass to grp.create_dataset(..., compression=x). I think returning 'unknown' for other filters makes it clearer that the value can't be passed back in. We could make it guess when there's just one unknown filter, though.
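
For instance (using hdf5plugin's LZ4 filter as an illustrative plugin; the exact name string reported would depend on how the filter registers itself):

    import h5py
    import hdf5plugin  # assumption: installed
    import numpy as np

    with h5py.File("t.h5", "w") as f:
        d = f.create_dataset("z", data=np.zeros(1000), **hdf5plugin.LZ4())
        print(d.compression)   # 'unknown' with this PR (None on master)
        print(d.filter_names)  # whatever name the LZ4 plugin registered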

> Would something like Dataset.filters instead of filter_ids and filter_names work?

It can certainly work technically, but my guess is that for the high-level API, having ways to get just the names or just the IDs makes more sense. So long as the stored names are meaningful, the two should give you the same information, but the names are convenient for human understanding, whereas the IDs are more reliable for writing code (e.g. if BLOSC_ID in dset.filter_ids).
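
For example, a check by registered filter ID might look like this (32001 is the HDF5 filter ID registered for Blosc):

    BLOSC_ID = 32001  # HDF5's registered filter ID for Blosc

    if BLOSC_ID in dset.filter_ids:
        print("Blosc appears in the filter pipeline")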

I think the filter settings can be left to the low-level API. It's not much extra code to access them:

dset.id.get_create_plist().get_filter(0)
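
In the low-level API, get_filter() returns a tuple of the filter code, flags, client-data values and name, so listing everything is a short loop:

    plist = dset.id.get_create_plist()
    for i in range(plist.get_nfilters()):
        code, flags, values, name = plist.get_filter(i)
        print(code, flags, values, name)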

Successfully merging this pull request may close these issues.

Dataset.compression is None when using hdf5plugin compressors