
Better access to compression/filter information #2180

Open: takluyver wants to merge 5 commits into master
Conversation

takluyver (Member)

This follows from the discussion in #2161. The changes are:

  • dset.compression will return 'unknown' rather than None if a plugin filter is used and none of the three built-in compression filters (gzip, lzf, szip) are included.
    • Technically the unknown filter doesn't have to be a compression filter, but it probably will be.
    • What should we do if plugin filters are used along with one that we recognise (like gzip)? In master, and in this PR at present, we still name the filter we recognise.
  • New properties dset.filter_ids and dset.filter_names give you more information on all the filters in use, compression or otherwise (see the sketch below).
    • If you need to get the filter options or flags, you still need to go to the low-level API: dset.id.get_create_plist().get_filter(i)
  • group.create_dataset_like() can now copy a filter pipeline even if it doesn't recognise the filters in it.
    • I'm still working out how this interacts with keyword arguments.

Closes #2161
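
A minimal sketch of the proposed properties in use (the printed values are illustrative assumptions; the actual return types and name strings may differ):

    import h5py
    import numpy as np

    with h5py.File("example.h5", "w") as f:
        dset = f.create_dataset(
            "x", data=np.arange(100), compression="gzip", shuffle=True
        )
        print(dset.compression)   # 'gzip' - recognised built-in filters keep their name
        print(dset.filter_names)  # e.g. ('shuffle', 'deflate')
        print(dset.filter_ids)    # e.g. (2, 1) - the registered HDF5 filter IDs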

@takluyver takluyver marked this pull request as ready for review November 26, 2022 12:20
@takluyver (Member, Author)

Well, it now looks like I've reimplemented part of HDF5's property list API in Python, to manage filters. That feels kind of silly, but I couldn't see a better way to allow create_dataset_like() to do things like 'replace gzip compression with Blosc' or 'leave the compression alone, but add a shuffle filter before the compression step', while also allowing for arbitrary filters.
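
For illustration, the kind of call this is meant to support might look like the sketch below. It assumes hdf5plugin is installed to register the Blosc filter, and the keyword handling shown is exactly the part still being worked out:

    import h5py
    import hdf5plugin  # assumption: installed; registers the Blosc plugin filter

    with h5py.File("example.h5", "a") as f:
        src = f["x"]  # an existing gzip-compressed dataset
        # Copy shape, dtype and chunking from src, but swap in Blosc compression:
        f.create_dataset_like("x_blosc", src, **hdf5plugin.Blosc())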

codecov bot commented Nov 26, 2022

Codecov Report

Base: 89.74% // Head: 89.66% // Decreases project coverage by 0.08% ⚠️

Coverage data is based on head (0973cec) compared to base (9fe1d9d).
Patch coverage: 92.24% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2180      +/-   ##
==========================================
- Coverage   89.74%   89.66%   -0.09%     
==========================================
  Files          17       17              
  Lines        2390     2476      +86     
==========================================
+ Hits         2145     2220      +75     
- Misses        245      256      +11     
Impacted Files Coverage Δ
h5py/_hl/filters.py 91.79% <90.42%> (-1.07%) ⬇️
h5py/_hl/dataset.py 92.90% <100.00%> (-0.34%) ⬇️
h5py/_hl/group.py 96.98% <100.00%> (+0.02%) ⬆️


@ajelenak (Contributor)

@takluyver thanks for this PR. Why should Dataset.compression only return something for those three compression filters? And are lzf and szip really built in, when libhdf5 only ships with gzip (and even that is a build configuration option)? How about expanding this to all registered compression filters? Another issue: I am aware of HDF5 datasets out there with more than one compression filter. I don't think that was intended, but those data exist now.

Another suggestion: would something like Dataset.filters work instead of filter_ids and filter_names? I feel like having more than one filter property is unnecessary. The filters property could return the complete information for each filter: id, name, settings. Is there a performance hit from collecting the full filter information for a dataset immediately?

@takluyver (Member, Author)

> Why should Dataset.compression only return something for those three compression filters?

As you've noted, reporting compression as a single string doesn't really work, because multiple filters can be in use - either by mistake, or because filters like shuffle form part of the compression step even though they don't compress data themselves.

So the compromise I picked here is to add dset.filter_names for a more complete view of the filter pipeline, but to leave dset.compression working as before for the three names you can pass to grp.create_dataset(..., compression=x). I think returning 'unknown' for other filters makes it clearer that the value can't be passed back in. We could make it guess when there's just one unknown filter, though.
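
For instance (using hdf5plugin's LZ4 filter as an illustrative plugin; the exact name string reported would depend on how the filter registers itself):

    import h5py
    import hdf5plugin  # assumption: installed
    import numpy as np

    with h5py.File("t.h5", "w") as f:
        d = f.create_dataset("z", data=np.zeros(1000), **hdf5plugin.LZ4())
        print(d.compression)   # 'unknown' with this PR (None on master)
        print(d.filter_names)  # whatever name the LZ4 plugin registered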

> Would something like Dataset.filters instead of filter_ids and filter_names work?

It can certainly work technically, but my guess is that for the high-level API, having ways to get just the names or just the IDs makes more sense. So long as the stored names are meaningful, the two should give you the same information, but the names are convenient for human understanding, whereas the IDs are more reliable for writing code (e.g. if BLOSC_ID in dset.filter_ids).
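
For example, a check by registered filter ID might look like this (32001 is the HDF5 filter ID registered for Blosc):

    BLOSC_ID = 32001  # HDF5's registered filter ID for Blosc

    if BLOSC_ID in dset.filter_ids:
        print("Blosc appears in the filter pipeline")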

I think the filter settings can be left to the low-level API. It's not much extra code to access them:

dset.id.get_create_plist().get_filter(0)
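
In the low-level API, get_filter() returns a tuple of the filter code, flags, client-data values and name, so listing everything is a short loop:

    plist = dset.id.get_create_plist()
    for i in range(plist.get_nfilters()):
        code, flags, values, name = plist.get_filter(i)
        print(code, flags, values, name)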

Successfully merging this pull request may close these issues.

Dataset.compression is None when using hdf5plugin compressors