Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zstd: ZSTD_getDecompressedSize is obsolete and used incorrectly. #499

Open
mkitti opened this issue Dec 20, 2023 · 7 comments
Open

Zstd: ZSTD_getDecompressedSize is obsolete and used incorrectly. #499

mkitti opened this issue Dec 20, 2023 · 7 comments

Comments

@mkitti
Copy link
Contributor

mkitti commented Dec 20, 2023

numcodecs currently treats a return value of 0 from ZSTD_getDecompressedSize as an input error. A value of zero could mean one of the following.

  1. empty
  2. unknown
  3. error

dest_size = ZSTD_getDecompressedSize(source_ptr, source_size)
if dest_size == 0:
raise RuntimeError('Zstd decompression error: invalid input data')

Rather numcodecs should use ZSTD_getFrameContentSize which the return value can be differentiated.

  1. 0 means empty
  2. 0xffffffffffffffff, ZSTD_CONTENTSIZE_UNKNOWN, means unknown
  3. 0xfffffffffffffffe, ZSTD_CONTENTSIZE_ERROR, means error

See zstd.h or the manual for a reference.
https://github.com/facebook/zstd/blob/7cf62bc274105f5332bf2d28c57cb6e5669da4d8/lib/zstd.h#L195-L203
https://facebook.github.io/zstd/zstd_manual.html

This error arose during the implementation of Zstandard in n5-zarr:
saalfeldlab/n5-zarr#35

There the compressor was producing blocks which would return ZSTD_CONTENTSIZE_UNKNOWN. ZSTD_getDecompressedSize would return 0 and numcodecs would incorrectly interpret this as an error.

Handling ZSTD_CONTENTSIZE_UNKNOWN may be difficult.

  1. If a dest buffer is provided, then perhaps that should we set as the expected decompressed size and an error should occur if the decompressed size is not that.
  2. If a dest buffer is not provided, we may need to either use a default or use the streaming API to build an growing buffer until all the data is decompressed.
@mkitti
Copy link
Contributor Author

mkitti commented Dec 20, 2023

@d-v-b
Copy link
Contributor

d-v-b commented Dec 20, 2023

what if we just added python-zstandard as a dependency

@mkitti
Copy link
Contributor Author

mkitti commented Dec 20, 2023

We probably should. My suggestion is that we break numcodecs up into smaller packages focusing on individual upstream codecs and then combine them via some metapackage.

It would also be good if we had a mechanism to manage multiple implementations of the same codec.

@martindurant
Copy link
Member

I have mentioned this before, but cramjam includes many of these standard compressors, and already compiles and builds everything in latest rust packages. Simply depending on them would simplify this repo a lot.

@mkitti
Copy link
Contributor Author

mkitti commented Dec 20, 2023

cramjam does not do blosc though. I suppose we could just depend on https://github.com/Blosc/python-blosc2 .

I think we may need a new repository or Github organization for this.

Could someone sketch out an optional dependency package structure and how we make that work both in pip and conda or prefix?

@martindurant
Copy link
Member

cramjam does not do blosc though

Correct, but it does deal with issues like this specific one. And indeed, I'd be happy to delegate blosc to someone else too ( or push on milesgranger/cramjam#110 ).

@joshmoore
Copy link
Member

This is sounding a bit as if a WG like the zarr-python-refactoring WG might also be appropriate here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants