Add dictionary compression support. #10

thoren-d · 2020-06-23T02:46:00Z

LZ4 supports using a dictionary when processing data, which can make a big difference on small messages/files. This change exposes the relevant functions in lz4-sys, and extends Encoder and Decoder to support using dictionaries.

thoren-d · 2020-09-16T00:11:32Z

@pmarks Please take a look, thanks.

pmarks · 2020-09-23T00:04:52Z

@thoren-d, thanks for the contribution! In one of my use cases I hold an lz4::Decoder in a struct - with this change I'm forced to introduce a new lifetime parameter, even though I'm not using a dictionary -- this seems like a high cost to pay for an optional feature. I'm wondering if there's an alternative design that avoids this for the non-dictionary case?

How big are the dictionaries, typically? One option would be that the Decoder could hold a private copy of the dictionary so that there's no new lifetime.

Another option would be to make a new type DictDecoder that you use if you've got a dictionary. Of course that makes it a lot harder to choose whether or not to use a dictionary at runtime.

Any thoughts?
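
To make the trade-off concrete, here is a minimal sketch of how a borrowed dictionary pushes a lifetime parameter onto any struct that holds the decoder, while a private copy does not. All type and method names below (`BorrowingDecoder`, `OwningDecoder`, `StreamWithLifetime`, `Stream`, `without_dict`, `with_dict`, `dict_len`) are hypothetical stand-ins, not the actual lz4 API.

```rust
// Hypothetical stand-ins for illustration; these are not the real lz4 types.

/// Borrows its dictionary, so the type carries a lifetime parameter.
struct BorrowingDecoder<'d> {
    dict: Option<&'d [u8]>,
}

/// Owns a private copy of its dictionary, so no lifetime is needed.
struct OwningDecoder {
    dict: Option<Vec<u8>>,
}

/// A holder of the borrowing decoder must declare `'d` as well,
/// even if it never supplies a dictionary.
struct StreamWithLifetime<'d> {
    decoder: BorrowingDecoder<'d>,
}

/// A holder of the owning decoder stays lifetime-free.
struct Stream {
    decoder: OwningDecoder,
}

impl<'d> StreamWithLifetime<'d> {
    fn without_dict() -> Self {
        StreamWithLifetime { decoder: BorrowingDecoder { dict: None } }
    }

    fn dict_len(&self) -> usize {
        self.decoder.dict.map_or(0, |d| d.len())
    }
}

impl Stream {
    /// Copies the dictionary bytes into the decoder (up to 64 KiB per copy).
    fn with_dict(dict: &[u8]) -> Self {
        Stream { decoder: OwningDecoder { dict: Some(dict.to_vec()) } }
    }

    fn dict_len(&self) -> usize {
        self.decoder.dict.as_ref().map_or(0, |d| d.len())
    }
}

fn main() {
    let no_dict = StreamWithLifetime::without_dict();
    let dict = vec![7u8; 32];
    let copied = Stream::with_dict(&dict);
    drop(dict); // the private copy outlives the original buffer
    println!("{} {}", no_dict.dict_len(), copied.dict_len());
}
```

The owning variant keeps the holder's signature simple, at the price of one copy of the dictionary per decoder.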

thoren-d · 2020-09-24T05:52:14Z

The dictionaries can be up to 64 KiB so it would be inefficient to make copies of them, especially considering they're most useful when decoding/encoding many small messages.

My thoughts were to allow Encoder and Decoder to share ownership of their respective dictionaries. We could do this with either:

1. A concrete type. Arc<[u8]> and Arc<EncoderDictionary> seem the most flexible here. That way a single dictionary can be used by multiple threads to decode or encode numerous messages. I've updated this PR to use this approach.
2. A generic trait bound, such as Borrow<[u8]> and Borrow<EncoderDictionary>. This gives users more flexibility, since they could store the dictionary any way they want. However, it would also be a breaking API change, as you'd need to specify what kind of dictionary storage you want anywhere type inference doesn't apply.
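
The two options can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `ArcDecoder`, `GenericDecoder`, `dict_len`, and `shared_dict_lens` are hypothetical names, and the decoders here only store the dictionary rather than decode anything.

```rust
use std::borrow::Borrow;
use std::sync::Arc;
use std::thread;

// Hypothetical decoder for option 1: shared ownership through Arc<[u8]>.
struct ArcDecoder {
    dict: Option<Arc<[u8]>>,
}

impl ArcDecoder {
    fn new(dict: Option<Arc<[u8]>>) -> Self {
        ArcDecoder { dict }
    }

    fn dict_len(&self) -> usize {
        self.dict.as_ref().map_or(0, |d| d.len())
    }
}

// Hypothetical decoder for option 2: generic over the dictionary storage.
struct GenericDecoder<D: Borrow<[u8]>> {
    dict: D,
}

impl<D: Borrow<[u8]>> GenericDecoder<D> {
    fn dict_len(&self) -> usize {
        // Annotate the target type so the Borrow<[u8]> impl is selected.
        let bytes: &[u8] = self.dict.borrow();
        bytes.len()
    }
}

/// Spawns `workers` threads that each build a decoder over the same
/// dictionary allocation; only the Arc pointer is cloned, not the bytes.
fn shared_dict_lens(dict: Arc<[u8]>, workers: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let d = Arc::clone(&dict);
            thread::spawn(move || ArcDecoder::new(Some(d)).dict_len())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // Option 1: one 64 KiB dictionary shared by four threads.
    let dict: Arc<[u8]> = Arc::from(vec![0u8; 64 * 1024]);
    let lens = shared_dict_lens(Arc::clone(&dict), 4);
    println!("{:?}", lens);

    // Option 2: the caller picks the storage, and the type parameter
    // becomes part of every signature that names the decoder.
    let by_vec: GenericDecoder<Vec<u8>> = GenericDecoder { dict: vec![1, 2, 3] };
    let by_arc: GenericDecoder<Arc<[u8]>> = GenericDecoder { dict };
    println!("{} {}", by_vec.dict_len(), by_arc.dict_len());
}
```

With the concrete `Arc` type, a decoder without a dictionary needs no extra annotations, whereas the generic form makes the storage type visible wherever the decoder type is written out.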

Thoren Paulson and others added 2 commits June 22, 2020 19:33

Merge pull request #1 from 10XGenomics/master

a0d0346

Merge 10XGenomics fork.

Add support for dictionary compression.

c4f4fe0

LZ4 supports using a dictionary when processing data, which can make a big difference on small messages/files. This change exposes the relevant functions in lz4-sys, and extends Encoder and Decoder to support using dictionaries.

thoren-d and others added 3 commits September 23, 2020 22:21

Replace references with Arc<...>.

a80959f

libc doesn't define C types for non-WASI wasm, so define them manually

61a6262

Fix merge conflicts.

c888cd7
