Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dictionary compression support. #10

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

thoren-d
Copy link

LZ4 supports using a dictionary when processing data, which can make a big difference on small messages/files. This change exposes the relevant functions in lz4-sys, and extends Encoder and Decoder to support using dictionaries.

Thoren Paulson and others added 2 commits June 22, 2020 19:33
LZ4 supports using a dictionary when processing data, which can make
a big difference on small messages/files. This change exposes the
relevant functions in lz4-sys, and extends Encoder and Decoder to
support using dictionaries.
@thoren-d
Copy link
Author

@pmarks Please take a look, thanks.

@pmarks
Copy link

pmarks commented Sep 23, 2020

@thoren-d, thanks for the contribution! In one of my use cases I hold an lz4::Decoder in a struct - with this change I'm forced to introduce a new lifetime parameter, even though I'm not using a dictionary -- this seems like a high cost to pay for an optional feature. I'm wondering if there's an alternative design that avoids this for the non-dictionary case?

How big are the dictionaries, typically? One option would be that the Decoder could hold a private copy of the dictionary so that there's no new lifetime.

Another option would be to make a new type DictDecoder that you use if you've got a dictionary. Of course that makes it a lot harder to choose whether or not to use a dictionary at runtime.

Any thoughts?

@thoren-d
Copy link
Author

thoren-d commented Sep 24, 2020

The dictionaries can be up to 64 KiB so it would be inefficient to make copies of them, especially considering they're most useful when decoding/encoding many small messages.

My thoughts were to allow Encoder and Decoder to have shared ownership of their respective dictionaries. We could do this with either

  1. A concrete type. Arc<[u8]> and Arc<EncoderDictionary> seem the most flexible here. That way a single dictionary can be used by multiple threads to decode or encode numerous messages. I've updated this PR to use this approach.
  2. A generic trait, such as Borrow<[u8]> and Borrow<EncoderDictionary>. This has some more flexibility for users, who could store the dictionary any way they want. However, this would be a breaking API change also, as you'd need to specify what kind of dictionary storage you want anywhere type inference doesn't apply.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants