Various changes. #91

ggriffiniii · 2018-12-15T00:56:17Z

This is a large pull request and it makes a number of individual changes. Feel free to pick and choose the pieces that you like and disregard anything that you don't. Along with some small cleanups, this moves to the trait based configuration I outlined in #89, where each of the standard configs are a zero sized type, and adds a CustomConfig, as requested in #87, to work with arbitrary alphabets and padding (well almost arbitrary, they still need to be ascii characters). It also implements SIMD encoding and decoding for x86/x86_64. The SIMD operations slow down encoding/decoding 3 byte buffers by <5%, but show a considerable speedup for any buffer larger than that in the benchmarks.

On my avx2 machine, I see a 3x improvement in throughput when encoding a 3k buffer, and a 5x improvement when decoding a 3k buffer.

This was just a learning exercise for me to get some hands on experience with rust, I'm not invested in any of these changes.

Prior to this commit a variety of property based tests existed and used a manual approach, consisting of filling vectors with random data and choosing random configs using the rand crate. This moves to using quickcheck to handle a lot of the grunt work that was previously done manually. In addition to the simpler code quickcheck also provides "shrinking" of input data to reduce test cases when possible. This can make troubleshooting test failures much simpler. In the process of refactoring the tests it became clear that there exists a lot of redundancy. I took the liberty of removing a variety of test cases that entirely overlapped with an already existing case. I believe the end result is net gain in terms of readability and maintenance of the test cases without any loss of code coverage.

decode_config invokes decode_config_buf which resizes the vector to the correct size. There is no reason for decode_config to attempt to pre-allocate the buffer. In the best case it moves the allocation from within decode_config_buf to within decode_config, but unfortunately there is a bug in the calculation used by decode_config that means it underallocates for tiny buffers (requiring decode_config_buf to allocate again), and allocates way more memory than requires for larger buffers.

The configuration consists of 3 traits. Encoding: Determine how to encode a 6-bit value into an 8-bit value. Decoding: Determine how to decode an 8-bit value into a 6-bit value. Padding: Whether to use padding and if so which character to use. These traits are currently sealed so they cannot be implemented outside this crate. This could be changed in the future, but until the trait definitions get a little more flushed out it seemed wise to not have users relying on the current definition. Each of the 3 supported alphabets defines it's own zero sized type and implements the traits. Public API methods are now converted to be generically bounded by the traits which will monomorphize each function specifically for the desired configuration. This results in a small performance improvement. This method also provides a path to supporting custom alphabets in a reasonable way. That functionality will be done in a future commit.

There is now a CustomConfig type. It supports custom alphabets and padding and can be built with a builder pattern. // A custom config that matches the builtin Crypt alphabet. let custom_config = CustomConfigBuilder::with_alphabet( b"./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" ) .with_padding(b'=') .build()?; The alphabet and padding provided must be ascii characters. This limitation is in place to ensure that encoded values are valid UTF-8 and can therefore be represented as rust String values. An immutable pointer to CustomConfig implements Encoding + Decoding + Padding. This allows it to be used in contexts where the standard builtin configs are used. Creating a CustomConfig is somewhat heavyweight and the resulting type is somewhat large (~2.5KB). It is expected that building the config would be early in program execution and that an immutable reference is what will be efficiently passed around. i.e. The configuration would be a good candidate to be used with lazy_static.

This is in anticipation for further submodules within each module.

This submodule will contain the logic and methods for encoding data in larger block sizes.

This submodule will contain the logic and methods for decoding data in larger block sizes. This change also refactors the decode_helper to no longer have baked-in assumptions about the block size used when decoding.

This implements SSE and AVX2 encoding, and AVX2 decoding on x86 and x86_64 for the standard configurations. SSE decoding was slower than the scalar implementation on the machines I have so I did not implement it. Custom configurations always use the scalar configuration.

Previously EncoderWriter stored a mutable reference to the generic writer, which seems intuitive but I believe is an anti-pattern. Instead, it's more flexible to take ownership of the underlying writer. The standard library has a blanket impl<W> Write for &mut W. This allows users that want to retain ownership of the underlying writer to pass a mutable reference into the EncoderWriter, but users that want to relinquish ownership can pass the underlying writer by value and it will be dropped when EncoderWriter is dropped.

marshallpierce · 2018-12-15T00:58:51Z

Thanks, I'll dig into this and your other changes once #90 is dealt with (hopefully this weekend).

ggriffiniii · 2019-07-28T05:11:38Z

I'm going to close this pull request as it's probably suffered bitrot over the months. If you're still interested in implementing this there is an improved version in my base64 library radix64 https://github.com/ggriffiniii/radix64

Feel free to take any bits you like.

marshallpierce · 2019-07-28T13:17:24Z

Thanks for your contributions; I will do that!

ggriffiniii added 10 commits December 14, 2018 16:43

Another run of cargo fmt.

f2eff94

Move encode and decode modules into subdirectories.

c959f65

This is in anticipation for further submodules within each module.

Add a encode::block submodule.

3c66d8f

This submodule will contain the logic and methods for encoding data in larger block sizes.

Add a decode::block submodule.

33f7d16

This submodule will contain the logic and methods for decoding data in larger block sizes. This change also refactors the decode_helper to no longer have baked-in assumptions about the block size used when decoding.

ggriffiniii mentioned this pull request Apr 8, 2019

It's hard to use our own character set tables. How about making encode and decode modules public? #87

Closed

ggriffiniii closed this Jul 28, 2019

ggriffiniii mentioned this pull request Aug 14, 2019

Use radix64 rather than the base64 crate. seanmonstar/reqwest#577

Closed

marshallpierce mentioned this pull request Mar 31, 2020

Path to v1.0? #130

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Various changes. #91

Various changes. #91

ggriffiniii commented Dec 15, 2018

marshallpierce commented Dec 15, 2018

ggriffiniii commented Jul 28, 2019

marshallpierce commented Jul 28, 2019

Various changes. #91

Various changes. #91

Conversation

ggriffiniii commented Dec 15, 2018

marshallpierce commented Dec 15, 2018

ggriffiniii commented Jul 28, 2019

marshallpierce commented Jul 28, 2019