
Migrate to universal-hash 0.5 #153

Closed
str4d wants to merge 2 commits

Conversation

str4d (Contributor) commented Apr 17, 2022

No description provided.

tarcieri (Member) commented Apr 20, 2022

Thanks @str4d!

@newpavlov I think we can close #148 and work off of this?

The cross-related errors occur because cross copies only the current crate (not the entire workspace) into a Docker container, so it doesn't pick up the workspace-level Cargo.toml and never sees the patch.crates-io directive.

The solutions are:

  1. switch from a patch.crates-io directive to individual universal-hash = { git = "..." } dependency directives per-crate
  2. release universal-hash v0.5 (or possibly a prerelease) and get the crate from crates.io
  3. comment out the cross CI tests for now

Given we have so few crates in this workspace, the first option seems easy enough and allows us to continue running the cross tests.
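For illustration, the first option might look roughly like the following in each member crate's Cargo.toml (a sketch, not taken from this PR; the repository URL is an assumption based on universal-hash living in the RustCrypto/traits repo, and a real change would likely pin a specific rev):

```toml
# Per-crate override, e.g. in poly1305/Cargo.toml (illustrative):
[dependencies]
universal-hash = { git = "https://github.com/RustCrypto/traits" }
```

This replaces the single workspace-level [patch.crates-io] entry, which cross cannot see because it copies only the individual crate directory into the container.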

str4d (Contributor, Author) commented Apr 21, 2022

I'm happy to do whatever is most convenient for the dependencies.

We should also test this against the AEAD crates to ensure the new traits achieve what we want. I locally tried to implement "simple" multiblock interleaving for chacha20poly1305 using this PR, and it got slower, so there is likely some more wrangling we need to do (either to the traits, or the way in which we use them to apply runtime checks).

tarcieri (Member) commented

@str4d yeah, that was my experience when I tried to implement interleaving before (i.e. it got slower).

My suspicion is that the unaligned loads/stores to SIMD registers that occur in the cipher implementations and UHFs aren't being optimized away properly (which would allow everything to flow through SIMD registers), but that's just a guess.

  } else {
-     unsafe { (*self.inner.soft).update(x) }
+     unsafe { (*self.inner.soft).proc_block(x) }
newpavlov (Member) commented Apr 21, 2022

I will need some time to properly read the changes in this PR, but the first thing that caught my eye is this part.

It looks like you misunderstood how the new API is supposed to be implemented. The idea is that update_with_backend passes to the rank-2 closure f one of the backends, selected based on runtime detection of target features, so an "autodetection backend" should not exist. In other words, you should detect target features inside update_with_backend and call f with either the CLMUL or the software backend.

For reference, you can take a look at how ChaCha20 is implemented (for now you can ignore the code for "force" features). The inner functions are used to enable target features checked earlier, which can significantly improve codegen for the body of the rank-2 closure after it gets inlined. I understand that this approach is not trivial, so feel free to ask any questions about it.

I am not surprised that you got degraded performance with the current implementation, since you effectively check the target feature on each processed block. Not only does this introduce branching, it also prevents the compiler from applying important optimizations.
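The pattern described here can be sketched with toy types. Everything below is illustrative: none of these names are the actual universal-hash 0.5 API, and the "backends" are plain scalar code rather than SIMD.

```rust
// Toy sketch of rank-2-closure backend dispatch (NOT the real
// universal-hash 0.5 API; all names are illustrative stand-ins).

trait Backend {
    fn proc_block(&mut self, block: &[u8; 16]);
}

/// Software fallback backend (toy: just sums bytes).
struct Soft {
    acc: u64,
}

impl Backend for Soft {
    fn proc_block(&mut self, block: &[u8; 16]) {
        for &b in block {
            self.acc = self.acc.wrapping_add(b as u64);
        }
    }
}

/// Stand-in for a SIMD backend; a real one would hide its body behind
/// `#[target_feature(enable = "...")]` inner functions.
struct Simd {
    acc: u64,
}

impl Backend for Simd {
    fn proc_block(&mut self, block: &[u8; 16]) {
        for &b in block {
            self.acc = self.acc.wrapping_add(b as u64);
        }
    }
}

/// The "rank-2 closure": a trait whose method is generic over the
/// backend type, so one closure value can accept any backend.
trait Closure {
    fn call<B: Backend>(self, backend: &mut B);
}

struct State {
    soft: Soft,
    simd: Simd,
    has_simd: bool, // real code would use runtime feature detection
}

impl State {
    /// The backend is selected ONCE here, before the closure runs,
    /// not once per processed block.
    fn update_with_backend(&mut self, f: impl Closure) {
        if self.has_simd {
            f.call(&mut self.simd);
        } else {
            f.call(&mut self.soft);
        }
    }
}

struct Update<'a>(&'a [[u8; 16]]);

impl<'a> Closure for Update<'a> {
    fn call<B: Backend>(self, backend: &mut B) {
        // This loop is monomorphized against one concrete backend:
        // no per-block branch on target features.
        for block in self.0 {
            backend.proc_block(block);
        }
    }
}

fn main() {
    let mut st = State {
        soft: Soft { acc: 0 },
        simd: Simd { acc: 0 },
        has_simd: false, // pretend detection failed: software path
    };
    st.update_with_backend(Update(&[[1u8; 16], [2u8; 16]]));
    println!("{}", st.soft.acc); // 16*1 + 16*2 = 48
}
```

The key point is that the feature check happens once in update_with_backend, and the closure body is compiled separately per backend type, so the hot loop contains no branching.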

str4d (Contributor, Author) replied

Indeed I did misunderstand the new traits; this makes much more sense. It would be helpful to document this on the traits themselves: all I had to go on were the trait bounds, and the fact that UhfBackend: ParBlocksSizeUser and ParBlocksSizeUser: BlockSizeUser led me to believe that the top-level struct should implement the backend.

newpavlov (Member) commented

@tarcieri

I think we can close #148 and work off of this?

Currently it contains mostly non-code changes, which are not present here. After this PR is merged I'll simply rebase it, so I think we can leave it open for now.

str4d force-pushed the universal-hash-0.5 branch 4 times, most recently from d3cc2c8 to 5947026 on April 21, 2022 at 09:35
str4d (Contributor, Author) commented Apr 21, 2022

Force-pushed to fix the dependency issue, and implement the UhfBackend traits as intended.

tarcieri (Member) commented

@str4d looks like everything passed except rustfmt.

Did that help performance in the full AEAD construction?

str4d (Contributor, Author) commented Apr 21, 2022

@tarcieri I posted benchmarks here: RustCrypto/AEADs#414 (comment)

With this PR, the universal-hash 0.5 upgrade makes things significantly faster, presumably because we now check backends twice per encryption call (once each for the stream cipher and the UHF) instead of once per block. Small decryptions are slightly slower, but large encryptions and decryptions are significantly faster.

However, implementing one-pass encryption on top of that still causes a huge (up to 90%) slowdown. To fix that I presume we'd need to lift the backend selection into the chacha20poly1305 crate. So we should think about how that would interact with the new traits, but if we can't figure that out, we at least get a performance win for two-pass encryption.

Review threads on poly1305/src/backend/avx2.rs and poly1305/src/backend/soft.rs (outdated, resolved)
This commit just switches to the new traits, and pretends that the
ideal number of parallel blocks is 1 (i.e. no faster than before).
str4d (Contributor, Author) commented Apr 24, 2022

Force-pushed to fix the bytes-vs-blocks confusion.

#[target_feature(enable = "avx2")]
pub(crate) unsafe fn compute_par_blocks(&mut self, blocks: &ParBlocks) {
    assert!(self.partial_block.is_none());
    assert_eq!(self.num_cached_blocks, 0);
str4d (Contributor, Author) commented Apr 24, 2022

Hmm, now that I've fixed the bytes-vs-blocks confusion and the parallel blocks code is actually triggering, AEAD tests fail with this branch because the AD is added to the MAC before the ciphertext, simply by calling UniversalHash::update_padded twice:

        self.mac.update_padded(associated_data);

        // TODO(tarcieri): interleave encryption with Poly1305
        // See: <https://github.com/RustCrypto/AEADs/issues/74>
        self.cipher.apply_keystream(buffer);
        self.mac.update_padded(buffer);

Unless the AD length is [(4k + 3) * BLOCK_SIZE..(4k + 4) * BLOCK_SIZE], the parallel blocks won't line up.

We can handle this case specifically in the AEAD, but we should think about a better way to handle it in the traits, because calling UniversalHash::update_padded more than once is something users can easily assume should work (as we did here). Currently users will just hit panics when their platform supports parallelism; but if we handle this misalignment by falling back to the non-parallelized path, it becomes very easy to accidentally never trigger the parallelism.
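The fallback idea can be sketched in miniature (toy code, not the actual poly1305 internals; the field name echoes the snippet above, but the logic is purely illustrative): track how many blocks are cached, drain to alignment with the single-block path, and only take the wide path when aligned, instead of asserting.

```rust
// Toy sketch of falling back to the single-block path when the state
// is unaligned, rather than panicking (names are illustrative).

const PAR_BLOCKS: usize = 4;

struct State {
    num_cached_blocks: usize,
    processed: usize,
}

impl State {
    /// Single-block path: always valid, but slower per block.
    fn proc_block(&mut self) {
        self.processed += 1;
        self.num_cached_blocks = (self.num_cached_blocks + 1) % PAR_BLOCKS;
    }

    /// Wide (e.g. AVX2) path: only valid when the cache is aligned.
    fn proc_par_blocks(&mut self) {
        assert_eq!(self.num_cached_blocks, 0);
        self.processed += PAR_BLOCKS;
    }

    fn update(&mut self, num_blocks: usize) {
        let mut remaining = num_blocks;
        // Drain cached blocks one at a time until aligned.
        while self.num_cached_blocks != 0 && remaining > 0 {
            self.proc_block();
            remaining -= 1;
        }
        // Take the parallel path for full chunks...
        while remaining >= PAR_BLOCKS {
            self.proc_par_blocks();
            remaining -= PAR_BLOCKS;
        }
        // ...and the single-block path for the tail.
        while remaining > 0 {
            self.proc_block();
            remaining -= 1;
        }
    }
}

fn main() {
    let mut st = State { num_cached_blocks: 0, processed: 0 };
    st.update(3); // e.g. the AD: leaves 3 cached blocks (unaligned)
    st.update(6); // drains 1 block, one 4-wide chunk, then 1 single
    println!("{}", st.processed); // 9
}
```

This is exactly the trade-off described above: a second update_padded call on an unaligned state silently runs some blocks through the slower path instead of panicking.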

str4d (Contributor, Author) replied

Added a suggested change to UhfBackend: RustCrypto/traits#965 (comment)

If the Poly1305 state is unaligned (a previous update left the AVX2
backend with 1-3 cached blocks), this falls back to the single-block
API that incurs an additional copy per block.
str4d (Contributor, Author) commented Apr 24, 2022

Force-pushed to use a fallback instead of panicking on unaligned updates.

tarcieri (Member) commented

Closing in favor of #155

@tarcieri tarcieri closed this Jul 17, 2022

3 participants