
Migrate to universal-hash 0.5 #153

Closed
str4d wants to merge 2 commits

Conversation

str4d (Contributor) commented Apr 17, 2022

No description provided.

tarcieri (Member) commented Apr 20, 2022

Thanks @str4d!

@newpavlov I think we can close #148 and work off of this?

The cross-related errors occur because cross copies only the current crate (not the entire workspace) into a Docker container, so it doesn't pick up the workspace-level Cargo.toml and never sees the patch.crates-io directive.

The solutions are:

  1. switch from a patch.crates-io directive to individual universal-hash = { git = "..." } dependency directives per-crate
  2. release universal-hash v0.5 (or possibly a prerelease) and get the crate from crates.io
  3. comment out the cross CI tests for now

Given we have so few crates in this workspace, the first option seems easy enough and allows us to continue running the cross tests.
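For illustration, the first option might look roughly like the following in each member crate's Cargo.toml (a sketch, not taken from this PR; the repository URL is an assumption based on universal-hash living in the RustCrypto/traits repo, and a real change would likely pin a specific rev):

```toml
# Per-crate override, e.g. in poly1305/Cargo.toml (illustrative):
[dependencies]
universal-hash = { git = "https://github.com/RustCrypto/traits" }
```

This replaces the single workspace-level [patch.crates-io] entry, which cross cannot see because it copies only the individual crate directory into the container.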

str4d (Contributor, Author) commented Apr 21, 2022

I'm happy to do whatever is most convenient for the dependencies.

We should also test this against the AEAD crates to ensure the new traits achieve what we want. I locally tried to implement "simple" multiblock interleaving for chacha20poly1305 using this PR, and it got slower, so there is likely some more wrangling we need to do (either to the traits, or the way in which we use them to apply runtime checks).

tarcieri (Member) commented

@str4d yeah, that was my experience when I tried to implement interleaving before (i.e. it got slower).

My suspicion is that the unaligned loads/stores to SIMD registers that occur in the cipher implementations and UHFs aren't being optimized away properly (which would allow everything to flow through SIMD registers), but that's just a guess.

  } else {
-     unsafe { (*self.inner.soft).update(x) }
+     unsafe { (*self.inner.soft).proc_block(x) }
newpavlov (Member) commented Apr 21, 2022

I will need some time to properly read the changes in this PR, but the first thing that caught my eye is this part.

It looks like you misunderstood how the new API is supposed to be implemented. The idea is that update_with_backend passes to the rank-2 closure f one of the backends, selected based on runtime detection of target features, so an "autodetection backend" should not exist. In other words, you should detect target features inside update_with_backend and call f with either the CLMUL or the software backend.

For reference, you can take a look at how ChaCha20 is implemented (for now you can ignore the code for "force" features). The inner functions are used to enable target features checked earlier, which can significantly improve codegen for the body of the rank-2 closure after it gets inlined. I understand that this approach is not trivial, so feel free to ask any questions about it.

I am not surprised that you got degraded performance with the current implementation, since you effectively check the target feature on each processed block. Not only does this introduce branching, it also prevents the compiler from applying important optimizations.
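The pattern described here can be sketched with toy types. Everything below is illustrative: none of these names are the actual universal-hash 0.5 API, and the "backends" are plain scalar code rather than SIMD.

```rust
// Toy sketch of rank-2-closure backend dispatch (NOT the real
// universal-hash 0.5 API; all names are illustrative stand-ins).

trait Backend {
    fn proc_block(&mut self, block: &[u8; 16]);
}

/// Software fallback backend (toy: just sums bytes).
struct Soft {
    acc: u64,
}

impl Backend for Soft {
    fn proc_block(&mut self, block: &[u8; 16]) {
        for &b in block {
            self.acc = self.acc.wrapping_add(b as u64);
        }
    }
}

/// Stand-in for a SIMD backend; a real one would hide its body behind
/// `#[target_feature(enable = "...")]` inner functions.
struct Simd {
    acc: u64,
}

impl Backend for Simd {
    fn proc_block(&mut self, block: &[u8; 16]) {
        for &b in block {
            self.acc = self.acc.wrapping_add(b as u64);
        }
    }
}

/// The "rank-2 closure": a trait whose method is generic over the
/// backend type, so one closure value can accept any backend.
trait Closure {
    fn call<B: Backend>(self, backend: &mut B);
}

struct State {
    soft: Soft,
    simd: Simd,
    has_simd: bool, // real code would use runtime feature detection
}

impl State {
    /// The backend is selected ONCE here, before the closure runs,
    /// not once per processed block.
    fn update_with_backend(&mut self, f: impl Closure) {
        if self.has_simd {
            f.call(&mut self.simd);
        } else {
            f.call(&mut self.soft);
        }
    }
}

struct Update<'a>(&'a [[u8; 16]]);

impl<'a> Closure for Update<'a> {
    fn call<B: Backend>(self, backend: &mut B) {
        // This loop is monomorphized against one concrete backend:
        // no per-block branch on target features.
        for block in self.0 {
            backend.proc_block(block);
        }
    }
}

fn main() {
    let mut st = State {
        soft: Soft { acc: 0 },
        simd: Simd { acc: 0 },
        has_simd: false, // pretend detection failed: software path
    };
    st.update_with_backend(Update(&[[1u8; 16], [2u8; 16]]));
    println!("{}", st.soft.acc); // 16*1 + 16*2 = 48
}
```

The key point is that the feature check happens once in update_with_backend, and the closure body is compiled separately per backend type, so the hot loop contains no branching.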

str4d (Contributor, Author) replied

Indeed I did misunderstand the new traits; this makes much more sense. It would be helpful to document this on the traits themselves: all I had to go on were the trait bounds, and the fact that UhfBackend: ParBlocksSizeUser and ParBlocksSizeUser: BlockSizeUser led me to believe that the top-level struct should implement the backend.

newpavlov (Member) commented

@tarcieri

I think we can close #148 and work off of this?

Currently it contains mostly non-code changes, which are not present here. After this PR is merged I'll simply rebase it, so I think we can leave it open for now.

str4d force-pushed the universal-hash-0.5 branch 4 times, most recently from d3cc2c8 to 5947026 on April 21, 2022 at 09:35
str4d (Contributor, Author) commented Apr 21, 2022

Force-pushed to fix the dependency issue, and implement the UhfBackend traits as intended.

tarcieri (Member) commented

@str4d looks like everything passed except rustfmt.

Did that help performance in the full AEAD construction?

str4d (Contributor, Author) commented Apr 21, 2022

@tarcieri I posted benchmarks here: RustCrypto/AEADs#414 (comment)

With this PR, the universal-hash 0.5 upgrade makes things significantly faster, presumably because we now check backends twice per encryption call (once each for the stream cipher and the UHF) instead of once per block. Small decryptions are slightly slower, but large encryptions and decryptions are significantly faster.

However, implementing one-pass encryption on top of that still causes a huge (up to 90%) slowdown. To fix that I presume we'd need to lift the backend selection into the chacha20poly1305 crate. So we should think about how that would interact with the new traits, but if we can't figure that out, we at least get a performance win for two-pass encryption.

Review threads on poly1305/src/backend/avx2.rs and poly1305/src/backend/soft.rs (outdated, resolved)
This commit just switches to the new traits, and pretends that the
ideal number of parallel blocks is 1 (i.e. no faster than before).
str4d (Contributor, Author) commented Apr 24, 2022

Force-pushed to fix the bytes-vs-blocks confusion.

#[target_feature(enable = "avx2")]
pub(crate) unsafe fn compute_par_blocks(&mut self, blocks: &ParBlocks) {
    assert!(self.partial_block.is_none());
    assert_eq!(self.num_cached_blocks, 0);
str4d (Contributor, Author) commented Apr 24, 2022

Hmm, now that I've fixed the bytes-vs-blocks confusion and the parallel blocks code is actually triggering, AEAD tests fail with this branch because the AD is added to the MAC before the ciphertext, simply by calling UniversalHash::update_padded twice:

        self.mac.update_padded(associated_data);

        // TODO(tarcieri): interleave encryption with Poly1305
        // See: <https://github.com/RustCrypto/AEADs/issues/74>
        self.cipher.apply_keystream(buffer);
        self.mac.update_padded(buffer);

Unless the AD length is [(4k + 3) * BLOCK_SIZE..(4k + 4) * BLOCK_SIZE], the parallel blocks won't line up.

We can handle this case specifically in the AEAD, but we should think about a better way to handle it in the traits, because calling UniversalHash::update_padded more than once is something users can easily assume should work (as we did here). Currently users will just hit panics when their platform supports parallelism; but if we handle this misalignment by falling back to the non-parallelized path, it becomes very easy to accidentally never trigger the parallelism.
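The fallback idea can be sketched in miniature (toy code, not the actual poly1305 internals; the field name echoes the snippet above, but the logic is purely illustrative): track how many blocks are cached, drain to alignment with the single-block path, and only take the wide path when aligned, instead of asserting.

```rust
// Toy sketch of falling back to the single-block path when the state
// is unaligned, rather than panicking (names are illustrative).

const PAR_BLOCKS: usize = 4;

struct State {
    num_cached_blocks: usize,
    processed: usize,
}

impl State {
    /// Single-block path: always valid, but slower per block.
    fn proc_block(&mut self) {
        self.processed += 1;
        self.num_cached_blocks = (self.num_cached_blocks + 1) % PAR_BLOCKS;
    }

    /// Wide (e.g. AVX2) path: only valid when the cache is aligned.
    fn proc_par_blocks(&mut self) {
        assert_eq!(self.num_cached_blocks, 0);
        self.processed += PAR_BLOCKS;
    }

    fn update(&mut self, num_blocks: usize) {
        let mut remaining = num_blocks;
        // Drain cached blocks one at a time until aligned.
        while self.num_cached_blocks != 0 && remaining > 0 {
            self.proc_block();
            remaining -= 1;
        }
        // Take the parallel path for full chunks...
        while remaining >= PAR_BLOCKS {
            self.proc_par_blocks();
            remaining -= PAR_BLOCKS;
        }
        // ...and the single-block path for the tail.
        while remaining > 0 {
            self.proc_block();
            remaining -= 1;
        }
    }
}

fn main() {
    let mut st = State { num_cached_blocks: 0, processed: 0 };
    st.update(3); // e.g. the AD: leaves 3 cached blocks (unaligned)
    st.update(6); // drains 1 block, one 4-wide chunk, then 1 single
    println!("{}", st.processed); // 9
}
```

This is exactly the trade-off described above: a second update_padded call on an unaligned state silently runs some blocks through the slower path instead of panicking.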

str4d (Contributor, Author) replied

Added a suggested change to UhfBackend: RustCrypto/traits#965 (comment)

If the Poly1305 state is unaligned (a previous update left the AVX2
backend with 1-3 cached blocks), this falls back to the single-block
API that incurs an additional copy per block.
str4d (Contributor, Author) commented Apr 24, 2022

Force-pushed to use a fallback instead of panicking on unaligned updates.

tarcieri (Member) commented

Closing in favor of #155

@tarcieri tarcieri closed this Jul 17, 2022

3 participants