Migrate to universal-hash 0.5
#153
Conversation
force-pushed from a256a7e to bd0f0ac (Compare)
Thanks @str4d! @newpavlov I think we can close #148 and work off of this? The solutions are:

Given we have so few crates in this workspace, the 1st option seems easy enough and allows us to continue to run the
I'm happy to do whatever is most convenient for the dependencies. We should also test this against the AEAD crates to ensure the new traits achieve what we want. I locally tried to implement "simple" multiblock interleaving for
@str4d yeah, that was my experience from before when trying to implement interleaving (i.e. it got slower). My suspicion is that the unaligned loads/stores to SIMD registers that occur in the cipher implementations and UHFs aren't being optimized away properly (which would allow everything to flow through SIMD registers), but that's just a guess.
polyval/src/backend/autodetect.rs (outdated)

} else {
    unsafe { (*self.inner.soft).update(x) }
    unsafe { (*self.inner.soft).proc_block(x) }
I will need some time to properly read the changes in this PR, but the first thing that caught my eye is this part.

It looks like you misunderstood how the new API is supposed to be implemented. The idea is that update_with_backend passes to the rank-2 closure f one of the backends, selected based on runtime detection of target features, so an "autodetection backend" should not exist. In other words, you should detect target features inside update_with_backend and call f with either the CLMUL or the software backend.

For reference, you can take a look at how ChaCha20 is implemented (for now you can ignore the code for the "force" features). The inner functions are used to enable the target features checked earlier, which can significantly improve codegen for the body of the rank-2 closure after it gets inlined. I understand that this approach is not trivial, so feel free to ask any questions about it.

I am not surprised that you got degraded performance with the current implementation, since you effectively check target features on each processed block. Not only does this introduce branching, it also prevents the compiler from applying important optimizations.
Indeed I did misunderstand the new traits; this makes much more sense. It would be helpful to document this on the traits themselves, as all I had to go off were the trait bounds, and the fact that UhfBackend: ParBlocksSizeUser and ParBlocksSizeUser: BlockSizeUser is what led me to believe that the top-level struct should implement the backend.
force-pushed from d3cc2c8 to 5947026 (Compare)
Force-pushed to fix the dependency issue, and implement the
@str4d looks like everything passed except rustfmt. Did that help performance in the full AEAD construction?
@tarcieri I posted benchmarks here: RustCrypto/AEADs#414 (comment). However, implementing one-pass encryption on top of that still causes a huge (up to 90%) slowdown. To fix that, I presume we'd need to lift the backend selection into the
This commit just switches to the new traits, and pretends that the ideal number of parallel blocks is 1 (i.e. no faster than before).
force-pushed from 5947026 to e62c241 (Compare)
Force-pushed to fix the bytes-vs-blocks confusion.
#[target_feature(enable = "avx2")]
pub(crate) unsafe fn compute_par_blocks(&mut self, blocks: &ParBlocks) {
    assert!(self.partial_block.is_none());
    assert_eq!(self.num_cached_blocks, 0);
Hmm, now that I've fixed the bytes-vs-blocks confusion and the parallel blocks code is actually triggering, AEAD tests fail with this branch because the AD is added to the MAC before the ciphertext, simply by calling UniversalHash::update_padded twice:

self.mac.update_padded(associated_data);
// TODO(tarcieri): interleave encryption with Poly1305
// See: <https://github.com/RustCrypto/AEADs/issues/74>
self.cipher.apply_keystream(buffer);
self.mac.update_padded(buffer);

Unless the AD length is in [(4k + 3) * BLOCK_SIZE..(4k + 4) * BLOCK_SIZE], the parallel blocks won't line up.

We can handle this case specifically in the AEAD, but we should think about a way to handle this better in the traits, as calling UniversalHash::update_padded more than once is something users can easily assume should work (as we do here). Currently users will just hit panics when their platform supports parallelism, but if we handled this off-by-one by falling back to the non-parallelized path, then it becomes very easy to not trigger the parallelism by accident.
Added a suggested change to UhfBackend: RustCrypto/traits#965 (comment)

If the Poly1305 state is unaligned (a previous update left the AVX2 backend with 1-3 cached blocks), this falls back to the single-block API, which incurs an additional copy per block.
force-pushed from e62c241 to 88eca39 (Compare)
Force-pushed to use a fallback instead of panicking on unaligned updates.
Closing in favor of #155