Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving cipher parallelism #444

Open
tarcieri opened this issue Dec 30, 2020 · 7 comments
Open

Improving cipher parallelism #444

tarcieri opened this issue Dec 30, 2020 · 7 comments
Labels
cipher Block and stream cipher crate

Comments

@tarcieri
Copy link
Member

tarcieri commented Dec 30, 2020

Taking a step back from #354, I thought it'd be good to look how and where ILP and SIMD parallelism is currently used across the project as a whole, and how that could be improved.

The only place we presently have any sort of parallelism abstraction at the trait-level is BlockCipher::ParBlocks. Otherwise various crates leverage e.g. SIMD internally. Regarding BlockCipher::ParBlocks specifically, the only crate that leverages it is the aes crate.

The following crates have SIMD backends:

Ciphers

  • aes
  • chacha20

UHFs/"MACs"

  • polyval
  • poly1305

AEADs

In AEADs, we'd like to glue the above crates together in fairly fixed combinations in order to leverage ILP, passing SIMD buffers from ciphers to UHFs for authentication:

  • aes-gcm/aes-gcm-siv: aes + ghash/polyval
  • chacha20poly1305: chacha20 + poly1305

(also aes-siv and pmac, but this is less of a priority)

In either of these cases there's a single specific buffer type I think it'd be nice for both the cipher implementation and UHF to support in common:

  • aes-gcm/aes-gcm-siv: "i128x8" i.e. [__m128i; 8] on x86/x86_64
  • chacha20poly1305: "i256x4" i.e. [__m256i; 4] on x86/x86_64

Concrete proposal

My suggestion is to get rid of BlockCipher::ParBlocks and replace it with more general SIMD types and traits designed to work with them, namely:

  • Add a new utils crate e.g. simd-buffers which provides "i128x8" and "i256x4" SIMD buffer types which are backed by __m128i/__m256i on x86/x86_64 and otherwise provide a portable implementation. These types don't need to implement any sort of arithmetic, just provide wrappers for passing data between SIMD implementations.
  • Add traits to cipher and universal-hash which operate on SIMD buffers.
  • Use SIMD buffers types in the implementations of aes-gcm, aes-gcm-siv, and chacha20poly1305

cipher API suggestion

I'd suggest adding traits to cipher which use the SIMD buffer types which are useful for both block ciphers and stream ciphers.

I also think it might make sense to use a generic parameter rather than an associated type to permit support for multiple buffer types (e.g. on newer CPUs, "i128x4" might be a better option for AES, but we can support both):

/// Note that for practical purposes, we only need to support block cipher encryption,
/// but there could also be a `BlockDecryptPar` for completeness/consistency.
pub trait BlockEncryptPar<B: SimdBuffer> {
    fn encrypt_par(&self, buffer: &mut B);
}

pub trait StreamCipherPar<B: SimdBuffer> {
    fn try_apply_keystream_par(&mut self, buffer: &mut B) -> Result<(), LoopError>;
}

universal-hash API suggestion

pub trait UniversalHashPar<B: SimdBuffer> {
    fn update_par(&mut self, blocks: &B);
}

SIMD ctr support

Trying to move the end-user facing aes-ctr types into aes has created a very annoying circular dependency between the block-ciphers and stream-ciphers repo. Furthermore, ctr is quite a bit more general now than what the CTR types in the aes crate provide, and also aes doesn't actually provide the CTR "flavors" (Ctr32BE/Ctr32Le) needed by aes-gcm and aes-gcm-siv.

But really, it seems like the main benefit of the implementation in the aes crate is being able to use _mm_xor_si128 to XOR a "i128x8" type.

If we had BlockEncryptPar and StreamCipherPar traits, the ctr crate could glue the two together, accepting a SIMD buffer as input, computing the next buffer of keystream output, and XORing the latter into the former. This would allow ctr to be generally SIMD optimized, and also mean we only have one ctr implementation to worry about instead of a separate one in the AES crate.

@newpavlov
Copy link
Member

I would love to remove ParBlocks, but right now I don't see how your proposal will help us to improve our API. If anything, I think it needlessly will expose implementation detail in the form of SIMD types, without solving the core issue: with runtime detection we can have varying ParBlocks depending on target capabilities (e.g. SSE vs AVX). It was the main stumbling block for my attempt to introduce "core stream API".

I think a better direction will be to design an API which will not expose the number of parallel blocks at type level altogether. Though it probably will make some things more difficult or slightly less efficient, e.g. stacking MAC + cipher algorithms in one-pass fashion.

Note that ctr keeps only one processed block in its state. I think other crates also should not depend on ParBlocks in their states. It may reduce performance a bit if message is processed in very small chunks, but it will improve state size.

created a very annoying circular dependency between the block-ciphers and stream-ciphers repo.

Currently we have to depend on ctr to reduce code duplication around block-buffer. I think it should be fixed by introducing "core stream API" and adding block-buffer wrapper to cipher. This way aes no longer will depend on ctr, thus removing the circular dependency.

aes doesn't actually provide the CTR "flavors" (Ctr32BE/Ctr32Le) needed by aes-gcm and aes-gcm-siv.

It's a temporary state since the flavors were introduced relatively recently.

But really, it seems like the main benefit of the implementation in the aes crate is being able to use _mm_xor_si128 to XOR a "i128x8" type.

No, the main benefit is that it makes it much easier for compiler to keep data in XMM registers without spilling them to stack. At this level it can create a noticeable performance regression.

@tarcieri
Copy link
Member Author

tarcieri commented Dec 30, 2020

I think it needlessly will expose implementation detail in the form of SIMD types

Without SIMD types that can be used between crates, we wind up round tripping data to byte arrays in the form of a bunch of intermediate _mm_loadu_si128/_mm_storeu_si128 calls.

I think the best way to avoid that is have native SIMD types that allow e.g. a cipher crate to hand off a SIMD buffer directly to a universal-hash crate.

with runtime detection we can have varying ParBlocks depending on target capabilities (e.g. SSE vs AVX).

By using a trait with a generic parameter, a cipher or UHF type can implement several SIMD buffer types, with the specific one selected at runtime.

If anything, ParBlocks is worse in this regard, since in its current form as an associated type, there can only be one ParBlocks for a given type.

it much easier for compiler to keep data in XMM registers without spilling them to stack.

I think the SIMD buffer types can provide that as well, particularly if we conditionally provide unsafe functions for operating on them with #[target_feature(enable = "...")] annotations for use in SIMD contexts, with a higher-level crate doing runtime detection of the desired SIMD feature combinations.

@newpavlov
Copy link
Member

Without SIMD types that can be used between crates, we wind up round tripping data to byte arrays in the form of a bunch of intermediate _mm_loadu_si128/_mm_storeu_si128 calls.

The compiler is allowed to remove subsequent load/stores. Though it's indeed will not be easy to design API in such way which will allow compiler to reliably do it.

ParBlocks is worse in this regard, since in its current form as an associated type, there can only be one ParBlocks for a given type.

Can't the same approach be applied to ParBlocks? It also can be a generic parameter instead of associated type.

The problem is that with runtime detection we have two runtime switches in each crate, so compiler can not keep data in registers, since it sees branches on different memory locations, so it can not merge them.

I don't think we can have a reliable solution without an ability to compile dependency crates several times with different feature flags. With such feature we would've been able to push runtime detection as high as possible without sacrificing runtime switch capabilities at lower levels. But unfortunately there is not even pre-RFC for such feature and I think the community and the lang team do not currently have enough interest in improving situation in this area.

@tarcieri
Copy link
Member Author

Can't the same approach be applied tol ParBlocks? It also can be a generic parameter instead of associated type.

Yes, but only if there were a separate trait for parallel operations. It wouldn't make sense to e.g. convert the current BlockCipher::ParBlocks associated type into a generic parameter of BlockCipher.

But then the question remains: what makes for a better SIMD buffer, a GenericArray<u8, _>, or dedicated types which can wrap SIMD registers?

I don't think using a GenericArray for this makes sense for a couple reasons:

  • As noted above, the cardinality of useful parallel buffer types is presently at 2
  • In general we have SIMD composability problems because all of the current SIMD code is written at a fairly low level, and I think this is due to a lack of shared SIMD types where we can centralize abstractions

@tarcieri tarcieri added the cipher Block and stream cipher crate label Jan 4, 2021
tarcieri added a commit to RustCrypto/utils that referenced this issue Jan 22, 2021
Implements the following SIMD types, as proposed in
RustCrypto/traits#444:

- `U128` (portable)
- `U256` (x86/x86_64 only)
- `U128x8` (portable)

These types are largely "storage only" and don't implement arithmetic
(if we needed that, `stdsimd`/`packed_simd` would be a better choice)

The implementation *does* expose optimized XOR intrinsics, however,
which seems to be the main thing useful in a portable cryptographic
context, at least as far as our current usages of SIMD go.

The `x86` backend exposes unsafe `target_feature(enable = "...")`
functions as part of its API, intended to be used/inlined within SIMD
backends for particular algorithms.
@tarcieri
Copy link
Member Author

I opened RustCrypto/utils#221 which contains a WIP prototype of a simd-buffers crate.

@tarcieri tarcieri mentioned this issue Feb 2, 2021
2 tasks
tarcieri added a commit to RustCrypto/utils that referenced this issue Feb 5, 2021
Implements the following SIMD types, as proposed in
RustCrypto/traits#444:

- `U128` (portable)
- `U256` (x86/x86_64 only)
- `U128x8` (portable)

These types are largely "storage only" and don't implement arithmetic
(if we needed that, `stdsimd`/`packed_simd` would be a better choice)

The implementation *does* expose optimized XOR intrinsics, however,
which seems to be the main thing useful in a portable cryptographic
context, at least as far as our current usages of SIMD go.

The `x86` backend exposes unsafe `target_feature(enable = "...")`
functions as part of its API, intended to be used/inlined within SIMD
backends for particular algorithms.
tarcieri added a commit to RustCrypto/utils that referenced this issue Feb 5, 2021
Implements the following SIMD types, as proposed in
RustCrypto/traits#444:

- `U128` (portable)
- `U256` (x86/x86_64 only)
- `U128x8` (portable)

These types are largely "storage only" and don't implement arithmetic
(if we needed that, `stdsimd`/`packed_simd` would be a better choice)

The implementation *does* expose optimized XOR intrinsics, however,
which seems to be the main thing useful in a portable cryptographic
context, at least as far as our current usages of SIMD go.

The `x86` backend exposes unsafe `target_feature(enable = "...")`
functions as part of its API, intended to be used/inlined within SIMD
backends for particular algorithms.
tarcieri added a commit to RustCrypto/utils that referenced this issue Feb 5, 2021
Implements the following SIMD types, as proposed in
RustCrypto/traits#444:

- `U128` (portable)
- `U256` (x86/x86_64 only)
- `U128x8` (portable)

These types are largely "storage only" and don't implement arithmetic
(if we needed that, `stdsimd`/`packed_simd` would be a better choice)

The implementation *does* expose optimized XOR intrinsics, however,
which seems to be the main thing useful in a portable cryptographic
context, at least as far as our current usages of SIMD go.

The `x86` backend exposes unsafe `target_feature(enable = "...")`
functions as part of its API, intended to be used/inlined within SIMD
backends for particular algorithms.
tarcieri added a commit to RustCrypto/utils that referenced this issue Feb 5, 2021
Implements the following SIMD types, as proposed in
RustCrypto/traits#444:

- `U128` (portable)
- `U256` (x86/x86_64 only)
- `U128x8` (portable)

These types are largely "storage only" and don't implement arithmetic
(if we needed that, `stdsimd`/`packed_simd` would be a better choice)

The implementation *does* expose optimized XOR intrinsics, however,
which seems to be the main thing useful in a portable cryptographic
context, at least as far as our current usages of SIMD go.

The `x86` backend exposes unsafe `target_feature(enable = "...")`
functions as part of its API, intended to be used/inlined within SIMD
backends for particular algorithms.
tarcieri added a commit to RustCrypto/utils that referenced this issue Feb 5, 2021
Implements the following SIMD types, as proposed in
RustCrypto/traits#444:

- `U128` (portable)
- `U256` (x86/x86_64 only)
- `U128x8` (portable)

These types are largely "storage only" and don't implement arithmetic
(if we needed that, `stdsimd`/`packed_simd` would be a better choice)

The implementation *does* expose optimized XOR intrinsics, however,
which seems to be the main thing useful in a portable cryptographic
context, at least as far as our current usages of SIMD go.

The `x86` backend exposes unsafe `target_feature(enable = "...")`
functions as part of its API, intended to be used/inlined within SIMD
backends for particular algorithms.
tarcieri added a commit to RustCrypto/utils that referenced this issue Feb 5, 2021
Implements the following SIMD types, as proposed in
RustCrypto/traits#444:

- `U128` (portable)
- `U256` (x86/x86_64 only)
- `U128x8` (portable)

These types are largely "storage only" and don't implement arithmetic
(if we needed that, `stdsimd`/`packed_simd` would be a better choice)

The implementation *does* expose optimized XOR intrinsics, however,
which seems to be the main thing useful in a portable cryptographic
context, at least as far as our current usages of SIMD go.

The `x86` backend exposes unsafe `target_feature(enable = "...")`
functions as part of its API, intended to be used/inlined within SIMD
backends for particular algorithms.
tarcieri added a commit to RustCrypto/utils that referenced this issue Feb 5, 2021
Implements the following SIMD types, as proposed in
RustCrypto/traits#444:

- `U128` (portable)
- `U256` (x86/x86_64 only)
- `U128x8` (portable)

These types are largely "storage only" and don't implement arithmetic
(if we needed that, `stdsimd`/`packed_simd` would be a better choice)

The implementation *does* expose optimized XOR intrinsics, however,
which seems to be the main thing useful in a portable cryptographic
context, at least as far as our current usages of SIMD go.

The `x86` backend exposes unsafe `target_feature(enable = "...")`
functions as part of its API, intended to be used/inlined within SIMD
backends for particular algorithms.
tarcieri added a commit to RustCrypto/utils that referenced this issue Feb 5, 2021
Implements the following SIMD types, as proposed in
RustCrypto/traits#444:

- `U128` (portable)
- `U256` (x86/x86_64 only)
- `U128x8` (portable)

These types are largely "storage only" and don't implement arithmetic
(if we needed that, `stdsimd`/`packed_simd` would be a better choice)

The implementation *does* expose optimized XOR intrinsics, however,
which seems to be the main thing useful in a portable cryptographic
context, at least as far as our current usages of SIMD go.

The `x86` backend exposes unsafe `target_feature(enable = "...")`
functions as part of its API, intended to be used/inlined within SIMD
backends for particular algorithms.
tarcieri added a commit to RustCrypto/utils that referenced this issue Feb 5, 2021
Implements the following SIMD types, as proposed in
RustCrypto/traits#444:

- `U128` (portable)
- `U256` (x86/x86_64 only)
- `U128x8` (portable)

These types are largely "storage only" and don't implement arithmetic
(if we needed that, `stdsimd`/`packed_simd` would be a better choice)

The implementation *does* expose optimized XOR intrinsics, however,
which seems to be the main thing useful in a portable cryptographic
context, at least as far as our current usages of SIMD go.

The `x86` backend exposes unsafe `target_feature(enable = "...")`
functions as part of its API, intended to be used/inlined within SIMD
backends for particular algorithms.
tarcieri added a commit that referenced this issue Feb 5, 2021
Adds traits which are generic around a buffer type, allowing more
precise control of inputs to encryption and universal hash algorithms.

Potentially resolves #159 and #444.
@newpavlov
Copy link
Member

I think we can close this issue with cipher v0.4 being released. It uses a different approach than outlined here, but it works good enough.

@tarcieri
Copy link
Member Author

I'm reopening this as we continue to get complaints about performance, and attempting to implement one-pass operation for either AES-GCM or ChaCha20Poly1305 does not yield the expected speedups:

@tarcieri tarcieri reopened this May 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cipher Block and stream cipher crate
Projects
None yet
Development

No branches or pull requests

2 participants