keccak: add asm feature; use cpufeatures on aarch64 #24

Merged: 2 commits, Nov 13, 2022
6 changes: 5 additions & 1 deletion keccak/Cargo.toml
@@ -14,5 +14,9 @@ categories = ["cryptography", "no-std"]
readme = "README.md"

[features]
asm = [] # Use optimized assembly when available (currently only ARMv8)
Member Author

Should this perhaps be a cfg attribute rather than a crate feature?

Member

Hm, I am not sure. Application developers would probably want to have it enabled by default without any additional steps from users. It's a slightly different situation from the aes crate where configuration flags are used mostly for testing backends.

no_unroll = [] # Do not unroll loops for binary size reduction
simd = [] # Use core::simd (WARNING: requires Nightly)

[target.'cfg(target_arch = "aarch64")'.dependencies]
cpufeatures = "0.2"
Comment on lines +21 to +22
Member Author

It's a bit unfortunate that this is a hard dependency on aarch64, although I couldn't figure out how to gate it better.

Member

What about gating on a feature or configuration flag inside the cfg expression?

Member Author (@tarcieri, Nov 13, 2022)

This doesn't work:

[target.'cfg(all(target_arch = "aarch64", feature = "asm"))'.dependencies]
cpufeatures = "0.2"

It prints this warning:

warning: Found feature = ... in target.'cfg(...)'.dependencies. This key is not supported for selecting dependencies and will not work as expected. Use the [features] section instead: https://doc.rust-lang.org/cargo/reference/features.html

This, however, seems to work:

[target.'cfg(all(keccak_asm, target_arch = "aarch64"))'.dependencies]
cpufeatures = "0.2"

So that's a possible argument for using a cfg attribute for gating rather than a Cargo feature.
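
As a rough illustration of the cfg-based alternative discussed in this thread, the sketch below gates code on a custom `keccak_asm` flag instead of a Cargo feature. The flag name comes from the comment above; the function name `selected_backend` is hypothetical, and nothing here is taken from the crate itself.

```rust
/// Returns the name of the backend this build selected.
///
/// `keccak_asm` is a custom cfg flag (name taken from the review thread).
/// It is only set when rustc is invoked with `--cfg keccak_asm`; it is not
/// a Cargo feature and does not appear in `[features]`.
#[allow(unexpected_cfgs)] // newer compilers warn about cfg names they do not know
pub fn selected_backend() -> &'static str {
    if cfg!(all(keccak_asm, target_arch = "aarch64")) {
        // Only reachable when the flag is set AND the target is aarch64.
        "aarch64 asm"
    } else {
        // Default build: the flag is absent, so the portable path is chosen.
        "portable"
    }
}
```

A flag like this is typically passed through the compiler flags, for example `RUSTFLAGS="--cfg keccak_asm" cargo build`. The trade-off raised in the thread still applies: a cfg flag can gate the target-specific dependency, but downstream crates cannot switch it on through their own `Cargo.toml` the way they can with a feature.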

225 changes: 111 additions & 114 deletions keccak/src/aarch64_sha3.rs
@@ -1,130 +1,127 @@
#![cfg(all(target_arch = "aarch64", target_feature = "sha3"))]

/// Keccak-f1600 on ARMv8.4-A with FEAT_SHA3.
///
/// See K12.2.2, p. 11,749 of the ARM Reference manual.
/// Adapted from the Keccak-f1600 implementation in XKCP/K12;
/// see <https://github.com/XKCP/K12/blob/df6a21e6d1f34c1aa36e8d702540899c97dba5a0/lib/ARMv8Asha3/KeccakP-1600-ARMv8Asha3.S#L69>
pub fn keccak_f1600(state: &mut [u64; 25]) {
unsafe {
core::arch::asm!("
// Read state
ld1.1d {{ v0- v3}}, [x0], #32
ld1.1d {{ v4- v7}}, [x0], #32
ld1.1d {{ v8-v11}}, [x0], #32
ld1.1d {{v12-v15}}, [x0], #32
ld1.1d {{v16-v19}}, [x0], #32
ld1.1d {{v20-v23}}, [x0], #32
ld1.1d {{v24}}, [x0]
sub x0, x0, #192
#[target_feature(enable = "sha3")]
pub unsafe fn f1600_asm(state: &mut [u64; 25]) {
core::arch::asm!("
// Read state
ld1.1d {{ v0- v3}}, [x0], #32
ld1.1d {{ v4- v7}}, [x0], #32
ld1.1d {{ v8-v11}}, [x0], #32
ld1.1d {{v12-v15}}, [x0], #32
ld1.1d {{v16-v19}}, [x0], #32
ld1.1d {{v20-v23}}, [x0], #32
ld1.1d {{v24}}, [x0]
sub x0, x0, #192

// Loop 24 rounds
// NOTE: This loop actually computes two f1600 functions in
// parallel, in both the lower and the upper 64-bit of the
// 128-bit registers v0-v24.
mov x8, #24
0: sub x8, x8, #1
// Loop 24 rounds
// NOTE: This loop actually computes two f1600 functions in
// parallel, in both the lower and the upper 64-bit of the
// 128-bit registers v0-v24.
mov x8, #24
0: sub x8, x8, #1

// Theta Calculations
eor3.16b v25, v20, v15, v10
eor3.16b v26, v21, v16, v11
eor3.16b v27, v22, v17, v12
eor3.16b v28, v23, v18, v13
eor3.16b v29, v24, v19, v14
eor3.16b v25, v25, v5, v0
eor3.16b v26, v26, v6, v1
eor3.16b v27, v27, v7, v2
eor3.16b v28, v28, v8, v3
eor3.16b v29, v29, v9, v4
rax1.2d v30, v25, v27
rax1.2d v31, v26, v28
rax1.2d v27, v27, v29
rax1.2d v28, v28, v25
rax1.2d v29, v29, v26

// Rho and Phi
eor.16b v0, v0, v29
xar.2d v25, v1, v30, #64 - 1
xar.2d v1, v6, v30, #64 - 44
xar.2d v6, v9, v28, #64 - 20
xar.2d v9, v22, v31, #64 - 61
xar.2d v22, v14, v28, #64 - 39
xar.2d v14, v20, v29, #64 - 18
xar.2d v26, v2, v31, #64 - 62
xar.2d v2, v12, v31, #64 - 43
xar.2d v12, v13, v27, #64 - 25
xar.2d v13, v19, v28, #64 - 8
xar.2d v19, v23, v27, #64 - 56
xar.2d v23, v15, v29, #64 - 41
xar.2d v15, v4, v28, #64 - 27
xar.2d v28, v24, v28, #64 - 14
xar.2d v24, v21, v30, #64 - 2
xar.2d v8, v8, v27, #64 - 55
xar.2d v4, v16, v30, #64 - 45
xar.2d v16, v5, v29, #64 - 36
xar.2d v5, v3, v27, #64 - 28
xar.2d v27, v18, v27, #64 - 21
xar.2d v3, v17, v31, #64 - 15
xar.2d v30, v11, v30, #64 - 10
xar.2d v31, v7, v31, #64 - 6
xar.2d v29, v10, v29, #64 - 3
// Theta Calculations
eor3.16b v25, v20, v15, v10
eor3.16b v26, v21, v16, v11
eor3.16b v27, v22, v17, v12
eor3.16b v28, v23, v18, v13
eor3.16b v29, v24, v19, v14
eor3.16b v25, v25, v5, v0
eor3.16b v26, v26, v6, v1
eor3.16b v27, v27, v7, v2
eor3.16b v28, v28, v8, v3
eor3.16b v29, v29, v9, v4
rax1.2d v30, v25, v27
rax1.2d v31, v26, v28
rax1.2d v27, v27, v29
rax1.2d v28, v28, v25
rax1.2d v29, v29, v26

// Chi and Iota
bcax.16b v20, v26, v22, v8
bcax.16b v21, v8, v23, v22
bcax.16b v22, v22, v24, v23
bcax.16b v23, v23, v26, v24
bcax.16b v24, v24, v8, v26

ld1r.2d {{v26}}, [x1], #8
// Rho and Phi
eor.16b v0, v0, v29
xar.2d v25, v1, v30, #64 - 1
xar.2d v1, v6, v30, #64 - 44
xar.2d v6, v9, v28, #64 - 20
xar.2d v9, v22, v31, #64 - 61
xar.2d v22, v14, v28, #64 - 39
xar.2d v14, v20, v29, #64 - 18
xar.2d v26, v2, v31, #64 - 62
xar.2d v2, v12, v31, #64 - 43
xar.2d v12, v13, v27, #64 - 25
xar.2d v13, v19, v28, #64 - 8
xar.2d v19, v23, v27, #64 - 56
xar.2d v23, v15, v29, #64 - 41
xar.2d v15, v4, v28, #64 - 27
xar.2d v28, v24, v28, #64 - 14
xar.2d v24, v21, v30, #64 - 2
xar.2d v8, v8, v27, #64 - 55
xar.2d v4, v16, v30, #64 - 45
xar.2d v16, v5, v29, #64 - 36
xar.2d v5, v3, v27, #64 - 28
xar.2d v27, v18, v27, #64 - 21
xar.2d v3, v17, v31, #64 - 15
xar.2d v30, v11, v30, #64 - 10
xar.2d v31, v7, v31, #64 - 6
xar.2d v29, v10, v29, #64 - 3

bcax.16b v17, v30, v19, v3
bcax.16b v18, v3, v15, v19
bcax.16b v19, v19, v16, v15
bcax.16b v15, v15, v30, v16
bcax.16b v16, v16, v3, v30

bcax.16b v10, v25, v12, v31
bcax.16b v11, v31, v13, v12
bcax.16b v12, v12, v14, v13
bcax.16b v13, v13, v25, v14
bcax.16b v14, v14, v31, v25
// Chi and Iota
bcax.16b v20, v26, v22, v8
bcax.16b v21, v8, v23, v22
bcax.16b v22, v22, v24, v23
bcax.16b v23, v23, v26, v24
bcax.16b v24, v24, v8, v26

bcax.16b v7, v29, v9, v4
bcax.16b v8, v4, v5, v9
bcax.16b v9, v9, v6, v5
bcax.16b v5, v5, v29, v6
bcax.16b v6, v6, v4, v29

bcax.16b v3, v27, v0, v28
bcax.16b v4, v28, v1, v0
bcax.16b v0, v0, v2, v1
bcax.16b v1, v1, v27, v2
bcax.16b v2, v2, v28, v27
ld1r.2d {{v26}}, [x1], #8

eor.16b v0,v0,v26
bcax.16b v17, v30, v19, v3
bcax.16b v18, v3, v15, v19
bcax.16b v19, v19, v16, v15
bcax.16b v15, v15, v30, v16
bcax.16b v16, v16, v3, v30

// Rounds loop
cbnz w8, 0b
bcax.16b v10, v25, v12, v31
bcax.16b v11, v31, v13, v12
bcax.16b v12, v12, v14, v13
bcax.16b v13, v13, v25, v14
bcax.16b v14, v14, v31, v25

// Write state
st1.1d {{ v0- v3}}, [x0], #32
st1.1d {{ v4- v7}}, [x0], #32
st1.1d {{ v8-v11}}, [x0], #32
st1.1d {{v12-v15}}, [x0], #32
st1.1d {{v16-v19}}, [x0], #32
st1.1d {{v20-v23}}, [x0], #32
st1.1d {{v24}}, [x0]
",
in("x0") state.as_mut_ptr(),
in("x1") crate::RC.as_ptr(),
clobber_abi("C"),
options(nostack)
);
}
bcax.16b v7, v29, v9, v4
bcax.16b v8, v4, v5, v9
bcax.16b v9, v9, v6, v5
bcax.16b v5, v5, v29, v6
bcax.16b v6, v6, v4, v29

bcax.16b v3, v27, v0, v28
bcax.16b v4, v28, v1, v0
bcax.16b v0, v0, v2, v1
bcax.16b v1, v1, v27, v2
bcax.16b v2, v2, v28, v27

eor.16b v0,v0,v26

// Rounds loop
cbnz w8, 0b

// Write state
st1.1d {{ v0- v3}}, [x0], #32
st1.1d {{ v4- v7}}, [x0], #32
st1.1d {{ v8-v11}}, [x0], #32
st1.1d {{v12-v15}}, [x0], #32
st1.1d {{v16-v19}}, [x0], #32
st1.1d {{v20-v23}}, [x0], #32
st1.1d {{v24}}, [x0]
",
in("x0") state.as_mut_ptr(),
in("x1") crate::RC.as_ptr(),
clobber_abi("C"),
options(nostack)
);
}

#[cfg(test)]
#[cfg(all(test, target_feature = "sha3"))]
mod tests {
use super::*;

@@ -188,9 +185,9 @@ mod tests {
];

let mut state = [0u64; 25];
keccak_f1600(&mut state);
unsafe { keccak_f1600(&mut state) };
assert_eq!(state, state_first);
keccak_f1600(&mut state);
unsafe { keccak_f1600(&mut state) };
assert_eq!(state, state_second);
}
}
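
For readers less familiar with the SHA3 extension, the four instructions used in the assembly above can be modelled on a single 64-bit lane in plain Rust. This is only a sketch of the instruction semantics as I understand them from Arm's documentation; the function names are illustrative and are not part of the crate, and the real instructions operate on 128-bit vector registers.

```rust
/// EOR3: three-way XOR, used to build the theta column parities.
pub fn eor3(a: u64, b: u64, c: u64) -> u64 {
    a ^ b ^ c
}

/// RAX1: XOR `a` with `b` rotated left by one bit (the theta "D" values).
pub fn rax1(a: u64, b: u64) -> u64 {
    a ^ b.rotate_left(1)
}

/// XAR: XOR two lanes, then rotate the result RIGHT by `imm` bits.
/// The assembly writes the immediate as `64 - n` so that the net effect
/// is a left rotation by `n`.
pub fn xar(a: u64, b: u64, imm: u32) -> u64 {
    (a ^ b).rotate_right(imm)
}

/// BCAX: bit clear and XOR, i.e. `a ^ (b & !c)`, which matches the
/// shape of the chi step.
pub fn bcax(a: u64, b: u64, c: u64) -> u64 {
    a ^ (b & !c)
}
```

The `#64 - n` immediates in the rho step follow from XAR rotating right: a right rotation by `64 - n` on a 64-bit lane is the same as a left rotation by `n`.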
20 changes: 17 additions & 3 deletions keccak/src/lib.rs
@@ -48,8 +48,16 @@ use core::{

#[rustfmt::skip]
mod unroll;

#[cfg(all(target_arch = "aarch64", feature = "asm"))]
mod aarch64_sha3;

#[cfg(all(target_arch = "aarch64", feature = "asm"))]
pub use aarch64_sha3::f1600_asm;

#[cfg(all(target_arch = "aarch64", feature = "asm"))]
cpufeatures::new!(armv8_sha3_intrinsics, "sha3");

const PLEN: usize = 25;

const RHO: [u32; 24] = [
@@ -145,11 +153,17 @@ impl_keccak!(f200, u8);
impl_keccak!(f400, u16);
impl_keccak!(f800, u32);

#[cfg(not(all(target_arch = "aarch64", target_feature = "sha3")))]
#[cfg(not(all(target_arch = "aarch64", feature = "asm")))]
impl_keccak!(f1600, u64);

#[cfg(all(target_arch = "aarch64", target_feature = "sha3"))]
pub use aarch64_sha3::keccak_f1600 as f1600;
#[cfg(all(target_arch = "aarch64", feature = "asm"))]
pub fn f1600(state: &mut [u64; PLEN]) {
if armv8_sha3_intrinsics::get() {
Member Author

@newpavlov there's not really a way to make this interface work with an init token. Does it seem ok to you?

Member

Yeah, it's not ideal, but should be fine. One potential alternative is to expose an unsafe gated f1600_aarch64_asm function and do the switch inside sha3.

Member Author

Sure, that works. I can update the PR.

Member (@newpavlov, Nov 13, 2022)

I am not saying a separate function is a better approach. Just floating an idea.

Member Author

It seems like a nice thing to have, especially for the sha3 use case.

It's unsafe and #[target_feature(enable = "sha3")]-gated, which should prevent casual misuse.

unsafe { f1600_asm(state) }
} else {
keccak_p(state, u64::KECCAK_F_ROUND_COUNT);
}
}

#[cfg(feature = "simd")]
/// SIMD implementations for Keccak-f1600 sponge function
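
The dispatch pattern in the `f1600` wrapper above (a runtime CPU check guarding a call to an `unsafe`, `#[target_feature]`-gated function, with a portable fallback) can be sketched in isolation. This sketch makes several assumptions: it uses the standard library's `is_aarch64_feature_detected!` macro in place of the `cpufeatures` crate so that it needs no dependencies, and `permute_soft` and `permute_sha3` are hypothetical stand-ins that do not implement Keccak.

```rust
/// Hypothetical portable fallback. It only XORs a constant into each lane
/// so the example stays short; it is NOT the Keccak permutation.
pub fn permute_soft(state: &mut [u64; 25]) {
    for lane in state.iter_mut() {
        *lane ^= 0xA5A5_A5A5_A5A5_A5A5;
    }
}

/// Hypothetical accelerated path. Marking it `unsafe` and gating it with
/// `#[target_feature]` means callers must first prove the CPU supports
/// `sha3`. A real version would contain the inline assembly.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "sha3")]
pub unsafe fn permute_sha3(state: &mut [u64; 25]) {
    // Stand-in body: reuse the portable code so both paths agree.
    permute_soft(state);
}

/// Safe entry point: checks the CPU at runtime and picks a backend.
pub fn permute(state: &mut [u64; 25]) {
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("sha3") {
            // SAFETY: the check above confirmed `sha3` is available.
            unsafe { permute_sha3(state) };
            return;
        }
    }
    // Any other target, or aarch64 without `sha3`.
    permute_soft(state);
}
```

Keeping the accelerated function `unsafe` and public, as suggested in the thread, would let a downstream crate such as sha3 perform the feature check once and call the fast path directly, while the safe wrapper remains available for everyone else.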