New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
incorrect output from the SSE2 assembly implementation on Windows #206
Comments
The SSE2 patch introduced The minimal fix is to use one of the saved registers instead of diff --git a/c/blake3_sse2_x86-64_windows_gnu.S b/c/blake3_sse2_x86-64_windows_gnu.S
index 424b4f8..8852ba5 100644
--- a/c/blake3_sse2_x86-64_windows_gnu.S
+++ b/c/blake3_sse2_x86-64_windows_gnu.S
@@ -2137,10 +2137,10 @@ _blake3_compress_in_place_sse2:
por xmm9, xmm8
movdqa xmm8, xmm7
punpcklqdq xmm8, xmm5
- movdqa xmm10, xmm6
+ movdqa xmm14, xmm6
pand xmm8, xmmword ptr [PBLENDW_0x3F_MASK+rip]
- pand xmm10, xmmword ptr [PBLENDW_0xC0_MASK+rip]
- por xmm8, xmm10
+ pand xmm14, xmmword ptr [PBLENDW_0xC0_MASK+rip]
+ por xmm8, xmm14
pshufd xmm8, xmm8, 0x78
punpckhdq xmm5, xmm7
punpckldq xmm6, xmm5
@@ -2268,10 +2268,10 @@ blake3_compress_xof_sse2:
por xmm9, xmm8
movdqa xmm8, xmm7
punpcklqdq xmm8, xmm5
- movdqa xmm10, xmm6
+ movdqa xmm14, xmm6
pand xmm8, xmmword ptr [PBLENDW_0x3F_MASK+rip]
- pand xmm10, xmmword ptr [PBLENDW_0xC0_MASK+rip]
- por xmm8, xmm10
+ pand xmm14, xmmword ptr [PBLENDW_0xC0_MASK+rip]
+ por xmm8, xmm14
pshufd xmm8, xmm8, 0x78
punpckhdq xmm5, xmm7
punpckldq xmm6, xmm5
diff --git a/c/blake3_sse2_x86-64_windows_msvc.asm b/c/blake3_sse2_x86-64_windows_msvc.asm
index ff9bb4d..507502f 100644
--- a/c/blake3_sse2_x86-64_windows_msvc.asm
+++ b/c/blake3_sse2_x86-64_windows_msvc.asm
@@ -2139,10 +2139,10 @@ _blake3_compress_in_place_sse2 PROC
por xmm9, xmm8
movdqa xmm8, xmm7
punpcklqdq xmm8, xmm5
- movdqa xmm10, xmm6
+ movdqa xmm14, xmm6
pand xmm8, xmmword ptr [PBLENDW_0x3F_MASK]
- pand xmm10, xmmword ptr [PBLENDW_0xC0_MASK]
- por xmm8, xmm10
+ pand xmm14, xmmword ptr [PBLENDW_0xC0_MASK]
+ por xmm8, xmm14
pshufd xmm8, xmm8, 78H
punpckhdq xmm5, xmm7
punpckldq xmm6, xmm5
@@ -2271,10 +2271,10 @@ _blake3_compress_xof_sse2 PROC
por xmm9, xmm8
movdqa xmm8, xmm7
punpcklqdq xmm8, xmm5
- movdqa xmm10, xmm6
+ movdqa xmm14, xmm6
pand xmm8, xmmword ptr [PBLENDW_0x3F_MASK]
- pand xmm10, xmmword ptr [PBLENDW_0xC0_MASK]
- por xmm8, xmm10
+ pand xmm14, xmmword ptr [PBLENDW_0xC0_MASK]
+ por xmm8, xmm14
pshufd xmm8, xmm8, 78H
punpckhdq xmm5, xmm7
punpckldq xmm6, xmm5 |
oconnor663
added a commit
that referenced
this issue
Nov 5, 2021
The SSE2 patch introduced xmm10 as a temporary register for one of the rotations, but xmm6-xmm15 are callee-save registers on Windows, and SSE4.1 was only saving the registers it used. The minimal fix is to use one of the saved registers instead of xmm10. See #206.
oconnor663
added a commit
that referenced
this issue
Nov 5, 2021
The SSE2 patch introduced xmm10 as a temporary register for one of the rotations, but xmm6-xmm15 are callee-save registers on Windows, and SSE4.1 was only saving the registers it used. The minimal fix is to use one of the saved registers instead of xmm10. See #206.
oconnor663
added a commit
that referenced
this issue
Nov 5, 2021
The SSE2 patch introduced xmm10 as a temporary register for one of the rotations, but xmm6-xmm15 are callee-save registers on Windows, and SSE4.1 was only saving the registers it used. The minimal fix is to use one of the saved registers instead of xmm10. See #206.
oconnor663
added a commit
that referenced
this issue
Nov 5, 2021
Changes since 1.1.0: - SECURITY FIX: Fixed an instance of undefined behavior in the Windows SSE2 assembly implementations, which affected both the Rust and C libraries in their default build configurations. See #206. The cause was a vector register that wasn't properly saved and restored. This bug has been present since SSE2 support was initially added in v0.3.7. The effects of this bug depend on surrounding code and compiler optimizations; see test_issue_206_windows_sse2 for an example of this bug causing incorrect hash output. Note that even when surrounding code is arranged to trigger this bug, the SSE2 implementation is normally only invoked on CPUs where SSE4.1 (introduced in 2007) isn't supported. One notable exception, however, is if the Rust library is built in `no_std` mode, with `default_features = false` or similar. In that case, runtime CPU feature detection is disabled, and since LLVM assumes that all x86-64 targets support SSE2, the SSE2 implementation will be invoked. For that reason, Rust callers who build `blake3` in `no_std` mode for x86-64 Windows targets are the most likely to trigger this bug. We found this bug in internal testing, and we aren't aware of any callers encountering it in practice. - Added the Hasher::count() method.
Fixed and released as v1.2.0. |
kevingoh
pushed a commit
to ITS-AT-dev/BLAKE3
that referenced
this issue
Oct 23, 2023
The SSE2 patch introduced xmm10 as a temporary register for one of the rotations, but xmm6-xmm15 are callee-save registers on Windows, and SSE4.1 was only saving the registers it used. The minimal fix is to use one of the saved registers instead of xmm10. See BLAKE3-team#206.
kevingoh
pushed a commit
to ITS-AT-dev/BLAKE3
that referenced
this issue
Oct 23, 2023
Changes since 1.1.0: - SECURITY FIX: Fixed an instance of undefined behavior in the Windows SSE2 assembly implementations, which affected both the Rust and C libraries in their default build configurations. See BLAKE3-team#206. The cause was a vector register that wasn't properly saved and restored. This bug has been present since SSE2 support was initially added in v0.3.7. The effects of this bug depend on surrounding code and compiler optimizations; see test_issue_206_windows_sse2 for an example of this bug causing incorrect hash output. Note that even when surrounding code is arranged to trigger this bug, the SSE2 implementation is normally only invoked on CPUs where SSE4.1 (introduced in 2007) isn't supported. One notable exception, however, is if the Rust library is built in `no_std` mode, with `default_features = false` or similar. In that case, runtime CPU feature detection is disabled, and since LLVM assumes that all x86-64 targets support SSE2, the SSE2 implementation will be invoked. For that reason, Rust callers who build `blake3` in `no_std` mode for x86-64 Windows targets are the most likely to trigger this bug. We found this bug in internal testing, and we aren't aware of any callers encountering it in practice. - Added the Hasher::count() method.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
I've found an incorrect output from our SSE2 assembly implementation on Windows. Here's the original CI run that caught it. The feature branch under test there had hackily enabled Rayon across the board, which exercised some configurations that were never previously tested. After minimizing the failure case a bit, it turned out it wasn't specific to Rayon at all, but it does seem to be very sensitive to the details of surrounding code. I've put up a branch with a test case that fails on master, minimized as best I can. Here's the failing test added by that commit:
That test is pretty wacky, but as far as I can tell, most of the details there are actually required to trigger the bug, including the weird
for
loop that just executes once. (It doesn't seem to be overly sensitive to the compiler version though. I've tried it on rustc 1.54, 1.56, and nightly, and it seems to behave the same across those.)That test passes on my Windows box with
cargo test
but fails with:--features=no_avx512,no_avx2,no_sse41
forces us into the SSE2 implementation by disabling everything else. The--release
flag there is also required to trigger the failure; it doesn't happen in debug mode. Here's a CI run on this commit, which demonstrates the failure. Note that this run only fails on Windows MSVC, but my original run above failed on Windows GNU also, so both implementations are probably affected. I haven't seen any failures yet with the intrinsics implementation (which you can test with theprefer_intrinsics
feature).When I try to backport this test to the original SSE2 PR, I can repro the failure there. So whatever the issue is, I think it was present in the original implementation rather than something introduced since.
cc @sneves @mkrupcale
The text was updated successfully, but these errors were encountered: