Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect output from the SSE2 assembly implementation on Windows #206

Closed
oconnor663 opened this issue Nov 5, 2021 · 2 comments
Closed

Comments

@oconnor663
Copy link
Member

I've found an incorrect output from our SSE2 assembly implementation on Windows. Here's the original CI run that caught it. The feature branch under test there had hackily enabled Rayon across the board, which exercised some configurations that were never previously tested. After minimizing the failure case a bit, it turned out it wasn't specific to Rayon at all, but it does seem to be very sensitive to the details of surrounding code. I've put up a branch with a test case that fails on master, minimized as best I can. Here's the failing test added by that commit:

#[test]
fn test_trigger_sse2() {
    // This stupid loop has to be here. I don't know why.
    for _ in &[0] {
        // The length 65 (two blocks) is significant. It doesn't repro with 64 (one block).
        // It also doesn't repro with an all-zero input.
        let input = &[0xff; 65];
        let expected_hash = [
            183, 235, 50, 217, 156, 24, 190, 219, 2, 216, 176, 255, 224, 53, 28, 95, 57, 148, 179,
            245, 162, 90, 37, 121, 0, 142, 219, 62, 234, 204, 225, 161,
        ];

        // This throwaway call has to be here to trigger the bug.
        crate::Hasher::new().update(input);

        // This assert fails.
        assert_eq!(crate::Hasher::new().update(input).finalize(), expected_hash);
    }
}

That test is pretty wacky, but as far as I can tell, most of the details there are actually required to trigger the bug, including the weird for loop that just executes once. (It doesn't seem to be overly sensitive to the compiler version though. I've tried it on rustc 1.54, 1.56, and nightly, and it seems to behave the same across those.)

That test passes on my Windows box with cargo test but fails with:

cargo test --features=no_avx512,no_avx2,no_sse41 --release

--features=no_avx512,no_avx2,no_sse41 forces us into the SSE2 implementation by disabling everything else. The --release flag there is also required to trigger the failure; it doesn't happen in debug mode. Here's a CI run on this commit, which demonstrates the failure. Note that this run only fails on Windows MSVC, but my original run above failed on Windows GNU also, so both implementations are probably affected. I haven't seen any failures yet with the intrinsics implementation (which you can test with the prefer_intrinsics feature).

When I try to backport this test to the original SSE2 PR, I can repro the failure there. So whatever the issue is, I think it was present in the original implementation rather than something introduced since.

cc @sneves @mkrupcale

@sneves
Copy link
Collaborator

sneves commented Nov 5, 2021

The SSE2 patch introduced xmm10 as a temporary register for one of the rotations, but xmm6-xmm15 are callee-save registers on Windows, and SSE4.1 was only saving the registers it used.

The minimal fix is to use one of the saved registers instead of xmm10:

diff --git a/c/blake3_sse2_x86-64_windows_gnu.S b/c/blake3_sse2_x86-64_windows_gnu.S
index 424b4f8..8852ba5 100644
--- a/c/blake3_sse2_x86-64_windows_gnu.S
+++ b/c/blake3_sse2_x86-64_windows_gnu.S
@@ -2137,10 +2137,10 @@ _blake3_compress_in_place_sse2:
         por     xmm9, xmm8
         movdqa  xmm8, xmm7
         punpcklqdq xmm8, xmm5
-        movdqa  xmm10, xmm6
+        movdqa  xmm14, xmm6
         pand    xmm8, xmmword ptr [PBLENDW_0x3F_MASK+rip]
-        pand    xmm10, xmmword ptr [PBLENDW_0xC0_MASK+rip]
-        por     xmm8, xmm10
+        pand    xmm14, xmmword ptr [PBLENDW_0xC0_MASK+rip]
+        por     xmm8, xmm14
         pshufd  xmm8, xmm8, 0x78
         punpckhdq xmm5, xmm7
         punpckldq xmm6, xmm5
@@ -2268,10 +2268,10 @@ blake3_compress_xof_sse2:
         por     xmm9, xmm8
         movdqa  xmm8, xmm7
         punpcklqdq xmm8, xmm5
-        movdqa  xmm10, xmm6
+        movdqa  xmm14, xmm6
         pand    xmm8, xmmword ptr [PBLENDW_0x3F_MASK+rip]
-        pand    xmm10, xmmword ptr [PBLENDW_0xC0_MASK+rip]
-        por     xmm8, xmm10
+        pand    xmm14, xmmword ptr [PBLENDW_0xC0_MASK+rip]
+        por     xmm8, xmm14
         pshufd  xmm8, xmm8, 0x78
         punpckhdq xmm5, xmm7
         punpckldq xmm6, xmm5
diff --git a/c/blake3_sse2_x86-64_windows_msvc.asm b/c/blake3_sse2_x86-64_windows_msvc.asm
index ff9bb4d..507502f 100644
--- a/c/blake3_sse2_x86-64_windows_msvc.asm
+++ b/c/blake3_sse2_x86-64_windows_msvc.asm
@@ -2139,10 +2139,10 @@ _blake3_compress_in_place_sse2 PROC
         por     xmm9, xmm8
         movdqa  xmm8, xmm7
         punpcklqdq xmm8, xmm5
-        movdqa  xmm10, xmm6
+        movdqa  xmm14, xmm6
         pand    xmm8, xmmword ptr [PBLENDW_0x3F_MASK]
-        pand    xmm10, xmmword ptr [PBLENDW_0xC0_MASK]
-        por     xmm8, xmm10
+        pand    xmm14, xmmword ptr [PBLENDW_0xC0_MASK]
+        por     xmm8, xmm14
         pshufd  xmm8, xmm8, 78H
         punpckhdq xmm5, xmm7
         punpckldq xmm6, xmm5
@@ -2271,10 +2271,10 @@ _blake3_compress_xof_sse2 PROC
         por     xmm9, xmm8
         movdqa  xmm8, xmm7
         punpcklqdq xmm8, xmm5
-        movdqa  xmm10, xmm6
+        movdqa  xmm14, xmm6
         pand    xmm8, xmmword ptr [PBLENDW_0x3F_MASK]
-        pand    xmm10, xmmword ptr [PBLENDW_0xC0_MASK]
-        por     xmm8, xmm10
+        pand    xmm14, xmmword ptr [PBLENDW_0xC0_MASK]
+        por     xmm8, xmm14
         pshufd  xmm8, xmm8, 78H
         punpckhdq xmm5, xmm7
         punpckldq xmm6, xmm5

oconnor663 added a commit that referenced this issue Nov 5, 2021
The SSE2 patch introduced xmm10 as a temporary register for one of the
rotations, but xmm6-xmm15 are callee-save registers on Windows, and
SSE4.1 was only saving the registers it used. The minimal fix is to use
one of the saved registers instead of xmm10.

See #206.
oconnor663 added a commit that referenced this issue Nov 5, 2021
The SSE2 patch introduced xmm10 as a temporary register for one of the
rotations, but xmm6-xmm15 are callee-save registers on Windows, and
SSE4.1 was only saving the registers it used. The minimal fix is to use
one of the saved registers instead of xmm10.

See #206.
oconnor663 added a commit that referenced this issue Nov 5, 2021
The SSE2 patch introduced xmm10 as a temporary register for one of the
rotations, but xmm6-xmm15 are callee-save registers on Windows, and
SSE4.1 was only saving the registers it used. The minimal fix is to use
one of the saved registers instead of xmm10.

See #206.
oconnor663 added a commit that referenced this issue Nov 5, 2021
Changes since 1.1.0:
- SECURITY FIX: Fixed an instance of undefined behavior in the Windows
  SSE2 assembly implementations, which affected both the Rust and C
  libraries in their default build configurations. See
  #206. The cause was a
  vector register that wasn't properly saved and restored. This bug has
  been present since SSE2 support was initially added in v0.3.7. The
  effects of this bug depend on surrounding code and compiler
  optimizations; see test_issue_206_windows_sse2 for an example of this
  bug causing incorrect hash output. Note that even when surrounding
  code is arranged to trigger this bug, the SSE2 implementation is
  normally only invoked on CPUs where SSE4.1 (introduced in 2007) isn't
  supported. One notable exception, however, is if the Rust library is
  built in `no_std` mode, with `default_features = false` or similar. In
  that case, runtime CPU feature detection is disabled, and since LLVM
  assumes that all x86-64 targets support SSE2, the SSE2 implementation
  will be invoked. For that reason, Rust callers who build `blake3` in
  `no_std` mode for x86-64 Windows targets are the most likely to
  trigger this bug. We found this bug in internal testing, and we aren't
  aware of any callers encountering it in practice.
- Added the Hasher::count() method.
@oconnor663
Copy link
Member Author

Fixed and released as v1.2.0.

kevingoh pushed a commit to ITS-AT-dev/BLAKE3 that referenced this issue Oct 23, 2023
The SSE2 patch introduced xmm10 as a temporary register for one of the
rotations, but xmm6-xmm15 are callee-save registers on Windows, and
SSE4.1 was only saving the registers it used. The minimal fix is to use
one of the saved registers instead of xmm10.

See BLAKE3-team#206.
kevingoh pushed a commit to ITS-AT-dev/BLAKE3 that referenced this issue Oct 23, 2023
Changes since 1.1.0:
- SECURITY FIX: Fixed an instance of undefined behavior in the Windows
  SSE2 assembly implementations, which affected both the Rust and C
  libraries in their default build configurations. See
  BLAKE3-team#206. The cause was a
  vector register that wasn't properly saved and restored. This bug has
  been present since SSE2 support was initially added in v0.3.7. The
  effects of this bug depend on surrounding code and compiler
  optimizations; see test_issue_206_windows_sse2 for an example of this
  bug causing incorrect hash output. Note that even when surrounding
  code is arranged to trigger this bug, the SSE2 implementation is
  normally only invoked on CPUs where SSE4.1 (introduced in 2007) isn't
  supported. One notable exception, however, is if the Rust library is
  built in `no_std` mode, with `default_features = false` or similar. In
  that case, runtime CPU feature detection is disabled, and since LLVM
  assumes that all x86-64 targets support SSE2, the SSE2 implementation
  will be invoked. For that reason, Rust callers who build `blake3` in
  `no_std` mode for x86-64 Windows targets are the most likely to
  trigger this bug. We found this bug in internal testing, and we aren't
  aware of any callers encountering it in practice.
- Added the Hasher::count() method.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants