blake3 single thread is slower than sha256 on Apple silicon #315

Open
nirs opened this issue Jun 17, 2023 · 3 comments

@nirs

nirs commented Jun 17, 2023

On Intel CPUs I see a ~10x speedup over sha256 for the C implementation, but on
Apple silicon it is about 1.3x slower than sha256.

I built the C version both with cmake and manually (based on README.md). Both builds
show the same performance, matching b3sum running with a single thread.

cmake build:

% cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
-- The C compiler identification is AppleClang 14.0.3.14030022
-- The ASM compiler identification is Clang with GNU-like command-line
-- Found assembler: /Library/Developer/CommandLineTools/usr/bin/cc
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- 
 * NEON SIMD intrinsics, The library uses NEON SIMD intrinsics.

-- Configuring done (0.2s)
-- Generating done (0.0s)
-- Build files have been written to: /Users/nir/src/BLAKE3/c/build

% cmake --build build
[ 20%] Building C object CMakeFiles/blake3.dir/blake3.c.o
[ 40%] Building C object CMakeFiles/blake3.dir/blake3_dispatch.c.o
[ 60%] Building C object CMakeFiles/blake3.dir/blake3_portable.c.o
[ 80%] Building C object CMakeFiles/blake3.dir/blake3_neon.c.o
clang: warning: argument unused during compilation: '-mfpu=neon' [-Wunused-command-line-argument]
[100%] Linking C static library libblake3.a
[100%] Built target blake3

Manual build:

% mkdir build
% gcc -shared -O3 -o build/libblake3.so -DBLAKE3_USE_NEON=1 blake3.c blake3_dispatch.c \
    blake3_portable.c blake3_neon.c

Building the example:

% cc example.c -O3 -L build -lblake3 -o build/example

Creating test file:

% dd if=/dev/zero bs=1M count=4096 of=test.data

Testing read throughput from the file (redirected to stdin):

% time dd bs=64K of=/dev/null status=none < test.data
dd bs=64K of=/dev/null status=none < test.data  0.02s user 0.34s system 99% cpu 0.355 total

Measuring hash throughput:

% time openssl sha256 < test.data
8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca
openssl sha256 < test.data  1.71s user 0.49s system 99% cpu 2.202 total

% time build/example < test.data
7dde7c9fed144013fedbe2b0bbf2d82f004b60b589485851cdec29b27be408d7
build/example < test.data  2.58s user 0.36s system 99% cpu 2.942 total

% time b3sum < test.data
7dde7c9fed144013fedbe2b0bbf2d82f004b60b589485851cdec29b27be408d7  -
b3sum < test.data  2.59s user 0.36s system 99% cpu 2.951 total

% time b3sum --num-threads 1 < test.data
7dde7c9fed144013fedbe2b0bbf2d82f004b60b589485851cdec29b27be408d7  -
b3sum --num-threads 1 < test.data  2.58s user 0.36s system 99% cpu 2.935 total

Note: I am testing with openssl sha256 because both shasum -a 256 and sha256sum
(from coreutils) are extremely slow (~6x slower) on macOS.

Looking at the OpenSSL code, sha256 uses sha256-armv8.S on this machine.

Tested on MacBook Pro M2 Max.

@oconnor663
Member

ARM NEON only provides 128-bit vector registers, compared to the 512-bit registers available on Intel CPUs that support AVX-512, and that's a big part of the difference you're seeing. There are lots of other details besides just vector size that also come into play here; see for example @sneves' comment on another recent thread.

ARM SVE and SVE2 can potentially provide larger vectors, but I'm not aware of any consumer hardware that supports those. It'll probably make sense for BLAKE3 to provide an SVE implementation at some point.

@sneves
Collaborator

sneves commented Jun 17, 2023

Apple Silicon also happens to have fast, dedicated SHA-256 instructions. This is why openssl sha256 is much faster: it uses those instructions instead of a pure software implementation.

@nirs
Author

nirs commented Feb 26, 2024

Updating the results: we now see only a 12-13% difference.

Tested with:

  • OpenSSL 3.2.1 30 Jan 2024 (Library: OpenSSL 3.2.1 30 Jan 2024)

  • b3sum 1.5.0 (from brew)

  • 3 versions from git commit 8fc3618

    • example-brew - linked with blake3 1.5.0 from brew

      gcc -O3 -o example-brew c/example.c $(pkg-config --libs libblake3)
      
    • example-neon - built from source with neon support

      gcc -O3 -o example-neon -DBLAKE3_USE_NEON=1 c/example.c c/blake3.c c/blake3_dispatch.c c/blake3_portable.c c/blake3_neon.c
      
    • example-portable - built from source without neon support

      gcc -O3 -o example-portable -DBLAKE3_USE_NEON=0 c/example.c c/blake3.c c/blake3_dispatch.c c/blake3_portable.c
      
% hyperfine -w 2 "openssl sha256 < /var/tmp/1g.img" \
                 "b3sum < /var/tmp/1g.img" \
                 "./example-brew < /var/tmp/1g.img" \
                 "./example-neon < /var/tmp/1g.img" \
                 "./example-portable < /var/tmp/1g.img"
Benchmark 1: openssl sha256 < /var/tmp/1g.img
  Time (mean ± σ):     553.2 ms ±   1.2 ms    [User: 428.8 ms, System: 112.5 ms]
  Range (min … max):   550.5 ms … 555.4 ms    10 runs

Benchmark 2: b3sum < /var/tmp/1g.img
  Time (mean ± σ):     621.3 ms ±   1.2 ms    [User: 536.2 ms, System: 72.9 ms]
  Range (min … max):   619.1 ms … 622.9 ms    10 runs

Benchmark 3: ./example-brew < /var/tmp/1g.img
  Time (mean ± σ):     626.7 ms ±   1.1 ms    [User: 542.9 ms, System: 71.1 ms]
  Range (min … max):   624.8 ms … 628.0 ms    10 runs

Benchmark 4: ./example-neon < /var/tmp/1g.img
  Time (mean ± σ):     619.1 ms ±   1.2 ms    [User: 534.1 ms, System: 70.8 ms]
  Range (min … max):   617.5 ms … 622.1 ms    10 runs

Benchmark 5: ./example-portable < /var/tmp/1g.img
  Time (mean ± σ):      1.315 s ±  0.004 s    [User: 1.204 s, System: 0.084 s]
  Range (min … max):    1.308 s …  1.322 s    10 runs

Summary
  'openssl sha256 < /var/tmp/1g.img' ran
    1.12 ± 0.00 times faster than './example-neon < /var/tmp/1g.img'
    1.12 ± 0.00 times faster than 'b3sum < /var/tmp/1g.img'
    1.13 ± 0.00 times faster than './example-brew < /var/tmp/1g.img'
    2.38 ± 0.01 times faster than './example-portable < /var/tmp/1g.img'
