Add BLAKE3 hashing algorithm (single-threaded C-based implementation) #10600
Motivation
This PR adds BLAKE3 to the available hashing algorithms.
NOTE: Although this PR is complete (and I would appreciate any feedback), I'm adding it as a draft for now because I would also like to try working on an alternative PR based on the multi-threaded Rust implementation (more below).
Context
The change is relatively small and non-invasive.
I added the BLAKE3 source tarball as a derivation and used this to define a NIX_BLAKE3_SRC environment variable, which is then referenced in the makefiles.
The recommended way to build the BLAKE3 C implementation is to add the files directly to the build system rather than compiling them separately as a library, so that's the reasoning for fetching the tarball.
In order to handle building source files from NIX_BLAKE3_SRC (which is read-only), I needed to add some custom build rules specifically for those files.
I also added platform detection for ARM and x86_64 (along with detection for Darwin, Linux, and Windows) and use this to conditionally compile the appropriate SIMD implementations for the given platform.
I use the assembly files directly rather than the C-based intrinsics versions, since that is also the recommended approach.
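As an illustration of the platform-dependent source selection described above, here is a minimal Python sketch. It is not the makefile logic from this PR: the helper name `blake3_sources` and its arguments are hypothetical, the file names follow the layout of the upstream BLAKE3 `c/` directory, and the Windows branch assumes a GNU toolchain (the MSVC build uses `.asm` files instead).

```python
# Hypothetical helper: models which BLAKE3 C sources a build would compile
# for a given architecture and OS. Names of the helper and its parameters
# are assumptions made for this sketch.

# Files needed on every platform (portable implementation plus dispatcher).
PORTABLE = ["blake3.c", "blake3_dispatch.c", "blake3_portable.c"]

# x86_64 SIMD variants shipped as assembly files upstream.
X86_SIMD = ["sse2", "sse41", "avx2", "avx512"]


def blake3_sources(arch: str, os_name: str) -> list[str]:
    """Return the list of source files to compile for (arch, os_name)."""
    sources = list(PORTABLE)
    if arch == "x86_64":
        # Linux and Darwin share the "unix" assembly files.
        # Assumption: Windows builds use the GNU assembler variant.
        suffix = "windows_gnu" if os_name == "windows" else "unix"
        sources += [f"blake3_{isa}_x86-64_{suffix}.S" for isa in X86_SIMD]
    elif arch in ("aarch64", "arm64"):
        # NEON is provided as C intrinsics rather than assembly.
        sources.append("blake3_neon.c")
    # Any other architecture builds only the portable implementation.
    return sources
```

For example, `blake3_sources("x86_64", "linux")` yields the three portable files followed by the four `_unix.S` assembly files, while an unrecognized architecture yields only the portable files.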
The BLAKE3 dispatcher will automatically fall back to the portable implementation if a hardware accelerated implementation is unavailable.
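The fallback behaviour can be pictured with a small Python sketch. This is a simplified model, not the actual code in `blake3_dispatch.c`: the function `select_impl`, the feature names, and the single-flag-per-ISA check are assumptions made for illustration (the real dispatcher queries CPU features at runtime and checks more than one flag for some instruction sets).

```python
# Simplified model of runtime dispatch: prefer the widest SIMD
# implementation that is both compiled in and supported by the CPU,
# otherwise fall back to the portable code path.

# Preference order, fastest first (assumed ordering for this sketch).
PREFERENCE = ["avx512", "avx2", "sse41", "sse2", "neon"]


def select_impl(cpu_features: set[str], compiled_in: set[str]) -> str:
    """Pick the implementation name to use for hashing."""
    for isa in PREFERENCE:
        if isa in cpu_features and isa in compiled_in:
            return isa
    # Nothing accelerated is usable: use the portable implementation.
    return "portable"
```

With this model, a binary built with all x86_64 variants but running on a CPU that only reports `sse2` and `sse41` would select `sse41`, and a platform with no compiled-in SIMD variant would select `portable`.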
Performance
I have run benchmarks of the implementation which I detail below.
First, though, some important things to note:
This is the C implementation, which is single-threaded. Although it is very fast, the Rust version, which uses Rayon for multi-threading, scales almost linearly up to memory bandwidth limits, so it is significantly faster on large inputs when multiple cores are available.
The NEON implementation is known to not be nearly as performant as the SSE and AVX implementations:
Test file was generated with
head -c 5G /dev/urandom > ~/Downloads/largefile.bin
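For readers who want a rough baseline on the same kind of test file, the sketch below measures single-threaded hashing throughput using Python's standard `hashlib`. It is not how the numbers in this PR were produced, and the function name `hash_throughput` is made up for this example. `hashlib` does not include BLAKE3, so this covers only the SHA256 and SHA512 comparison points.

```python
import hashlib
import time


def hash_throughput(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks and return (hexdigest, MiB/s)."""
    hasher = hashlib.new(algorithm)
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        # Read in chunks so that large files do not need to fit in memory.
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small inputs.
    mib_per_s = (total / (1 << 20)) / elapsed if elapsed > 0 else float("inf")
    return hasher.hexdigest(), mib_per_s
```

Calling `hash_throughput("largefile.bin", "sha512")` returns the digest and the measured rate. The timing includes file reads, so a warm page cache will noticeably affect the result.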
Apple M3 Max
CFLAGS="-O3 -mcpu=apple-m2" configurePhase (and OPTIMIZE=1)
BLAKE3
AMD Zen 4 Ryzen 9 7950x
CFLAGS="-O3 -march=znver4" configurePhase (and OPTIMIZE=1)
BLAKE3
SHA256
SHA512