guts readme updates
oconnor663 committed Jan 22, 2024
1 parent 1ca383b commit 1a6c1e2
Showing 1 changed file with 54 additions and 36 deletions.
90 changes: 54 additions & 36 deletions rust/guts/readme.md
@@ -2,51 +2,65 @@

## Introduction

This crate contains low-level, high-performance, platform-specific
implementations of the BLAKE3 compression function. This API is complicated and
unsafe, and this crate will never have a stable release. For the standard
BLAKE3 hash function, see the [`blake3`](https://crates.io/crates/blake3)
crate, which depends on this one.

The most important ingredient in a high-performance implementation of BLAKE3 is
parallelism. The BLAKE3 tree structure lets us hash different parts of the tree
in parallel, and modern computers have a _lot_ of parallelism to offer.
Sometimes that means using multiple threads running on multiple cores, but
multithreading isn't appropriate for all applications, and it's not the usual
default for library APIs. More commonly, BLAKE3 implementations use SIMD
instructions ("Single Instruction Multiple Data") to improve the performance of
a single thread. When we do use multithreading, the performance benefits
multiply.

The tricky thing about SIMD is that each instruction set works differently.
Instead of writing portable code once and letting the compiler do most of the
optimization work, we need to write platform-specific implementations, and
sometimes more than one per platform. We maintain *four* different
implementations on x86 alone (targeting SSE2, SSE4.1, AVX2, and AVX-512), in
addition to ARM NEON and the RISC-V vector extensions. In the future we might
add ARM SVE2.

All of that means a lot of duplicated logic and maintenance. So while the main
goal of this API is high performance, it's also important to keep the API as
small and simple as possible. Higher level details like the "CV stack", input
buffering, and multithreading are handled by portable code in the main `blake3`
crate. These are just building blocks.

## The private API

This is the API that each platform reimplements. It's completely `unsafe`,
inputs and outputs are allowed to alias, and bounds checking is the caller's
responsibility.
This [`blake3_guts`](https://crates.io/crates/blake3_guts) sub-crate contains
low-level, high-performance, platform-specific implementations of the BLAKE3
compression function. This API is complicated and unsafe, and this crate will
never have a stable release. Most callers should instead use the
[`blake3`](https://crates.io/crates/blake3) crate, which will eventually depend
on this one internally.

The code you see here (as of January 2024) is an early stage of a large planned
refactor. The motivation for this refactor is a couple of missing features in
both the Rust and C implementations:

- The output side
([`OutputReader`](https://docs.rs/blake3/latest/blake3/struct.OutputReader.html)
in Rust) doesn't take advantage of the most important SIMD optimizations that
compute multiple blocks in parallel. This blocks any project that wants to
use the BLAKE3 XOF as a stream cipher
([[1]](https://github.com/oconnor663/bessie),
[[2]](https://github.com/oconnor663/blake3_aead)).
- Low-level callers like [Bao](https://github.com/oconnor663/bao) that need
interior nodes of the tree also don't get those SIMD optimizations. They have
to use a slow, minimalistic, unstable, doc-hidden module [(also called
`guts`)](https://github.com/BLAKE3-team/BLAKE3/blob/master/src/guts.rs).

The difficulty with adding those features is that they require changes to all
of our optimized assembly and C intrinsics code. That's a couple dozen
different files that are large, platform-specific, difficult to understand, and
full of duplicated code. The higher-level Rust and C implementations of BLAKE3
both depend on these files and will need to coordinate changes.

At the same time, it won't be long before we add support for more platforms:

- RISC-V vector extensions
- ARM SVE
- WebAssembly SIMD

It's important to get this refactor done before new platforms make it even
harder to do.

## The private guts API

This is the API that each platform reimplements, so we want it to be as simple
as possible apart from the high-performance work it needs to do. It's
completely `unsafe`, and inputs and outputs are raw pointers that are allowed
to alias (this matters for `hash_parents`; see below).

- `degree`
- `compress`
- The single compression function, for short inputs and odd-length tails.
- `hash_chunks`
- `hash_parents`
- `xof`
- `xof_xor`
- As `xof` but XOR'ing the result into the output buffer.
- `universal_hash`
- This is a new construction specifically to support
[BLAKE3-AEAD](https://github.com/oconnor663/blake3_aead). Some
implementations might just stub it out with portable code.
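
To make the difference between `xof` and `xof_xor` concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: `toy_block`, `toy_xof`, and `toy_xof_xor` are hypothetical names, the signatures are not the real guts signatures, and the "keystream" is a simple integer mixer standing in for BLAKE3 output. It only shows the shape of the two operations: one overwrites the output buffer, the other XORs the same bytes into it.

```rust
/// Toy stand-in for one block of XOF output. This is NOT BLAKE3; it is a
/// small integer mixer used only so the example is self-contained.
fn toy_block(key: u64, counter: u64) -> [u8; 8] {
    let mut z = key ^ counter.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    (z ^ (z >> 31)).to_le_bytes()
}

/// `xof`-style (hypothetical signature): overwrite `out` with output bytes,
/// starting at block number `counter`.
fn toy_xof(key: u64, counter: u64, out: &mut [u8]) {
    for (i, chunk) in out.chunks_mut(8).enumerate() {
        let block = toy_block(key, counter.wrapping_add(i as u64));
        // The last chunk may be shorter than a full block.
        chunk.copy_from_slice(&block[..chunk.len()]);
    }
}

/// `xof_xor`-style (hypothetical signature): XOR the same output bytes into
/// whatever `out` already holds.
fn toy_xof_xor(key: u64, counter: u64, out: &mut [u8]) {
    for (i, chunk) in out.chunks_mut(8).enumerate() {
        let block = toy_block(key, counter.wrapping_add(i as u64));
        for (dst, src) in chunk.iter_mut().zip(block.iter()) {
            *dst ^= *src;
        }
    }
}

fn main() {
    let mut message = *b"guts readme example";
    let original = message;
    toy_xof_xor(42, 0, &mut message); // "encrypt" in place
    assert_ne!(message, original);
    toy_xof_xor(42, 0, &mut message); // XOR with the same stream undoes it
    assert_eq!(message, original);
}
```

Applying the XOR variant twice with the same key and counter restores the buffer, which is the property a stream cipher built on the XOF relies on. A real implementation would produce many blocks per call with SIMD rather than one at a time.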

## The public API
## The public guts API

This is the API that this crate exposes to callers, i.e. to the main `blake3`
crate. It's a thin, portable layer on top of the private API above. The Rust
@@ -56,7 +70,11 @@ version of this API is memory-safe.
- `compress`
- `hash_chunks`
- `hash_parents`
- This handles most levels of the tree, where we keep hashing `SIMD_DEGREE`
parents at a time.
- `reduce_parents`
- This uses the same `hash_parents` private API, but it handles the top
levels of the tree where we reduce in-place to the root parent node.
- `xof`
- `xof_xor`
- `universal_hash`
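
The in-place reduction described for `reduce_parents` can be sketched as follows. This is a toy under stated assumptions, not the crate's code: `Cv` is shrunk to a `u64` (real BLAKE3 chaining values are 32 bytes), `toy_parent` is a hypothetical mixer rather than the BLAKE3 compression function, the function names and signatures are invented, and root finalization is ignored. What it does show is why aliasing between inputs and outputs is useful: each level writes its parents over the front of the same buffer it reads its children from.

```rust
/// Toy chaining value. Real BLAKE3 CVs are 32 bytes; u64 keeps this short.
type Cv = u64;

/// Hypothetical stand-in for hashing a parent node from two children.
/// This is NOT the BLAKE3 compression function.
fn toy_parent(left: Cv, right: Cv) -> Cv {
    (left.rotate_left(5) ^ right).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Hash one level of the tree in place. Adjacent pairs among the first `len`
/// entries become parents written to the front of the same buffer; an odd
/// trailing child is carried up unchanged. Returns the new number of entries.
fn toy_hash_parents_in_place(cvs: &mut [Cv], len: usize) -> usize {
    let pairs = len / 2;
    for i in 0..pairs {
        // Output slot i is never ahead of the inputs at 2i and 2i+1, so
        // writing here cannot clobber a child that is still unread.
        cvs[i] = toy_parent(cvs[2 * i], cvs[2 * i + 1]);
    }
    if len % 2 == 1 {
        cvs[pairs] = cvs[len - 1];
        pairs + 1
    } else {
        pairs
    }
}

/// Repeat the in-place level hash until a single entry remains.
fn toy_reduce_parents(cvs: &mut [Cv]) -> Cv {
    assert!(!cvs.is_empty(), "need at least one chaining value");
    let mut len = cvs.len();
    while len > 1 {
        len = toy_hash_parents_in_place(cvs, len);
    }
    cvs[0]
}

fn main() {
    let mut cvs = [1, 2, 3, 4, 5];
    let root = toy_reduce_parents(&mut cvs);
    // Five leaves reduce as ((1,2),(3,4)) on the left, with 5 carried up.
    let expected = toy_parent(toy_parent(toy_parent(1, 2), toy_parent(3, 4)), 5);
    assert_eq!(root, expected);
}
```

A SIMD implementation would hash several pairs per call instead of looping one pair at a time, but the read-ahead, write-behind access pattern stays the same.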
