
Consider multi-message batching #386

Open
riptl opened this issue Feb 23, 2024 · 0 comments
riptl commented Feb 23, 2024

Problem

The current blake3 crate leaves a lot of single-core performance on the table for message sizes below 8 KiB.

Namely, it doesn't SIMD-parallelize hashing for small messages.

As a PoC, I've rewritten a BLAKE3 scheduler from scratch with a modified AVX2 backend:
https://github.com/firedancer-io/firedancer/tree/ripatel/fd_blake3/src/ballet/blake3

When hashing many independent 2 KiB messages concurrently, my implementation does 25 Gbps, while the C implementation does ~7 Gbps.


I would like to contribute back my changes to this official library.
My code is Apache-2.0 licensed, so feel free to copy from it.

Suggested Changes

There are three major pieces required:

  1. Adapt the SIMD backends to operate on each lane independently, namely:
    • Support lane masking (for when one AVX lane finishes hashing before another does) -- currently, all lanes are always active
    • Support independent chunk counters and flags -- the current backend assumes a contiguous range of chunks or parents
  2. Adapt the scheduler to dispatch operations from multiple in-flight hash calculations to the SIMD backend concurrently
    • I'm not sure whether the hash tree scheduling algorithm proposed in the BLAKE3 paper is capable of doing so.
    • When queueing operations for an in-flight hash state and there are not enough chunks to fill the SIMD degree, the algorithm should yield to the next in-flight hash state before actually starting to hash.
    • I've rewritten the scheduler from scratch, but it requires log2(chunk_cnt) * simd_degree * 32 bytes of working space per hash state. The algorithm I came up with is unfortunately much more complex than the elegant stack-based one in the paper.
  3. Adapt the high-level API to tell the scheduler if there are multiple in-flight hash states
    • The simplest way is a new function call: fn blake3_multi(messages: &[&[u8]]) -> Vec<[u8; 32]>
    • Another way is to use thread-local storage to keep track of streaming operations on the current thread
    s1 := Blake3::new()
    s2 := Blake3::new()
    s1.append("abcd");  // registers this append operation as a thread-local
    s2.append("1234");
    hash2 := s2.fini();  // finds that s1 is also queued via thread-locals, so hashes both s1 and s2
    hash1 := s1.fini();  // no-op! the result is already available
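The yield-then-batch behavior in point 2 can be sketched as follows. This is a hypothetical illustration, not the blake3 crate's API: `HashState`, `compress_batch`, and `schedule` are invented names, and `compress_batch` is a stand-in for a real SIMD backend with lane masking. The sketch only counts dispatches to show how chunks from several in-flight states share one batch.

```rust
const SIMD_DEGREE: usize = 8; // lanes per batched compress call (e.g. AVX2)

// Minimal stand-in for an in-flight hash state: just the chunks it still owes.
struct HashState {
    id: usize,
    chunks_left: usize,
}

// Placeholder for the SIMD backend. Each lane is (state id, chunk index);
// a real backend would need per-lane counters/flags and lane masking
// whenever fewer than SIMD_DEGREE lanes are active.
fn compress_batch(lanes: &[(usize, usize)]) {
    assert!(!lanes.is_empty() && lanes.len() <= SIMD_DEGREE);
}

// Round-robin over in-flight states: a state that cannot fill the batch on
// its own yields to the next state instead of dispatching a partial batch.
// Returns the number of batched compress calls issued.
fn schedule(states: &mut [HashState]) -> usize {
    let mut dispatches = 0;
    while states.iter().any(|s| s.chunks_left > 0) {
        let mut lanes: Vec<(usize, usize)> = Vec::with_capacity(SIMD_DEGREE);
        for s in states.iter_mut() {
            while s.chunks_left > 0 && lanes.len() < SIMD_DEGREE {
                s.chunks_left -= 1;
                lanes.push((s.id, s.chunks_left));
            }
            if lanes.len() == SIMD_DEGREE {
                break;
            }
        }
        compress_batch(&lanes); // may be a masked (partial) batch at the tail
        dispatches += 1;
    }
    dispatches
}

fn main() {
    // Two 2 KiB messages = 2 chunks each; together they fill half a batch,
    // so both are hashed in a single masked dispatch instead of two.
    let mut states = [
        HashState { id: 0, chunks_left: 2 },
        HashState { id: 1, chunks_left: 2 },
    ];
    println!("{}", schedule(&mut states)); // prints 1
}
```

With the stack-based scheduler from the paper, each of these two messages would issue its own (mostly empty) compress call; batching across states is what recovers the idle lanes.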
@riptl riptl changed the title Consider multi-block batching Consider multi-message batching Feb 23, 2024