
Consider multi-message batching #386

Open
riptl opened this issue Feb 23, 2024 · 0 comments
riptl commented Feb 23, 2024

Problem

The current blake3 crate leaves a lot of single-core performance on the table for message sizes below 8 KiB.

Namely, it doesn't SIMD-parallelize hashing for small messages.

As a PoC, I've rewritten a BLAKE3 scheduler from scratch with a modified AVX2 backend:
https://github.com/firedancer-io/firedancer/tree/ripatel/fd_blake3/src/ballet/blake3

When hashing many independent 2 KiB messages concurrently, my implementation does 25 Gbps, while the C implementation does ~7 Gbps.


I would like to contribute back my changes to this official library.
My code is Apache-2.0 licensed, so feel free to copy from it.

Suggested Changes

There are three major pieces required:

  1. Adapt the SIMD backends to operate on each lane independently, namely:
    • Support lane masking (for when one AVX lane finishes hashing before another does) -- currently, all lanes are always active
    • Support independent chunk counters and flags -- the current backend assumes a contiguous range of chunks or parents
  2. Adapt the scheduler to dispatch operations from multiple in-flight hash calculations to the SIMD backend concurrently
    • I'm not sure whether the hash tree scheduling algorithm proposed in the BLAKE3 paper is capable of doing so.
    • When queueing operations for an in-flight hash state and there are not enough chunks to fill the SIMD degree, the algorithm should yield to the next in-flight hash state before actually starting to hash.
    • I've rewritten the scheduler from scratch, but it requires log2(chunk_cnt) * simd_degree * 32 bytes of working space per hash state. The algorithm I came up with is unfortunately much more complex than the elegant stack-based one in the paper.
  3. Adapt the high-level API to tell the scheduler if there are multiple in-flight hash states
    • The simplest way is a new function call: fn blake3_multi(messages: &[&[u8]]) -> Vec<[u8; 32]>
    • Another way is to use thread-local storage to keep track of streaming operations on the current thread
    s1 := Blake3::new()
    s2 := Blake3::new()
    s1.append("abcd");  // registers this append operation as a thread-local
    s2.append("1234");
    hash2 := s2.fini();  // finds that s1 is also queued via thread-locals, so hashes both s1 and s2
    hash1 := s1.fini();  // no-op! the result is already available
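The yield-then-batch behavior in point 2 can be sketched as follows. This is a hypothetical illustration, not the blake3 crate's API: `HashState`, `compress_batch`, and `schedule` are invented names, and `compress_batch` is a stand-in for a real SIMD backend with lane masking. The sketch only counts dispatches to show how chunks from several in-flight states share one batch.

```rust
const SIMD_DEGREE: usize = 8; // lanes per batched compress call (e.g. AVX2)

// Minimal stand-in for an in-flight hash state: just the chunks it still owes.
struct HashState {
    id: usize,
    chunks_left: usize,
}

// Placeholder for the SIMD backend. Each lane is (state id, chunk index);
// a real backend would need per-lane counters/flags and lane masking
// whenever fewer than SIMD_DEGREE lanes are active.
fn compress_batch(lanes: &[(usize, usize)]) {
    assert!(!lanes.is_empty() && lanes.len() <= SIMD_DEGREE);
}

// Round-robin over in-flight states: a state that cannot fill the batch on
// its own yields to the next state instead of dispatching a partial batch.
// Returns the number of batched compress calls issued.
fn schedule(states: &mut [HashState]) -> usize {
    let mut dispatches = 0;
    while states.iter().any(|s| s.chunks_left > 0) {
        let mut lanes: Vec<(usize, usize)> = Vec::with_capacity(SIMD_DEGREE);
        for s in states.iter_mut() {
            while s.chunks_left > 0 && lanes.len() < SIMD_DEGREE {
                s.chunks_left -= 1;
                lanes.push((s.id, s.chunks_left));
            }
            if lanes.len() == SIMD_DEGREE {
                break;
            }
        }
        compress_batch(&lanes); // may be a masked (partial) batch at the tail
        dispatches += 1;
    }
    dispatches
}

fn main() {
    // Two 2 KiB messages = 2 chunks each; together they fill half a batch,
    // so both are hashed in a single masked dispatch instead of two.
    let mut states = [
        HashState { id: 0, chunks_left: 2 },
        HashState { id: 1, chunks_left: 2 },
    ];
    println!("{}", schedule(&mut states)); // prints 1
}
```

With the stack-based scheduler from the paper, each of these two messages would issue its own (mostly empty) compress call; batching across states is what recovers the idle lanes.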
@riptl riptl changed the title Consider multi-block batching Consider multi-message batching Feb 23, 2024