Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression in bao tree? #1288

Closed
rklaehn opened this issue Jul 20, 2023 · 6 comments
Closed

Performance regression in bao tree? #1288

rklaehn opened this issue Jul 20, 2023 · 6 comments
Labels
bug Something isn't working c-iroh-bytes perf performance related issues

Comments

@rklaehn
Copy link
Contributor

rklaehn commented Jul 20, 2023

Played around with ML data sets again.

For some reason ingestion of a new file is much slower in 0.5 than it used to be in 0.4.1.

The only thing that is done when adding a file is computing the outboard, synchronously, via bao_tree::io::outboard_post_order.

I don't remember much changing regarding this in bao-tree. The only noteworthy thing was moving from ouroboros to self_cell.

Another culprit could be the new threading concept, but since this is working on the blocking pool I find this unlikely.

@rklaehn rklaehn added perf performance related issues bug Something isn't working c-iroh-bytes labels Jul 20, 2023
@rklaehn
Copy link
Contributor Author

rklaehn commented Jul 28, 2023

Looked into this a bit. It seems that iroh add is about half as fast in 0.5.2 as in 0.4.1. Looking at the flamegraph does not reveal any immediate insights. Most time is spent in the blake3 crate, as you would expect.

Since it is 1/2 as fast as before, maybe I am doing something twice?

One thing that is notable is that this is using the portable impl on m1 mac because the neon SIMD feature is not automatically detected. See from the crate docs:

# The NEON implementation does not participate in dynamic feature detection,
# which is currently x86-only. If "neon" is on, NEON support is assumed. Note
# that AArch64 always supports NEON, but support on ARMv7 varies. The NEON
# implementation uses C intrinsics and requires a C compiler.
neon = []

We could manually enable neon on macs.

@rklaehn
Copy link
Contributor Author

rklaehn commented Jul 29, 2023

It looks like abao is more efficient when computing a chunk group hash compared to bao-tree.

@rklaehn
Copy link
Contributor Author

rklaehn commented Jul 29, 2023

I still have no idea whatsoever what caused the regression. Looking at the flamegraph showed that the version I got locally is using bao-tree and isn't that old. Might be because of release-optimized and LTO.

But I looked into this a bit more. There is currently a limitation of all bao implementations (bao, my abao, and bao-tree) that they can not tap into the performance of the blake3 crate.

The blake3 crate provides efficient ways to hash data (blake3::hash and blake3::Hasher), but these ways do not provide a way to

  • start at a non zero chunk offset and
  • compute a non-root hash.

The functions that are exposed in blake3::guts that allow working with non zero chunk offsets and producing non root hashes are not as efficient. (blake3::guts::parent_cv and blake3::guts::ChunkState).

What is needed is to expose a function

pub fn hash_block(start_chunk: u64, data: &[u8], is_root: bool) -> Hash {

in blake3::guts that would benefit all bao impls.

@rklaehn
Copy link
Contributor Author

rklaehn commented Jul 29, 2023

Added a PR to BLAKE3: BLAKE3-team/BLAKE3#329

@rklaehn
Copy link
Contributor Author

rklaehn commented Aug 3, 2023

Fixed by #1322 or if the above PR gets merged and we can use it.

@b5
Copy link
Member

b5 commented Aug 16, 2023

marking as closed by #1322, feel free to re-open if you'd like to use this to track the BLAKE3 PR @rklaehn, but I wouldn't hold my breath :)

@b5 b5 closed this as completed Aug 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working c-iroh-bytes perf performance related issues
Projects
Archived in project
Development

No branches or pull requests

2 participants