
NEON version is only 26% faster than portable on Raspberry Pi 4 #310

Open
1f604 opened this issue Jun 8, 2023 · 4 comments

Comments

@1f604
Contributor

1f604 commented Jun 8, 2023

Hi all,

I compiled example.c with and without NEON support on my Raspberry Pi 4 and got these results (using the same 2 GB test file):

  • sha1sum file: 12.7s
  • sha256sum file: 18.9s
  • cat file | example program portable: 12s
  • cat file | example program NEON: 9.5s
  • md5sum file: 9.8s
  • xxhsum file: 1.9s
  • cat file | xxhsum: 4s

I also installed Rust and b3sum and got these results:

  • b3sum 1 thread no mmap: 8s
  • b3sum 4 threads no mmap: 8s
  • b3sum 1 thread: 7.9s
  • b3sum 2 threads: 4s
  • b3sum 3 threads: 2.7s
  • b3sum 4 threads: 2s
  • b3sum 16 threads: 2s
  • cat file | b3sum: 10s

The running time is clearly not IO-dominated: xxhsum hashed the file in about 2 seconds while the NEON-compiled example took 9.5 seconds. Piping the file into b3sum instead of calling b3sum file directly does add about 2 seconds to the running time, but even after shaving those 2 seconds off for stdin piping, it's clear that most of the time is spent in the CPU rather than on IO.

So the results show that the NEON version of BLAKE3 is only about 26% faster than the portable version.

I don't understand why compiling with and without NEON makes so little difference.

I would have assumed the NEON version would be at least 400% faster than the portable one. Is this expected?

Maybe it is due to GCC producing bad NEON code? Is there an assembly version?

I am using GCC 10.2.1.

Thanks a lot!

EDIT: Compiling with clang 11.0.1-2 instead of GCC improved performance by about 7% (9.5s -> 8.9s average). I did not notice a difference after PGO with either GCC or clang.

@oconnor663
Member

I don't know enough about NEON performance to give you a good answer, but I can say that my own experience on the RPi4 was pretty similar. It's a small boost over portable. I've seen cases where other ARM CPUs get closer to a 2x speedup though.

It's just about guaranteed that there's low-hanging fruit in the NEON code, and that we could speed it up by finding my dumb mistakes. blake3_neon.c is a pretty naive port of the SSE4.1 implementation, and it's the only NEON code I've ever written.

@rhpvorderman

> I would have assumed that NEON version would be at least 400% faster than portable. Is this expected?

Native integers on ARM64 are 64 bits wide, while NEON registers are 128 bits wide, so the maximum attainable speedup is 2x. (In theory NEON can do better when it offers operations that have no scalar-integer equivalent and those can be leveraged, but that is rather unusual.)

@oconnor663
Member

BLAKE3 uses 32-bit words, though, so I think it makes sense to say the maximum attainable speedup is 4x?

@sneves
Collaborator

sneves commented Jun 17, 2023

The microarchitecture matters more than the instruction set. The Raspberry Pi 4 uses a Cortex-A72, and looking at the instruction properties we see that it can execute 2 scalar adds/xors/rotations per cycle (and some rotations may come for free). With NEON, we have 2 adds and xors per cycle, but no native rotation instruction, which is replaced by shl+shr+orr, the first 2 of which can only be dispatched to one execution unit per cycle.

The arithmetic operation count of the BLAKE3 core can be approximated by 336 adds, 224 xors, and 224 rotations. Since there is sufficient parallelism within the round, the bottleneck is instruction throughput, and we can lower-bound the cost at (336+224+224)/2/64 ≈ 6.125 cycles per byte for scalar code. For NEON, on the other hand, we get (336/2 + 224/2 + 224*(1+1+1/2))/(64*4) ≈ 3.28 cycles per byte. Looking at eBASH we see a measured value of 4.78 cycles per byte.

So based on basic arithmetic costs alone, we are limited to at best a little under 2x speedup for NEON on this chip. The remainder of the overhead could be attributed to the rest of the compression function operations (e.g., transposing the message into place) or poor GCC code generation; this microarchitecture is not very wide, so instruction scheduling could still make a significant difference here. Hard to say without looking at specifics.
