
Implement RVV backend #372

Closed · wants to merge 2 commits

Conversation

silvanshade

No description provided.

c/CMakeLists.txt (Outdated) · Comment on lines 3 to 16
set(CMAKE_GENERATOR Ninja)
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_CROSSCOMPILING_EMULATOR qemu-riscv64-static)
set(CMAKE_ASM_COMPILER clang-17)
set(CMAKE_ASM_COMPILER_TARGET riscv64-unknown-linux-gnu)
set(CMAKE_ASM_FLAGS_INIT "-march=rv64gcv1p0")
set(CMAKE_C_COMPILER clang-17)
set(CMAKE_C_COMPILER_TARGET riscv64-unknown-linux-gnu)
set(CMAKE_C_FLAGS_INIT "-march=rv64gcv1p0")
set(CMAKE_CXX_COMPILER clang++-17)
set(CMAKE_CXX_COMPILER_TARGET riscv64-unknown-linux-gnu)
set(CMAKE_CXX_FLAGS_INIT "-flto=thin -march=rv64gcv1p0")
set(CMAKE_EXE_LINKER_FLAGS "-fuse-ld=lld-17")
Collaborator


I recommend using CMakePresets as they are quite a bit more ergonomic.
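For illustration, a minimal `CMakePresets.json` carrying the same cross-compilation settings might look like the sketch below (the preset name, binary dir, and the choice to cover only the C settings are assumptions for the example, not part of this PR):

```json
{
  "version": 6,
  "configurePresets": [
    {
      "name": "riscv64-rvv",
      "generator": "Ninja",
      "binaryDir": "${sourceDir}/build/riscv64-rvv",
      "cacheVariables": {
        "CMAKE_BUILD_TYPE": "Release",
        "CMAKE_SYSTEM_NAME": "Linux",
        "CMAKE_CROSSCOMPILING_EMULATOR": "qemu-riscv64-static",
        "CMAKE_C_COMPILER": "clang-17",
        "CMAKE_C_COMPILER_TARGET": "riscv64-unknown-linux-gnu",
        "CMAKE_C_FLAGS_INIT": "-march=rv64gcv1p0",
        "CMAKE_EXE_LINKER_FLAGS": "-fuse-ld=lld-17"
      }
    }
  ]
}
```

A preset keeps the toolchain choices out of `CMakeLists.txt` itself, so the project still configures normally for native builds and the cross setup is selected with `cmake --preset riscv64-rvv`.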

@oconnor663
Member

I have less free time for code reviews than I used to, so apologies in advance for taking a while to get to this. You might be interested in an RVV assembly implementation that I've been working on here: https://github.com/BLAKE3-team/BLAKE3/blob/guts_api/rust/guts/src/riscv_rva23u64.S. Unfortunately that branch is tied to a large refactoring, which makes it hard for me to land it in master.

@silvanshade
Author

silvanshade commented Jan 18, 2024

@oconnor663 Oh cool, I didn't realize there was already some implementation work for RVV.

I'll probably give it a closer look soon but just out of curiosity, what state is it in? Any idea about the performance characteristics of it or anything else interesting to note?

Also, have you done any work on an SVE backend?

@oconnor663
Member

(I just pushed a commit to clean up some function names, so you might need to refresh the page if you still have that .S file open.)

My implementation uses the Zbb and Zvbb extensions, so I don't think it will run on most real chips yet, even those that support V 1.0. I've been doing all the development under Qemu, so I've never done any real benchmarks, but it is passing tests. The missing work that makes it hard to land this is porting other SIMD implementations to this new API. I've done AVX-512 on that branch, but I need to do SSE2/4.1 and AVX2. There was also a minor perf regression in AVX512 that I'll need to track down. Then there are loose ends to tie up around e.g. MSVC-flavored assembly.

Most of the heavy lifting in the parallel implementation (which is what really matters for performance) is in blake3_guts_riscv_rva23u64_kernel, but that code is pretty straightforward without any significant open questions. There are more questions about how transposition should be done in calling functions like blake3_guts_riscv_rva23u64_hash_blocks, which currently uses vlsseg8e32.v. That instruction might be slow on real hardware, and I might need to experiment with doing simpler loads and then transposing in registers.
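As a concrete illustration of the layout question (a hedged scalar sketch, not the PR's code): a segment load like `vlsseg8e32.v` effectively gathers word `i` of every block into its own vector register, i.e. it performs a row/column transpose during the load. In plain Python terms:

```python
# Sketch of the data layout a segment load like vlsseg8e32.v produces:
# given N message blocks of 8 words each, gather word i of every block
# into one "lane register" -- a plain transpose.

def transpose_blocks(blocks):
    """blocks: N rows of 8 words -> 8 lanes of N words each."""
    return [[block[i] for block in blocks] for i in range(8)]

# Four toy blocks whose words are just sequential integers.
blocks = [[b * 8 + w for w in range(8)] for b in range(4)]
lanes = transpose_blocks(blocks)
assert lanes[0] == [0, 8, 16, 24]  # word 0 of each block
```

The open hardware question mentioned above is whether a single strided segment load is faster than doing simple unit-stride loads and performing this transpose in vector registers afterwards.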

I haven't tried ARM SVE yet, no. (Also the NEON implementation in master almost certainly has some perf mistakes that someone more experienced could spot.)

@silvanshade
Author

> My implementation uses the Zbb and Zvbb extensions, so I don't think it will run on most real chips yet, even those that support V 1.0. I've been doing all the development under Qemu, so I've never done any real benchmarks, but it is passing tests.

Interesting. Thanks for the information.

I've also been doing most of my experimentation under QEMU. I did recently get hold of a Pioneer (SG2042), but it only supports RVV 0.7.1, and I haven't even tried to get tooling to work with that yet (in fact, I've barely gotten it to boot, heh). But it might be interesting to try to adapt what you have (sans Zbb/Zvbb and whatever else is missing).

> The missing work that makes it hard to land this is porting other SIMD implementations to this new API. I've done AVX-512 on that branch, but I need to do SSE2/4.1 and AVX2.

I'd be interested in helping with that effort if you'd like. If you could give me some pointers on where to start or whatever, I'd certainly take a look.

> There are more questions about how transposition should be done in calling functions like blake3_guts_riscv_rva23u64_hash_blocks, which currently uses vlsseg8e32.v. That instruction might be slow on real hardware, and I might need to experiment with doing simpler loads and then transposing in registers.

Yeah, I noticed that. Seemed interesting. I'm also wondering how that will work out.

> I haven't tried ARM SVE yet, no.

I was really looking for an interesting project to try something VLA-related, but since it seems like you've mostly solved the RVV side, maybe I'll give SVE a try instead.

> (Also the NEON implementation in master almost certainly has some perf mistakes that someone more experienced could spot.)

I actually made an attempt to finish the missing parts of the NEON implementation at #369. I'm certainly not an expert, though, and this was my first real attempt at using NEON for anything.

Like you suggested, though, implementing compress didn't make any practical difference. I tried a few different approaches there, but overall nothing seemed to help. I'm guessing it will be hard to get better performance without some more fundamental redesign of the algorithm, though I don't even know what that would look like. I suspect all the shuffling in particular is hard to make efficient in NEON.

One thing I was thinking about, for better performance on Apple Silicon at least, is to try an implementation using Metal, making use of the unified memory model to avoid the latency issues that made the Vulkan implementation (and a SYCL version I saw elsewhere) not very usable.

Another thing I've been wondering about is whether it might be possible to use the AMX coprocessor for some parts of the algorithm, perhaps genlut in particular.

Anyway, interesting stuff. Let me know if there's some way I can help with that branch or maybe if you have some suggestions for other ideas worth exploring.
