
Interest in GPU-accelerated scalar multiplications? #506

Open
rickwebiii opened this issue Feb 13, 2023 · 4 comments

@rickwebiii commented Feb 13, 2023

A ZKP system we're implementing internally uses a modified Bulletproofs inner product proof. Scalar multiplication (SM) accounts for 75% of our proof's runtime, so to speed it up we implemented SM in Metal (OpenCL/Vulkan coming soon). Is there any interest in merging these changes back upstream?

Note that we've sped up computing many independent scalar multiplications in parallel, not a multi-scalar multiplication (MSM), though we may do that in the future as well.

The basic idea

  • Implement new ScalarVec and RistrettoPointVec types. These hold an array of Scalars or EdwardsPoints respectively, with the data packed so the GPU can perform coalesced loads and stores; i.e., in a vector with n elements, the jth limb of the ith element is at arr[j * n + i] (see the sketch after this list).
  • The Metal compute shaders are C++ transliterated from the 32-bit backend. The shaders' RistrettoPoint and Scalar types feature additional methods for packing and unpacking between device memory and registers.
  • All computation occurs in GPU registers.
  • We launch a kernel on a 1-D grid of N threads with 64 threads per threadgroup.
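
For illustration, here's a minimal host-side Rust sketch of the limb-major packing described above. The names (ScalarVec, LIMBS, pack) and the nine 29-bit limbs per scalar are assumptions taken from the 32-bit backend's representation, not the actual API:

```rust
/// Minimal sketch of the coalesced, limb-major layout (hypothetical names).
/// Assumes the u32 backend's nine 29-bit limbs per scalar.
const LIMBS: usize = 9;

pub struct ScalarVec {
    n: usize,
    /// Length n * LIMBS; limb j of element i lives at data[j * n + i],
    /// so adjacent GPU threads load adjacent words (coalesced access).
    data: Vec<u32>,
}

impl ScalarVec {
    pub fn pack(scalars: &[[u32; LIMBS]]) -> Self {
        let n = scalars.len();
        let mut data = vec![0u32; n * LIMBS];
        for (i, limbs) in scalars.iter().enumerate() {
            for (j, &limb) in limbs.iter().enumerate() {
                data[j * n + i] = limb; // limb-major, element-minor
            }
        }
        ScalarVec { n, data }
    }

    /// Limb j of element i, mirroring what thread i loads on the GPU.
    pub fn limb(&self, i: usize, j: usize) -> u32 {
        self.data[j * self.n + i]
    }
}
```

A RistrettoPointVec would presumably pack each point's field-element limbs the same way.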

Performance

We've so far only run this on Mac hardware (which is what Metal supports) and have pretty good results. On an entry-level 2020 M1 MacBook Air, the integrated GPU achieves 431k SM/s, while using every core of the CPU gets you only 150k SM/s (64-bit backend). For comparison, our measurements indicate that the 64-bit backend on aarch64 is slightly faster (single-core) than the AVX2 backend on a c6 AWS x86_64 instance.

We've also run this benchmark on a co-worker's M2 Max with a 30-core GPU, which performs 1.8M SM/s. That is, a low-power laptop GPU beats a 64-core AWS instance (AVX2 backend) by 1.8x.

Work remaining

  • Either an OpenCL or Vulkan implementation
  • Share work between the CPU and GPU on systems where the GPU isn't much more powerful than the CPU.
  • A few pieces of code may not be constant time (e.g. the Radix-16 lookup tables). While this doesn't matter in our proof since both the prover and verifier are performing the same SMs, we don't want others to misuse this.
  • This work currently requires the serial 32-bit backend. We should add conversion methods between the FieldElement* and Scalar* types for different backends so users can use any bit width and backend on the CPU (see the sketch after this list).
  • Benchmark this on a real GPU, like an RTX 4090.
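
On the backend-conversion item above, one possible approach (a sketch, not the implemented path) is to go through the canonical 32-byte encoding from Scalar::to_bytes() and repack into 29-bit limbs on the host, so the GPU side doesn't care which CPU backend produced the scalar. The helper below is hypothetical, written bit-by-bit for clarity, and is neither optimized nor constant time:

```rust
/// Hypothetical helper: repack a canonical little-endian 32-byte scalar
/// (e.g. the output of Scalar::to_bytes()) into the nine 29-bit limbs
/// the GPU kernels expect, without touching backend-internal types.
fn scalar_bytes_to_limbs29(bytes: &[u8; 32]) -> [u32; 9] {
    let mut limbs = [0u32; 9];
    for bit in 0..256 {
        // Copy bit `bit` of the 256-bit little-endian integer into limb bit/29.
        if (bytes[bit / 8] >> (bit % 8)) & 1 == 1 {
            limbs[bit / 29] |= 1u32 << (bit % 29);
        }
    }
    limbs
}
```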

Why you might not want this

At the very least, these GPU backends should be enabled under a feature. While the GPU shaders are transliterated from Rust code, they aren't Rust and don't come with its safety guarantees.

If it's not appropriate to integrate this work into curve25519-dalek, we need an API for accessing the currently internal Field members on EdwardsPoint and an API for converting/accessing Scalars as Scalar29.
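
To make that request concrete, here is a rough sketch of what a feature-gated accessor might look like. It's written as if it lived inside curve25519-dalek (it relies on the private X, Y, Z, T fields), and the feature name and method are hypothetical:

```rust
// Sketch only: not an existing curve25519-dalek API.
#[cfg(feature = "hazmat")]
impl EdwardsPoint {
    /// Borrow the internal extended (X, Y, Z, T) field elements so
    /// external code (e.g. a GPU packer) can serialize them directly.
    pub fn extended_coordinates(
        &self,
    ) -> (&FieldElement, &FieldElement, &FieldElement, &FieldElement) {
        (&self.X, &self.Y, &self.Z, &self.T)
    }
}
```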

@tarcieri (Contributor)

> Is there any interest in merging these changes back upstream?

The main thing that seems tricky about upstreaming this is the lack of a CI story. Is there any way to test it in an environment like GitHub Actions without access to a hardware GPU?

> If it's not appropriate to integrate this work into curve25519-dalek, we need an API for accessing the currently internal Field members on EdwardsPoint and an API for converting/accessing Scalars as Scalar29.

These types are both deliberately kept out of the public API as they're easily misused internals. However, we have discussed exposing at least FieldElement under a special feature like hazmat.

@rickwebiii (Author)

> The main thing that seems tricky about upstreaming this is the lack of a CI story. Is there any way to test it in an environment like GitHub Actions without access to a hardware GPU?

If you use a self-hosted runner, you can have whatever hardware you want, but you have to pay the maintenance costs that come with that. We have a work item on our end to address this for our use case, so maybe we could try it out and let you know how annoying (or not) it is.

> These types are both deliberately kept out of the public API as they're easily misused internals. However, we have discussed exposing at least FieldElement under a special feature like hazmat.

As a user, I've appreciated not having to deal with these internals by accident. A hazmat feature is a clean way to handle this.

@tarcieri (Contributor)

> If you use a self-hosted runner, you can have whatever hardware you want, but you have to pay the maintenance costs that come with that.

I don't think there's a budget for, or interest in, maintaining a self-hosted runner.

@rickwebiii (Author)

After doing a bit of research, it looks like it might be possible to use the Mesa lavapipe driver to run the compute kernels on the CPU. This needs a bunch of investigation, but I'll post an update if I get a test working.
