
ARMv7 optimization #213

Open
timethernom opened this issue Apr 28, 2024 · 19 comments

@timethernom

I would like to speed up signature verification using the ARMv7 instruction set (ARM4 or ARM33 device). Verification takes about 3 seconds with a BLST_PORTABLE build. Could you hazard a guess at how much speedup I could get with an ARMv7 asm implementation?

By comparison, ECDSA signature verification takes about 75 msec with the asm from https://github.com/kmackay/micro-ecc, using about 40K of static lookup tables. Am I correct that lookup tables play less of a role with BLS?
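
For context, here is roughly what a single verification boils down to with blst's core-verify API. This is a sketch only: it assumes the min-pubkey-size variant (public key in G1, signature in G2) and a placeholder DST, which may not match the actual scheme in use.

```c
#include <stdint.h>
#include <stddef.h>
#include "blst.h"

/* Illustrative sketch: verify one BLS signature, min-pubkey-size variant
 * (PK in G1, signature in G2). The DST is a placeholder. */
static const char DST[] = "BLS_SIG_BLS12381G2_XMD:SHA-256_SSWU_RO_NUL_";

int verify_one(const uint8_t pk_comp[48], const uint8_t sig_comp[96],
               const uint8_t *msg, size_t msg_len)
{
    blst_p1_affine pk;
    blst_p2_affine sig;

    if (blst_p1_uncompress(&pk, pk_comp) != BLST_SUCCESS ||
        blst_p2_uncompress(&sig, sig_comp) != BLST_SUCCESS)
        return 0;

    /* One call covers hash-to-curve, two Miller loops and the final
     * exponentiation, which is where the ~3 seconds on the M4 go. */
    return blst_core_verify_pk_in_g1(&pk, &sig, 1 /* hash, not encode */,
                                     msg, msg_len,
                                     (const uint8_t *)DST, sizeof(DST) - 1,
                                     NULL, 0) == BLST_SUCCESS;
}
```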

@mratsim
Contributor

mratsim commented Apr 29, 2024

Pairings (BLS signatures) are significantly slower than scalar multiplications (ECDSA).
In my old BLST benches (status-im/nim-blst#1) there was a ratio of about 6x, and that is before accounting for the double pairing and hash-to-curve that BLS verification needs.

If you look at micro-ecc, it uses ARM DSP assembly built around the UMAAL instruction: https://github.com/kmackay/micro-ecc/blob/master/asm_arm_mult_square_umaal.inc

This paper, https://eprint.iacr.org/2018/700.pdf, has some timings for generic Montgomery-based field multiplication (like BLST) with NEON (it also covers UMAAL), and the ratio is 6.5x.

If you want to speed up signature verification, maybe explore batch verification: if you can verify 20+ signatures at once, the total verification time should be lower than that of 20+ ECDSA signatures.
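
Roughly, with blst's pairing accumulator the batched check looks like the sketch below. It assumes the min-pubkey-size variant and the aggregate API as I remember it from blst.h (blst_pairing_init / blst_pairing_aggregate_pk_in_g1 / blst_pairing_commit / blst_pairing_finalverify), so double-check the prototypes:

```c
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>
#include "blst.h"

/* Sketch: amortize verification over a batch by accumulating all
 * (pk, msg, sig) triples into one pairing context, so the expensive
 * final exponentiation is paid once rather than once per signature. */
int verify_batch(const blst_p1_affine *pks, const blst_p2_affine *sigs,
                 const uint8_t *const *msgs, const size_t *msg_lens,
                 size_t n, const uint8_t *dst, size_t dst_len)
{
    blst_pairing *ctx = malloc(blst_pairing_sizeof());
    int ok = ctx != NULL;

    if (ok) {
        blst_pairing_init(ctx, 1 /* hash-to-curve */, dst, dst_len);
        for (size_t i = 0; i < n && ok; i++)
            ok = blst_pairing_aggregate_pk_in_g1(ctx, &pks[i], &sigs[i],
                                                 msgs[i], msg_lens[i],
                                                 NULL, 0) == BLST_SUCCESS;
        blst_pairing_commit(ctx);
        ok = ok && blst_pairing_finalverify(ctx, NULL);
    }
    free(ctx);
    return ok;
}
```

For batches coming from untrusted sources you would want the randomized variants (the mul_n_aggregate family in blst.h, if I recall the names correctly) rather than plain aggregation, and on a micro-controller the malloc would likely be replaced with a statically sized buffer.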

@dot-asm
Collaborator

dot-asm commented Apr 29, 2024

NEON

Since there is no such thing as an "ARM4 or ARM33 device," the question is more likely about Cortex-M4/M33 micro-controllers, which don't have a NEON unit. And since UMAAL is optional on the M33, it's safer to assume it's not an option either. This is not to say that assembly won't give a meaningful improvement, but it won't be as impressive as suggested.

verify

One can wonder why one would want to task a micro-controller with an operation as heavy as BLS signature verification. But either way, as pointed out by @mratsim, there is no way to get BLS anywhere near ECDSA in terms of wall-clock performance. Even signature generation is computationally heavier, let alone verification.

@timethernom
Author

Right, Cortex-M4/M33 is what I meant, and I can require UMAAL if it delivers a significant gain. The application is delivering a consensus proof to a hardware wallet. We will probably use a Schnorr threshold signature scheme like FROST, but I would like to avoid extra steps that may fail. 3 seconds is too slow; 0.5 seconds might be viable. Signing and aggregation speed is not a concern.

@dot-asm
Collaborator

dot-asm commented Apr 29, 2024

I can require UMAAL if it delivers a significant gain.

Depends on the definition of "significant," but if taking it from 3 s to 0.5 s is the goal, UMAAL won't get you there, no.

@dot-asm
Collaborator

dot-asm commented Apr 29, 2024

BTW, how did you compile it exactly? Do you use, say, the Rust bindings, or do you make raw C calls?

@timethernom
Author

I just added server.c to a Keil project. My (out-of-date) Keil compiler threw some warnings about the inline functions calling static functions in no_asm.h, and then crashed. I got it to build by removing the static function declarations in no_asm.h.

This was just to see if I could get into the ballpark. I expected not, given the RFC showing BLS roughly 30x slower than ECDSA. But the blst library was such a smooth build and test on macOS that I decided it was worth checking out on one of my devices.

@dot-asm
Collaborator

dot-asm commented Apr 29, 2024

I just added server.c to a Keil project.

The question is what the optimization level was. A possible pitfall is that assembler-assisted ECDSA won't be that dependent on the optimization level, while pure C is very dependent on it. Secondly, if your target executable format is ELF, you should be able to compile libblst.a with a newer gcc and link your project against it. The thing is that blst has no explicit dependencies on the language run-time, which should give you the freedom to mix compilers. It would be a good idea to add the -ffreestanding flag, e.g. ./build.sh CROSS_COMPILE=arm-none-eabi- -ffreestanding. What I'm trying to ensure is that the 3 s baseline is actually representative.

@dot-asm
Collaborator

dot-asm commented Apr 29, 2024

./build.sh CROSS_COMPILE=arm-none-eabi- -ffreestanding

Hmm, gcc leaves a reference to memset (without us calling it) and to __aeabi_uidivmod. ./build.sh CC=clang --target=arm-none-eabi -ffreestanding is another option. This still leaves a reference to __aeabi_uidivmod, which originates in multi_scalar.c. If it's a problem, -ffunction-sections is likely to remedy it. Try even -O3...
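
If the stray memset reference is what blocks a freestanding link, a trivial stand-in like the one below should do; this is an assumption about your environment rather than something blst itself requires:

```c
#include <stddef.h>

/* Minimal memset for a freestanding link where the compiler still emits
 * calls to it (gcc is entitled to do that even without explicit calls).
 * Slow but sufficient; use the toolchain's version if one is available. */
void *memset(void *s, int c, size_t n)
{
    unsigned char *p = s;

    while (n--)
        *p++ = (unsigned char)c;
    return s;
}
```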

@dot-asm
Collaborator

dot-asm commented Apr 29, 2024

Oh! And of course you'd need to throw in the -mcpu=cortex-m4 option.

@dot-asm
Collaborator

dot-asm commented Apr 29, 2024

Oh! And of course you'd need to throw in the -mcpu=cortex-m4 option.

BTW, clang even issues umaal instructions, so assembly won't give you an edge with respect to hardware-capability utilization. This is still not to say that it won't give a meaningful improvement, but it won't be a direct result of utilizing "magic" instructions.
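
For reference, UMAAL computes RdHi:RdLo = Rn*Rm + RdHi + RdLo, and clang recognizes the plain C idiom for it on DSP-capable cores, which is why the generic Montgomery code already benefits. A minimal illustration of the pattern (not blst code):

```c
#include <stdint.h>

/* The C idiom that clang can lower to a single UMAAL on Cortex-M4/M33
 * with DSP: a 32x32->64 multiply plus two 32-bit additions. The sum
 * cannot overflow 64 bits, so no extra carry handling is needed. */
static inline uint64_t mul_add_add(uint32_t a, uint32_t b,
                                   uint32_t lo, uint32_t hi)
{
    return (uint64_t)a * b + lo + hi;
}
```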

@timethernom
Author

Thanks! I will try this and see if there is any performance difference. This was with -O3 optimization.

@timethernom
Author

Easier said than done. Keil does not like the library format, so I will have to dig around to see what is required there.

I guess the main acceleration in the kmackay ECDSA comes from the 20K lookup table. In my intended BLS application, the same public key will be used repeatedly. Is there any possibility to take advantage of that? It's not the batch scenario, because the signatures arrive over time. But there would be room to cache some pre-computes.

@dot-asm
Collaborator

dot-asm commented Apr 30, 2024

Easier said than done. Keil does not like the library format, so I will have to dig around to see what is required there.

Giving it server.o might work better...

I guess the main acceleration in the kmackay ECDSA comes from the 20K lookup table. In my intended BLS application, the same public key will be used repeatedly. Is there any possibility to take advantage of that? It's not the batch scenario, because the signatures arrive over time. But there would be room to cache some pre-computes.

You make it sound as if the lookup table makes ECDSA 6 times faster. It's not. It does yield a meaningful improvement, but it's not "magical." BLS is just significantly computationally heavier than ECDSA. But to the point: while there is a thing that can be pre-computed in BLS, the gain is marginal; it won't get you anywhere close to where you want to be.
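
For completeness, the pre-computable thing is the set of line evaluations for fixed G2 inputs, which only helps in the min-signature-size variant (public key in G2, signature in G1), where both G2 arguments of the two pairings are fixed. Below is a hedged sketch using the line-precomputation helpers declared in blst.h; check the header for the exact prototypes, including the size of the Qlines arrays. Hash-to-curve and the final exponentiation are untouched, hence the marginal gain:

```c
#include <stdint.h>
#include <stddef.h>
#include "blst.h"

/* Hedged sketch, min-signature-size variant: cache the Miller-loop line
 * evaluations for the two fixed G2 points (the public key and the negated
 * G2 generator), then reuse them for every signature under that key. */
typedef struct {
    blst_fp6 pk_lines[68];      /* lines for the (fixed) public key   */
    blst_fp6 neg_gen_lines[68]; /* lines for the negated G2 generator */
} pk_precomp;

void precompute_pk(pk_precomp *pre, const blst_p2_affine *pk)
{
    blst_p2 neg_gen = *blst_p2_generator();
    blst_p2_affine neg_gen_aff;

    blst_p2_cneg(&neg_gen, 1);
    blst_p2_to_affine(&neg_gen_aff, &neg_gen);

    blst_precompute_lines(pre->pk_lines, pk);
    blst_precompute_lines(pre->neg_gen_lines, &neg_gen_aff);
}

/* Accept if e(H(msg), PK) * e(sig, -G2gen) == 1. */
int verify_with_precomp(const pk_precomp *pre, const blst_p1_affine *sig,
                        const uint8_t *msg, size_t msg_len,
                        const uint8_t *dst, size_t dst_len)
{
    blst_p1 hm;
    blst_p1_affine hm_aff;
    blst_fp12 left, right;

    blst_hash_to_g1(&hm, msg, msg_len, dst, dst_len, NULL, 0);
    blst_p1_to_affine(&hm_aff, &hm);

    blst_miller_loop_lines(&left, pre->pk_lines, &hm_aff);
    blst_miller_loop_lines(&right, pre->neg_gen_lines, sig);
    blst_fp12_mul(&left, &left, &right);
    blst_final_exp(&left, &left);

    return blst_fp12_is_one(&left);
}
```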

@dot-asm
Collaborator

dot-asm commented May 1, 2024

BLS is just significantly computationally heavier than ECDSA.

For an embedded system the correct answer is likely to be an integrated co-processor. One wishes OpenTitan were an accessible option, but that doesn't seem to be the case. Otherwise there is, for example, the Kinetis K8x with usable capabilities in the Cortex-M4 space...

@timethernom
Author

The Kinetis K8x looks almost identical to what we are currently running on, an Infineon PSoC 62. We are currently running the 150 MHz M4 at 48 MHz to save power, so there's a 3x gain available there. It also has "12 programmable logic blocks, each with 8 Macrocells and an 8-bit data path (called universal digital blocks or UDBs)". Not sure if that can be of any use.

The kmackay ECDSA with uECC_OPTIMIZATION_LEVEL 0 takes 10 sec to verify one signature. The big leap comes at level 1, which drops it to 170 msec with no change in image size. It drops further to 75 msec at level 3, with a 40K jump in image size. So I guess the precomputes deliver about a 2x gain; the big win was something else. Probably level 0 is just a very simplistic reference implementation.

Using server.o revealed the need to add -fshort-enums to the build, for compatibility with our existing compile options. But then I got a bunch of link errors about relocating objects, mostly related to stdio. This is mysterious. Will investigate further some time.

@dot-asm
Collaborator

dot-asm commented May 2, 2024

But then I got a bunch of link errors about relocating objects, mostly related to stdio.

As already mentioned, server.c has no dependencies on the C run-time, i.e. no references to stdio or any other std facility. But ./build.sh does compile it with -fPIC by default, and it's not impossible to imagine that this might disorient your linker and confuse it into blaming everybody for its failure. Add -fno-pic to override it (or don't pass -fPIC if you compile server.c yourself).

@timethernom
Author

No doubt the linker was confused. Removing -fPIC avoided that. But there is no significant performance difference: clang -O2 was about 50 msec slower than the Keil -O3 build, and clang -O3 was about 50 msec faster, based on a single sample. Thanks for your help.

@dot-asm
Collaborator

dot-asm commented May 3, 2024

To summarize: it's the ability to keep whole vectors in the register bank that translates into performance "leaps." Most notably, one can just barely fit a 256-bit vector in the ARM32 registers(*), but not a 384-bit one. BLS operates on 384-bit inputs, which is basically why one can't expect an impressive N-times improvement no matter what you do here. This is still not to say that assembly won't provide a meaningful improvement, but it won't take you anywhere near the stated objective.

As for the Kinetis K8x: the K82 is the only one with relevant capabilities, more specifically a modular arithmetic unit. However, I don't actually have hands-on experience with it, so don't consider the mention as "this one will actually solve the problem."

(*) Which is what the referenced micro-ecc does at a high enough uECC_OPTIMIZATION_LEVEL.

@dot-asm
Collaborator

dot-asm commented May 5, 2024

don't consider the [Kinetis K82] mention as "this one will actually solve the problem."

Just in case: one has to recognize that the co-processor is still power-constrained, which in practice means it won't run as fast as one would wish. It surely can deliver significant improvements on "large" operations such as modular exponentiation and ECC arithmetic, but the most computationally intensive part of BLS signature verification would have to rely on individual field operations (multiplications, additions, etc.), and it's not impossible to imagine that the overhead of frequent data transfers to and from the co-processor could add up and diminish the wow factor.
