
Optimised modular arithmetic targeting elliptic curve operations #86

Merged
merged 50 commits on Oct 25, 2021

Conversation

daosvik
Contributor

@daosvik daosvik commented Aug 13, 2021

In the context of verkle trees we need to optimise modular arithmetic for elliptic curve calculations.

This PR introduces speed optimisations for moduli from 193 to 256 bits in length:

  • it does not use division when adding already-reduced operands, which is the typical case in applications of modular arithmetic
  • it uses Barrett reduction, caching the mu value (the reciprocal of the modulus) for each modulus encountered, removing the need to perform a division for each reduction

@gballet
Contributor

gballet commented Aug 19, 2021

Before:

goos: freebsd
goarch: amd64
pkg: github.com/holiman/uint256
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkMulMod
BenchmarkMulMod/small/uint256
BenchmarkMulMod/small/uint256-8         	24091966	        51.95 ns/op
BenchmarkMulMod/small/big
BenchmarkMulMod/small/big-8             	13057272	        93.92 ns/op
BenchmarkMulMod/mod64/uint256
BenchmarkMulMod/mod64/uint256-8         	 6775998	       162.3 ns/op
BenchmarkMulMod/mod64/big
BenchmarkMulMod/mod64/big-8             	 3723324	       324.2 ns/op
BenchmarkMulMod/mod128/uint256
BenchmarkMulMod/mod128/uint256-8        	 4329858	       247.9 ns/op
BenchmarkMulMod/mod128/big
BenchmarkMulMod/mod128/big-8            	 2085337	       693.0 ns/op
BenchmarkMulMod/mod192/uint256
BenchmarkMulMod/mod192/uint256-8        	 3751779	       318.3 ns/op
BenchmarkMulMod/mod192/big
BenchmarkMulMod/mod192/big-8            	 1637002	       760.2 ns/op
BenchmarkMulMod/mod256/uint256
BenchmarkMulMod/mod256/uint256-8        	 3333724	       343.7 ns/op
BenchmarkMulMod/mod256/big
BenchmarkMulMod/mod256/big-8            	 1535760	       763.2 ns/op
PASS
ok  	github.com/holiman/uint256	15.822s

After:

goos: freebsd
goarch: amd64
pkg: github.com/holiman/uint256
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkMulMod
BenchmarkMulMod/small/uint256
BenchmarkMulMod/small/uint256-8         	25852512	        46.44 ns/op
BenchmarkMulMod/small/big
BenchmarkMulMod/small/big-8             	13739356	        88.47 ns/op
BenchmarkMulMod/mod64/uint256
BenchmarkMulMod/mod64/uint256-8         	 6403490	       196.1 ns/op
BenchmarkMulMod/mod64/big
BenchmarkMulMod/mod64/big-8             	 3167760	       394.4 ns/op
BenchmarkMulMod/mod128/uint256
BenchmarkMulMod/mod128/uint256-8        	 3329642	       338.7 ns/op
BenchmarkMulMod/mod128/big
BenchmarkMulMod/mod128/big-8            	 1510978	       829.5 ns/op
BenchmarkMulMod/mod192/uint256
BenchmarkMulMod/mod192/uint256-8        	 3204769	       357.8 ns/op
BenchmarkMulMod/mod192/big
BenchmarkMulMod/mod192/big-8            	 1512756	       778.7 ns/op
BenchmarkMulMod/mod256/uint256
BenchmarkMulMod/mod256/uint256-8        	 5124606	       233.6 ns/op
BenchmarkMulMod/mod256/big
BenchmarkMulMod/mod256/big-8            	 1604929	       720.6 ns/op
    benchmarks_test.go:712: Hits/Misses	4788461/1349390
PASS
ok  	github.com/holiman/uint256	17.042s

@gballet
Contributor

gballet commented Aug 19, 2021

Running benchstat:

(master)> go test -v -run none -bench MulMod -count 5  -benchmem > master.txt
(master)> git checkout opt_modarith
(opt_modarith)> go test -v -run none -bench MulMod -count 5  -benchmem > daosvik.txt
> benchstat master.txt daosvik.txt 
name                     old time/op    new time/op    delta
MulMod/small/uint256-8     46.0ns ±28%    38.7ns ± 2%     ~     (p=0.063 n=5+4)
MulMod/small/big-8          115ns ± 8%     118ns ± 6%     ~     (p=0.460 n=5+5)
MulMod/mod64/uint256-8      174ns ± 8%     172ns ± 6%     ~     (p=1.000 n=5+5)
MulMod/mod64/big-8          332ns ±11%     331ns ± 4%     ~     (p=0.548 n=5+5)
MulMod/mod128/uint256-8     280ns ± 6%     282ns ± 0%     ~     (p=0.730 n=5+4)
MulMod/mod128/big-8         699ns ± 3%     755ns ±12%     ~     (p=0.056 n=5+5)
MulMod/mod192/uint256-8     287ns ± 8%     377ns ±17%  +31.49%  (p=0.008 n=5+5)
MulMod/mod192/big-8         675ns ± 3%     743ns ± 7%  +10.17%  (p=0.016 n=5+5)
MulMod/mod256/uint256-8     288ns ±10%     215ns ±12%  -25.47%  (p=0.008 n=5+5)
MulMod/mod256/big-8         659ns ± 8%     682ns ± 1%     ~     (p=0.190 n=5+4)

There seems to be a significant performance hit. These numbers are from a very specific platform, though; I wonder what the results would be on more standard platforms?

@daosvik
Contributor Author

daosvik commented Aug 19, 2021

Here are my MulMod benchmark results on amd64 and arm64, with go1.13.8 on Ubuntu 20.04.3 LTS.

6-core Ryzen 5 3600 running at 3.6 GHz (boost is off):

MulMod/small/uint256-12     30.7ns ± 0%    31.4ns ± 0%   +2.41%  (p=0.016 n=5+4)
MulMod/small/big-12         69.9ns ± 2%    70.5ns ± 1%     ~     (p=0.254 n=5+5)
MulMod/mod64/uint256-12     93.7ns ± 1%    95.6ns ± 0%   +1.99%  (p=0.008 n=5+5)
MulMod/mod64/big-12          247ns ± 1%     248ns ± 1%     ~     (p=0.833 n=5+5)
MulMod/mod128/uint256-12     167ns ± 0%     164ns ± 0%   -1.80%  (p=0.029 n=4+4)
MulMod/mod128/big-12         453ns ± 2%     467ns ± 2%   +2.91%  (p=0.016 n=5+5)
MulMod/mod192/uint256-12     160ns ± 0%     159ns ± 1%   -0.88%  (p=0.048 n=5+5)
MulMod/mod192/big-12         414ns ± 1%     420ns ± 1%     ~     (p=0.063 n=5+5)
MulMod/mod256/uint256-12     155ns ± 0%     108ns ± 0%  -30.32%  (p=0.008 n=5+5)
MulMod/mod256/big-12         397ns ± 2%     406ns ± 2%     ~     (p=0.087 n=5+5)

4-core Core i7-1065G7 running at 1.3 GHz:

MulMod/small/uint256-8     91.3ns ± 0%    90.2ns ± 6%     ~     (p=0.690 n=5+5)
MulMod/small/big-8          132ns ± 0%     130ns ± 0%   -1.52%  (p=0.000 n=5+4)
MulMod/mod64/uint256-8      212ns ± 0%     205ns ± 0%   -3.67%  (p=0.008 n=5+5)
MulMod/mod64/big-8          337ns ± 0%     337ns ± 0%     ~     (p=1.000 n=5+5)
MulMod/mod128/uint256-8     433ns ± 0%     424ns ± 0%   -2.12%  (p=0.016 n=5+4)
MulMod/mod128/big-8         805ns ± 0%     799ns ± 0%   -0.77%  (p=0.008 n=5+5)
MulMod/mod192/uint256-8     418ns ± 1%     412ns ± 0%   -1.44%  (p=0.016 n=5+4)
MulMod/mod192/big-8         785ns ±14%     752ns ± 0%     ~     (p=0.056 n=5+5)
MulMod/mod256/uint256-8     412ns ± 0%     320ns ±11%  -22.23%  (p=0.016 n=4+5)
MulMod/mod256/big-8         719ns ± 0%     715ns ± 1%     ~     (p=0.127 n=5+5)

4-core Cortex-A72 running at 1.5 GHz (Raspberry Pi 4):

MulMod/small/uint256-4      186ns ± 0%     185ns ± 0%   -0.54%  (p=0.008 n=5+5)
MulMod/small/big-4          375ns ± 0%     375ns ± 1%     ~     (p=0.841 n=5+5)
MulMod/mod64/uint256-4      416ns ± 0%     413ns ± 0%   -0.86%  (p=0.008 n=5+5)
MulMod/mod64/big-4         1.07µs ± 1%    1.08µs ± 1%     ~     (p=0.095 n=5+5)
MulMod/mod128/uint256-4     741ns ± 0%     736ns ± 0%   -0.78%  (p=0.008 n=5+5)
MulMod/mod128/big-4        2.01µs ± 2%    2.01µs ± 2%     ~     (p=0.690 n=5+5)
MulMod/mod192/uint256-4     746ns ± 0%     742ns ± 0%   -0.56%  (p=0.016 n=5+5)
MulMod/mod192/big-4        1.89µs ± 1%    1.91µs ± 1%   +0.95%  (p=0.008 n=5+5)
MulMod/mod256/uint256-4     721ns ± 0%     612ns ± 0%  -15.17%  (p=0.008 n=5+5)
MulMod/mod256/big-4        1.78µs ± 1%    1.79µs ± 1%     ~     (p=0.690 n=5+5)

@holiman
Owner

holiman commented Aug 19, 2021

on cpu: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz

go version
go version go1.16.4 linux/amd64
name                     old time/op    new time/op    delta
MulMod/small/uint256-6     36.8ns ± 3%    36.3ns ± 1%     ~     (p=0.206 n=5+5)
MulMod/small/big-6         70.1ns ±13%    65.9ns ± 4%     ~     (p=0.238 n=5+5)
MulMod/mod64/uint256-6      117ns ± 9%     105ns ± 3%  -10.22%  (p=0.008 n=5+5)
MulMod/mod64/big-6          238ns ± 1%     216ns ± 1%   -9.30%  (p=0.008 n=5+5)
MulMod/mod128/uint256-6     211ns ± 9%     173ns ± 0%  -18.18%  (p=0.016 n=5+4)
MulMod/mod128/big-6         572ns ±11%     504ns ± 3%  -11.96%  (p=0.032 n=5+5)
MulMod/mod192/uint256-6     195ns ± 4%     194ns ± 1%     ~     (p=0.841 n=5+5)
MulMod/mod192/big-6         563ns ± 6%     474ns ± 2%  -15.91%  (p=0.008 n=5+5)
MulMod/mod256/uint256-6     226ns ± 7%     124ns ± 3%  -45.02%  (p=0.008 n=5+5)
MulMod/mod256/big-6         518ns ± 5%     451ns ± 5%  -12.94%  (p=0.008 n=5+5)

So that also looks better than @gballet's numbers

@holiman
Owner

holiman commented Aug 19, 2021

with go1.13.8 on Ubuntu 20.04.3 LTS.

@daosvik you should definitely update that, it's very old, and your benchmarks may be totally off.

@gballet
Contributor

gballet commented Aug 19, 2021

On an AWS x5.large instance, I get:

name                     old time/op    new time/op    delta
MulMod/small/uint256-4     44.0ns ± 0%    44.1ns ± 0%   +0.35%  (p=0.008 n=5+5)
MulMod/small/big-4         74.0ns ± 0%    74.8ns ± 0%   +1.06%  (p=0.008 n=5+5)
MulMod/mod64/uint256-4      127ns ± 0%     127ns ± 0%   -0.21%  (p=0.024 n=5+5)
MulMod/mod64/big-4          243ns ± 0%     245ns ± 0%   +0.75%  (p=0.008 n=5+5)
MulMod/mod128/uint256-4     207ns ± 0%     208ns ± 0%   +0.45%  (p=0.016 n=4+5)
MulMod/mod128/big-4         516ns ± 0%     520ns ± 0%   +0.73%  (p=0.008 n=5+5)
MulMod/mod192/uint256-4     212ns ± 0%     212ns ± 0%     ~     (p=0.087 n=5+5)
MulMod/mod192/big-4         493ns ± 0%     497ns ± 0%   +0.82%  (p=0.008 n=5+5)
MulMod/mod256/uint256-4     212ns ± 0%     132ns ± 0%  -37.80%  (p=0.000 n=5+4)
MulMod/mod256/big-4         466ns ± 0%     470ns ± 0%   +0.79%  (p=0.008 n=5+5)

@holiman
Owner

holiman commented Aug 19, 2021

@daosvik another thing -- you're adding/changing some benchmarks. It would help if you moved those changes so that your very first commit is the benchmark-change, and after that, the commits that modify the code.

That way, it's possible to compare benchmarks for "master" (master + new/modified benches) against PR, by just jumping between commits. Which is pretty difficult right now, when the bench-changes are mixed with code changes.

@daosvik
Contributor Author

daosvik commented Aug 19, 2021

@daosvik another thing -- you're adding/changing some benchmarks. It would help if you moved those changes so that your very first commit is the benchmark-change, and after that, the commits that modify the code.

That way, it's possible to compare benchmarks for "master" (master + new/modified benches) against PR, by just jumping between commits. Which is pretty difficult right now, when the bench-changes are mixed with code changes.

Sorry about that - I almost got it right. The first commit adds an unnecessary printout of cache stats, which you can delete or comment out to generate the baseline benchmark.

@gballet
Contributor

gballet commented Aug 19, 2021

Right, so I hadn't seen that the resource-heavy go language server was running in the background. Without it, I do get results that are closer to expectation:

name                     old time/op    new time/op    delta
MulMod/small/uint256-8     52.2ns ±24%    52.5ns ±43%     ~     (p=1.000 n=5+5)
MulMod/small/big-8          113ns ± 5%     113ns ± 8%     ~     (p=1.000 n=5+5)
MulMod/mod64/uint256-8      168ns ± 9%     172ns ±13%     ~     (p=0.841 n=5+5)
MulMod/mod64/big-8          335ns ± 7%     364ns ±23%     ~     (p=0.222 n=5+5)
MulMod/mod128/uint256-8     278ns ± 4%     275ns ±18%     ~     (p=0.310 n=5+5)
MulMod/mod128/big-8         722ns ± 7%     738ns ±11%     ~     (p=0.690 n=5+5)
MulMod/mod192/uint256-8     291ns ± 6%     310ns ± 4%     ~     (p=0.095 n=5+5)
MulMod/mod192/big-8         685ns ± 7%     659ns ± 4%     ~     (p=0.548 n=5+5)
MulMod/mod256/uint256-8     295ns ± 8%     192ns ± 2%  -35.01%  (p=0.008 n=5+5)
MulMod/mod256/big-8         644ns ±10%     657ns ±13%     ~     (p=0.548 n=5+5)

mod.go Outdated
// return
//}

func shiftleft(x Int, s uint) (z Int) {
Owner

Do we need both this method and the existing function

// Lsh sets z = x << n and returns z.
func (z *Int) Lsh(x *Int, n uint) *Int {

?

Contributor Author

We don't need both, but there was a performance improvement over Lsh on Ryzen, so I want to do some more benchmarking before removing one in favour of the other.

mod.go Outdated
// - otherwise, the result is normalised to have non-zero most significant word
// - starts with a 32-bit division, refines with newton-raphson iterations

func reciprocal(m Int) (mu [5]uint64) {
Owner

As this implementation is fairly nontrivial, it would be good if you could add some links to the origin of the algorithm, so the astute reader has something to verify against

Contributor Author

I created this algorithm for this specific purpose, so there is no paper or website to refer to as the origin of the algorithm.
It's really only computing the reciprocal of the modulus with 320-bit precision, taking care to avoid overflow.

mod.go Outdated
return
}

cache[cacheIndex].miss++
Owner

Isn't this an unsafe concurrent modification, since you're only holding the readlock ? x++ is not atomic

(and if so, same a few lines above with hit++)

Contributor Author

Yes, it is unsafe, but it is not very important to have an exact result for these cache statistics.

@holiman
Owner

holiman commented Aug 19, 2021

Another benchmark -- if the cache is disabled:

name                     old time/op    new time/op    delta
MulMod/small/uint256-6     36.8ns ± 3%    36.6ns ± 3%     ~     (p=0.310 n=5+5)
MulMod/small/big-6         70.1ns ±13%    65.7ns ± 1%     ~     (p=0.310 n=5+5)
MulMod/mod64/uint256-6      117ns ± 9%     106ns ± 2%  -10.12%  (p=0.008 n=5+5)
MulMod/mod64/big-6          238ns ± 1%     220ns ± 5%   -7.70%  (p=0.008 n=5+5)
MulMod/mod128/uint256-6     211ns ± 9%     171ns ± 0%  -18.93%  (p=0.016 n=5+4)
MulMod/mod128/big-6         572ns ±11%     504ns ± 2%  -11.85%  (p=0.016 n=5+5)
MulMod/mod192/uint256-6     195ns ± 4%     196ns ± 2%     ~     (p=0.548 n=5+5)
MulMod/mod192/big-6         563ns ± 6%     483ns ± 5%  -14.20%  (p=0.008 n=5+5)
MulMod/mod256/uint256-6     226ns ± 7%     182ns ± 2%  -19.40%  (p=0.008 n=5+5)
MulMod/mod256/big-6         518ns ± 5%     451ns ± 3%  -12.81%  (p=0.008 n=5+5)

And the difference between with-cache and without-cache:

[user@work uint256]$ benchstat daosvik.txt daosvik_nocache.txt 
name                     old time/op    new time/op    delta
MulMod/small/uint256-6     36.3ns ± 1%    36.6ns ± 3%     ~     (p=0.286 n=5+5)
MulMod/small/big-6         65.9ns ± 4%    65.7ns ± 1%     ~     (p=1.000 n=5+5)
MulMod/mod64/uint256-6      105ns ± 3%     106ns ± 2%     ~     (p=0.452 n=5+5)
MulMod/mod64/big-6          216ns ± 1%     220ns ± 5%     ~     (p=0.167 n=5+5)
MulMod/mod128/uint256-6     173ns ± 0%     171ns ± 0%   -0.92%  (p=0.029 n=4+4)
MulMod/mod128/big-6         504ns ± 3%     504ns ± 2%     ~     (p=0.690 n=5+5)
MulMod/mod192/uint256-6     194ns ± 1%     196ns ± 2%     ~     (p=0.548 n=5+5)
MulMod/mod192/big-6         474ns ± 2%     483ns ± 5%     ~     (p=0.310 n=5+5)
MulMod/mod256/uint256-6     124ns ± 3%     182ns ± 2%  +46.58%  (p=0.008 n=5+5)
MulMod/mod256/big-6         451ns ± 5%     451ns ± 3%     ~     (p=1.000 n=5+5)

@daosvik daosvik marked this pull request as ready for review August 19, 2021 14:58
mod.go Outdated

// Check for reciprocal in the cache

cacheIndex := (m[3]^m[2]^m[1]^m[0]) & cacheMask
Owner

@holiman holiman Aug 19, 2021

I'm thinking that the entirety of the caching can be moved out of this function. So we have

func reciprocal(...){
...
if cached, ok := cache.has(m); ok{
    return cached
}
...
// do calc
cache.store(m, mu)
return mu
}

Then you'd also have to define a new type:

type reciprocalCache  [cacheSets]cacheSet
var cache reciprocalCache

func( c reciprocalCache) has(...) bool{
...
}

That de-clutters things a bit. It should not be a loss of speed, if things are inlined properly by the compiler.

Owner

Then you could also move cacheStats and make it cache.stats(), which is nicer

Contributor Author

I tried this, but it absolutely killed performance no matter how I tried massaging the code

Owner

Really? I just pushed a change which, so far, just makes the cacheStats separate. I'll look into whether I can massage the second change so it doesn't affect performance

Owner

At least, my first change doesn't affect performance (I'm using BenchmarkMulMod/mod256/uint256-6 to check it)

@holiman
Owner

holiman commented Aug 20, 2021

@chfast would you mind taking a peek at this too?

Collaborator

@chfast chfast left a comment

I only reviewed AddMod. I prefer to handle AddMod first, as it is much simpler. Also, the benchmark changes can be merged before the functional changes.

I tried to optimize AddMod for similar cases but it made the worst case up to 2x slower. However, here the change is different so I will revisit this. https://github.com/chfast/intx/pull/206/files

I also don't understand the motivation fully. Are ADDMOD and MULMOD so slow for verkle-tree workloads that the node is not able to process blocks in time? If not, is a gas cost reduction also planned?

uint256.go Outdated

if (m[3] != 0) && (m[3] >= x[3]) && (m[3] >= y[3]) {
if z, overflow := z.AddOverflow(x, y); overflow {
z.Sub(z, m)
Collaborator

Contributor Author

Not an exact equivalent, no. This code handles addition for m >= 2^192, with x,y <= m+2^192-1. This includes already-reduced values for x and y, that is 0 <= x,y < m.

uint256.go Outdated
for {
t := *z
if _, overflow := t.SubOverflow(&t, m); overflow {
break
Collaborator

How many times can this loop run?

Contributor Author

The worst case here is m = 2^192, x = y = 2m-1, so x+y = 4m-2. Hence the loop is exited at the latest in the fourth iteration. For already-reduced inputs, this happens no later than the second iteration.

Collaborator

This comment should be added to the code.

Does this procedure have a name?

Contributor Author

Comment added.
This procedure doesn't have any particular name, it's just repeated subtraction of the modulus until we have the smallest non-negative residue.

uint256.go Outdated
// Fast path for m >= 2^192, with x and y at most slightly bigger than m.
// This is always the case when x and y are already reduced modulo such m.

if (m[3] != 0) && (m[3] >= x[3]) && (m[3] >= y[3]) {
Collaborator

This only handles the happy case, but if m[3] < x[3], maybe it would be better to first perform x mod m and then also use the more efficient implementation. Same for y mod m.

Contributor Author

Happy case indeed, and also the normal case for modular arithmetic. :)

The code falls back to the slower division-based code for other cases, and will always return a fully-reduced result.

Collaborator

Ok, I can see now we cannot efficiently use SubOverflow to reduce x here.

@holiman
Owner

holiman commented Aug 23, 2021

Also, the benchmark changes can be merged before the functional changes.

Yeah, I noted the same. @daosvik can you make a separate PR with the benchmark changes, and then we can rebase this one on top of that one (or on top of master after we merge it). If not, I can do it myself too.

@gballet
Contributor

gballet commented Aug 23, 2021

I also don't understand the motivation fully. Are ADDMOD and MULMOD so slow for verkle-tree workloads that the node is not able to process blocks in time? If not, is a gas cost reduction also planned?

So this PR seeks to accelerate one specific case of modular arithmetic, namely the prime-field operations used for elliptic curves. It isn't connected to the speed of the ADDMOD and MULMOD EVM instructions, but to the calculation of the root commitment (the equivalent of the root hash), so it cannot really be addressed with gas costs.

@chfast
Collaborator

chfast commented Aug 23, 2021

I also don't understand the motivation fully. Are ADDMOD and MULMOD so slow for verkle-tree workloads that the node is not able to process blocks in time? If not, is a gas cost reduction also planned?

So this PR seeks to accelerate one specific case of modular arithmetic, namely the prime-field operations used for elliptic curves. It isn't connected to the speed of the ADDMOD and MULMOD EVM instructions, but to the calculation of the root commitment (the equivalent of the root hash), so it cannot really be addressed with gas costs.

To confirm: you want to use these methods outside of the EVM.

Collaborator

@chfast chfast left a comment

  • There is a bug in AddMod(). You can find failing tests in "tests: more AddMod() test cases" #89.
  • The test TestRandomMulMod runs forever, which indicates there may also be a bug in MulMod().
  • @holiman, CI testing does not run for external PRs; this should be enabled.

@holiman
Owner

holiman commented Aug 24, 2021

@holiman, the CI testing does not run for external PR, this should be enabled.

I think this is addressed now.

@holiman
Owner

holiman commented Aug 24, 2021

Some remaining tasks for this PR:

  • Rebase on top of latest master (benchmarks are now in master)
  • Move the caching implementation so it's not part of reciprocal
  • Look into the test failures which @chfast reported

@codecov

codecov bot commented Aug 26, 2021

Codecov Report

Merging #86 (295482d) into master (71c9c37) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##            master       #86    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            3         4     +1     
  Lines         1047      1364   +317     
==========================================
+ Hits          1047      1364   +317     

)

s = *x
if _, overflow = s.SubOverflow(&s, m); overflow {
Collaborator

Suggested change
if _, overflow = s.SubOverflow(&s, m); overflow {
if _, overflow = s.SubOverflow(x, m); overflow {

Contributor Author

The idea is good, but somehow this causes the compiler (1.17) to slow down AddMod/mod192 by more than 17%

}

t = *y
if _, overflow = t.SubOverflow(&t, m); overflow {
Collaborator

Suggested change
if _, overflow = t.SubOverflow(&t, m); overflow {
if _, overflow = t.SubOverflow(y, m); overflow {

}

t = s
if _, overflow = s.SubOverflow(&s, m); overflow {
Collaborator

You can also AddOverflow and SubOverflow one after another and then select the result based on the two overflow flags. As in https://github.com/chfast/intx/pull/206/files#diff-56937962f5980bbff23d21ab8da34df2b7df07b21546691156a2e7a79d763f15R1026-R1028.

Contributor Author

Good suggestion again, but the time goes up about 30% with this

@gballet
Contributor

gballet commented Aug 26, 2021

To confirm: you want to use these methods outside of EVM.

Our primary target is indeed to compute the state root faster. If the EVM can benefit from that, then all the better.

@holiman
Owner

holiman commented Aug 27, 2021

I go-formatted the files. I'm a bit surprised that the CI linter didn't complain about this earlier.

@holiman
Owner

holiman commented Aug 27, 2021

This PR is now rebased on master. @daosvik, you'll need to git pull, which will tell you it can't be applied, then do a git reset --hard origin/opt_modarith.

If you want to keep your old branch, you can first stash it away, like git checkout -b old_op_modarith; git checkout opt_modarith

@daosvik
Contributor Author

daosvik commented Aug 27, 2021

I go-formatted the files. I'm a bit surprised that the CI linter didn't complain about this earlier.

Uh-oh. I'll have to postpone that till the code is ready, as this reduces my productivity.

@holiman
Owner

holiman commented Sep 21, 2021

Even without the cache, I still see a speed up:

name                     old time/op  new time/op  delta
MulMod/small/uint256-6   32.0ns ± 1%  32.4ns ± 4%     ~     (p=0.064 n=8+10)
MulMod/mod64/uint256-6   74.3ns ± 1%  75.7ns ± 1%   +1.87%  (p=0.000 n=8+9)
MulMod/mod128/uint256-6   122ns ± 2%   121ns ± 3%     ~     (p=0.092 n=9+8)
MulMod/mod192/uint256-6   149ns ± 4%   148ns ± 0%     ~     (p=0.813 n=9+9)
MulMod/mod256/uint256-6   187ns ± 5%   160ns ± 9%  -14.26%  (p=0.000 n=9+10)
MulMod/small/big-6       61.3ns ±17%  57.2ns ± 1%     ~     (p=0.780 n=10+9)
MulMod/mod64/big-6        112ns ± 9%   113ns ± 6%     ~     (p=0.725 n=10+10)
MulMod/mod128/big-6       365ns ± 7%   357ns ± 2%     ~     (p=0.182 n=10+9)
MulMod/mod192/big-6       408ns ± 1%   416ns ± 6%     ~     (p=0.779 n=9+9)
MulMod/mod256/big-6       473ns ± 4%   485ns ± 6%     ~     (p=0.280 n=10+10)

@holiman
Owner

holiman commented Sep 21, 2021

And also for addmod:

[user@work uint256]$ benchstat addmod_before.txt addmod_after.txt 
name                     old time/op  new time/op  delta
AddMod/small/uint256-6   12.0ns ± 2%  12.2ns ± 2%   +1.65%  (p=0.016 n=10+9)
AddMod/mod64/uint256-6   19.3ns ± 3%  19.4ns ± 0%     ~     (p=0.055 n=9+8)
AddMod/mod128/uint256-6  43.3ns ± 2%  42.0ns ± 6%     ~     (p=0.661 n=9+10)
AddMod/mod192/uint256-6  46.6ns ± 1%  46.9ns ± 1%   +0.59%  (p=0.023 n=9+9)
AddMod/mod256/uint256-6  49.8ns ± 1%  19.8ns ± 4%  -60.19%  (p=0.000 n=9+10)
AddMod/small/big-6       44.7ns ± 1%  43.4ns ± 1%   -2.92%  (p=0.000 n=9+9)
AddMod/mod64/big-6       54.8ns ± 1%  53.6ns ± 1%   -2.14%  (p=0.000 n=9+9)
AddMod/mod128/big-6       143ns ± 3%   141ns ± 1%   -1.87%  (p=0.001 n=10+9)
AddMod/mod192/big-6       144ns ± 2%   143ns ± 2%     ~     (p=0.305 n=10+9)
AddMod/mod256/big-6       144ns ± 1%   143ns ± 1%     ~     (p=0.868 n=9+8)

Owner

@holiman holiman left a comment

I think this is OK now. Some follow-up tasks:

  • Adding the MulmodWithReciprocal and Reciprocal methods.
  • We should integrate uint256 with oss-fuzz, and let it run for a while before cutting a new release.
  • go fmt eventually

mod.go Outdated
return z
}

//func onesCount(x *Int) (z int) {
Collaborator

Remove.

mod.go Outdated

if m[0] | m[1] | m[2] | (m[3] & (m[3]-1)) == 0 { // replace with commented code below for general case

// if onesCount(m) <= 1 {
Collaborator

Remove.

@chfast
Collaborator

chfast commented Oct 22, 2021

Haswell 4.4 GHz, go 1.16.

name                     old time/op  new time/op  delta
Mod/small/uint256-8      12.8ns ± 0%  12.7ns ± 0%   -0.49%  (p=0.000 n=8+9)
Mod/small/big-8          42.3ns ± 1%  42.9ns ± 1%   +1.35%  (p=0.000 n=10+9)
Mod/mod64/uint256-8      59.8ns ± 1%  59.8ns ± 0%     ~     (p=0.796 n=9+10)
Mod/mod64/big-8           110ns ± 1%   109ns ± 1%   -0.46%  (p=0.048 n=10+10)
Mod/mod128/uint256-8     85.7ns ± 1%  85.9ns ± 1%     ~     (p=0.243 n=9+10)
Mod/mod128/big-8          242ns ± 1%   242ns ± 1%     ~     (p=0.643 n=10+10)
Mod/mod192/uint256-8     80.2ns ± 0%  80.1ns ± 0%     ~     (p=0.651 n=9+9)
Mod/mod192/big-8          220ns ± 1%   219ns ± 1%     ~     (p=0.306 n=10+10)
Mod/mod256/uint256-8     68.4ns ± 0%  68.3ns ± 0%   -0.15%  (p=0.037 n=9+9)
Mod/mod256/big-8          189ns ± 1%   188ns ± 0%     ~     (p=0.237 n=10+10)
AddMod/small/uint256-8   13.4ns ± 0%  13.6ns ± 0%   +1.68%  (p=0.000 n=9+8)
AddMod/small/big-8       42.1ns ± 3%  41.9ns ± 2%     ~     (p=0.897 n=10+10)
AddMod/mod64/uint256-8   20.5ns ± 0%  20.7ns ± 0%   +1.09%  (p=0.000 n=9+9)
AddMod/mod64/big-8       50.9ns ± 3%  50.5ns ± 1%     ~     (p=0.643 n=10+10)
AddMod/mod128/uint256-8  39.3ns ± 0%  39.7ns ± 0%   +0.84%  (p=0.000 n=8+9)
AddMod/mod128/big-8       120ns ± 1%   119ns ± 1%   -0.63%  (p=0.012 n=10+10)
AddMod/mod192/uint256-8  42.4ns ± 0%  42.8ns ± 0%   +0.89%  (p=0.000 n=9+10)
AddMod/mod192/big-8       122ns ± 0%   122ns ± 1%     ~     (p=0.524 n=8+10)
AddMod/mod256/uint256-8  45.1ns ± 0%  15.8ns ± 0%  -65.01%  (p=0.000 n=8+9)
AddMod/mod256/big-8       124ns ± 0%   124ns ± 1%     ~     (p=0.614 n=10+10)
MulMod/small/uint256-8   33.0ns ± 0%  33.1ns ± 0%   +0.29%  (p=0.000 n=8+8)
MulMod/small/big-8       55.9ns ± 1%  55.7ns ± 1%     ~     (p=0.065 n=10+10)
MulMod/mod64/uint256-8   70.1ns ± 0%  70.3ns ± 1%   +0.38%  (p=0.006 n=8+10)
MulMod/mod64/big-8        101ns ± 1%   101ns ± 1%     ~     (p=0.231 n=9+10)
MulMod/mod128/uint256-8   104ns ± 0%   105ns ± 1%   +0.81%  (p=0.000 n=9+10)
MulMod/mod128/big-8       297ns ± 1%   296ns ± 1%     ~     (p=0.403 n=10+10)
MulMod/mod192/uint256-8   127ns ± 0%   127ns ± 0%     ~     (p=0.297 n=8+9)
MulMod/mod192/big-8       342ns ± 1%   341ns ± 1%     ~     (p=0.285 n=10+9)
MulMod/mod256/uint256-8   160ns ± 0%   149ns ± 0%   -6.46%  (p=0.001 n=8+6)
MulMod/mod256/big-8       390ns ± 1%   388ns ± 1%   -0.44%  (p=0.008 n=10+10)

@chfast
Collaborator

chfast commented Oct 22, 2021

Skylake+ laptop, go 1.16

name                     old time/op  new time/op  delta
Mod/small/uint256-8      12.0ns ± 1%  12.1ns ± 0%   +0.69%  (p=0.005 n=9+9)
Mod/small/big-8          41.1ns ± 1%  42.5ns ± 4%   +3.40%  (p=0.001 n=9+10)
Mod/mod64/uint256-8      56.6ns ± 0%  57.2ns ± 2%     ~     (p=0.106 n=8+10)
Mod/mod64/big-8           109ns ± 6%   107ns ± 3%     ~     (p=0.109 n=10+10)
Mod/mod128/uint256-8     85.9ns ± 1%  86.2ns ± 1%     ~     (p=0.315 n=10+10)
Mod/mod128/big-8          236ns ± 2%   233ns ± 1%     ~     (p=0.093 n=10+10)
Mod/mod192/uint256-8     81.5ns ± 2%  82.9ns ± 4%     ~     (p=0.138 n=10+10)
Mod/mod192/big-8          214ns ± 2%   215ns ± 2%     ~     (p=0.060 n=10+10)
Mod/mod256/uint256-8     68.7ns ± 0%  69.0ns ± 1%     ~     (p=0.305 n=10+8)
Mod/mod256/big-8          189ns ± 2%   193ns ± 2%   +1.98%  (p=0.002 n=10+10)
AddMod/small/uint256-8   13.1ns ± 1%  14.1ns ± 1%   +7.86%  (p=0.000 n=10+9)
AddMod/small/big-8       41.6ns ± 2%  41.9ns ± 2%     ~     (p=0.148 n=10+10)
AddMod/mod64/uint256-8   19.9ns ± 1%  20.5ns ± 1%   +2.65%  (p=0.000 n=10+10)
AddMod/mod64/big-8       49.5ns ± 1%  49.8ns ± 1%     ~     (p=0.128 n=10+10)
AddMod/mod128/uint256-8  39.1ns ± 0%  39.3ns ± 0%   +0.47%  (p=0.000 n=8+10)
AddMod/mod128/big-8       117ns ± 1%   124ns ± 5%   +5.49%  (p=0.000 n=10+10)
AddMod/mod192/uint256-8  42.1ns ± 0%  42.3ns ± 1%     ~     (p=0.062 n=9+10)
AddMod/mod192/big-8       120ns ± 3%   122ns ± 2%   +2.04%  (p=0.002 n=10+10)
AddMod/mod256/uint256-8  44.9ns ± 0%  16.9ns ± 1%  -62.41%  (p=0.000 n=10+10)
AddMod/mod256/big-8       121ns ± 1%   123ns ± 2%   +1.68%  (p=0.000 n=8+10)
MulMod/small/uint256-8   32.5ns ± 1%  32.8ns ± 1%   +0.88%  (p=0.006 n=10+10)
MulMod/small/big-8       54.9ns ± 2%  54.8ns ± 1%     ~     (p=0.971 n=10+10)
MulMod/mod64/uint256-8   69.0ns ± 1%  69.4ns ± 1%   +0.59%  (p=0.035 n=10+10)
MulMod/mod64/big-8        100ns ± 2%   102ns ± 5%     ~     (p=0.055 n=8+10)
MulMod/mod128/uint256-8   105ns ± 1%   106ns ± 0%   +0.31%  (p=0.048 n=10+10)
MulMod/mod128/big-8       291ns ± 2%   290ns ± 3%     ~     (p=0.251 n=9+10)
MulMod/mod192/uint256-8   128ns ± 0%   127ns ± 0%   -0.50%  (p=0.001 n=9+9)
MulMod/mod192/big-8       334ns ± 1%   332ns ± 2%     ~     (p=0.474 n=9+9)
MulMod/mod256/uint256-8   156ns ± 0%   131ns ± 0%  -16.04%  (p=0.000 n=7+10)
MulMod/mod256/big-8       378ns ± 1%   381ns ± 3%     ~     (p=0.605 n=9+9)

// and returns z, using the reciprocal of m provided as the mu parameter.
// Use uint256.Reciprocal to calculate mu from m.
// If m == 0, z is set to 0 (OBS: differs from the big.Int)
func (z *Int) MulModWithReciprocal(x, y, m *Int, mu *[5]uint64) *Int {
Collaborator
This function is not tested at all. I think it should be enough to add it to any existing mulmod test case.

return z.Set(&r)
}

var (
Collaborator

Sorry, but I think I have found another issue. It looks like Reciprocal() actually requires m[3] != 0, although the documentation says otherwise; e.g. try Reciprocal(2). This means this code can never be reached and is currently useless.

Not sure what @holiman thinks about it, but it may be simpler to just put if m[3] == 0 { panic } in Reciprocal().

Contributor Author

It has been modified so that it is limited to m[3] != 0. Hence the first comment inside the function is "Note: specialized for m[3] != 0".
It's not so clear this way, so I'll adjust the comments preceding the function.

Collaborator

@chfast chfast left a comment

I still have some concerns about the API design, but that can easily be changed separately later.

Thanks for the good work, and for your patience with the reviewers.

// - starts with a 32-bit division, refines with newton-raphson iterations
func Reciprocal(m *Int) (mu [5]uint64) {

// Note: specialized for m[3] != 0
if m[3] == 0 {
return mu
Collaborator

I'm not sure this is the best API design to fail silently instead of panic but it depends how this is going to be used.

Contributor Author

I agree it's not very elegant, but this way anyone using it can choose to use Reciprocal and MulModWithReciprocal for any modulus of up to 256 bits. Support for fast operations on shorter moduli can also be added later without any change to client code.

@holiman holiman merged commit 38e2c56 into holiman:master Oct 25, 2021
@holiman
Owner

holiman commented Oct 25, 2021

Thanks all!

chfast added a commit to chfast/intx that referenced this pull request Dec 17, 2021
In elliptic curve context the x and y arguments are already reduced
modulo mod. Here we can pick up similar condition and provide optimized
implementation for such case. This case is 2x faster with little
overhead for other cases.

Based on holiman/uint256#86.
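The already-reduced fast path referred to in this commit avoids division entirely: when x, y < m, the sum x + y is below 2m, so reduction takes a trial subtraction of m and a conditional select. A hedged standalone sketch over four little-endian limbs (not the PR's actual code):

```go
package main

import (
	"fmt"
	"math/bits"
)

// uint256 here stands for a little-endian 4-limb value.
type uint256 = [4]uint64

// addMod assumes x, y < m: the sum is below 2m, so the answer is either
// x+y or x+y-m, and no division is needed.
func addMod(x, y, m *uint256) uint256 {
	var s, t uint256
	var c, b uint64
	s[0], c = bits.Add64(x[0], y[0], 0)
	s[1], c = bits.Add64(x[1], y[1], c)
	s[2], c = bits.Add64(x[2], y[2], c)
	s[3], c = bits.Add64(x[3], y[3], c)
	t[0], b = bits.Sub64(s[0], m[0], 0)
	t[1], b = bits.Sub64(s[1], m[1], b)
	t[2], b = bits.Sub64(s[2], m[2], b)
	t[3], b = bits.Sub64(s[3], m[3], b)
	// If the addition carried out of 256 bits, or the trial subtraction
	// did not borrow, then x+y >= m and the reduced result is t.
	if c != 0 || b == 0 {
		return t
	}
	return s
}

func main() {
	m := uint256{0, 0, 0, 1 << 63} // example modulus 2^255
	x := uint256{^uint64(0), ^uint64(0), ^uint64(0), 1<<63 - 1} // m - 1
	y := uint256{2, 0, 0, 0}
	fmt.Println(addMod(&x, &y, &m)) // (m-1)+2 mod m = 1, i.e. [1 0 0 0]
}
```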