
Optimise multiplication #295

Merged
4 commits merged on Apr 24, 2024
Conversation

@joelonsql (Contributor) commented Apr 16, 2024

Hi maintainers of rust-num/num-bigint,

I've been hacking on PostgreSQL's numeric.c lately, implementing the Karatsuba algorithm [1].

During this work, I realised that when splitting at half the larger factor's size, if the smaller factor is shorter than that, its high part is zero. A variable that is always zero usually means a formula can be simplified, so I plugged the zero high1 into the Karatsuba formula, and voilà: that eliminated a multiplication and most of the additions/subtractions, so the middle term needs only one multiplication and an addition. This simplification felt too obvious to be novel; I've looked around a bit but haven't found it so far. It should be mentioned in the Karatsuba Wikipedia article or somewhere similar, I think.
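The algebra can be sketched with a toy model (single machine words and a hypothetical base `b`; the names `half_karatsuba`, `low1`, `high2`, `low2` are illustrative, not num-bigint's actual code). With high1 = 0, Karatsuba's z2 = high1·high2 vanishes and the middle term collapses to low1·high2, so the whole product is low1·high2·b + low1·low2:

```rust
// Toy sketch of the Half-Karatsuba identity. With high1 == 0:
//   x * y = (low1 * high2) * b + low1 * low2
// i.e. two smaller multiplications and one shifted addition instead of
// the three multiplications of a full Karatsuba step.
fn half_karatsuba(low1: u64, high2: u64, low2: u64, b: u64) -> u64 {
    low1 * high2 * b + low1 * low2
}

fn main() {
    let b = 10_000u64;                // illustrative base
    let low1 = 123;                   // small factor x (its high part is 0)
    let (high2, low2) = (5678, 4321); // large factor y = high2 * b + low2
    let y = high2 * b + low2;
    assert_eq!(half_karatsuba(low1, high2, low2, b), low1 * y);
    println!("identity holds");
}
```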

I of course also checked Rust's implementation, and it seems to benefit from this trick as well, hence this pull request.

It would be fun to hear what you think.

Benchmark:

    -test multiply_0           ... bench:          42 ns/iter (+/- 0)
    +test multiply_0           ... bench:          42 ns/iter (+/- 0)
    -test multiply_1           ... bench:       4,271 ns/iter (+/- 68)
    +test multiply_1           ... bench:       4,242 ns/iter (+/- 164)
    -test multiply_2           ... bench:     289,339 ns/iter (+/- 10,532)
    +test multiply_2           ... bench:     288,857 ns/iter (+/- 8,715)
    -test multiply_3           ... bench:     663,387 ns/iter (+/- 17,381)
    +test multiply_3           ... bench:     583,635 ns/iter (+/- 14,997)
    -test multiply_4           ... bench:       7,474 ns/iter (+/- 384)
    +test multiply_4           ... bench:       6,649 ns/iter (+/- 164)
    -test multiply_5           ... bench:      16,376 ns/iter (+/- 254)
    +test multiply_5           ... bench:      13,922 ns/iter (+/- 323)

Thanks for the great work!

/Joel

[1] https://www.postgresql.org/message-id/flat/7f95163f-2019-4416-a042-6e2141619e5d@app.fastmail.com

Introduces Half-Karatsuba in multiplication module for cases where there is a
significant size disparity between factors. Optimizes performance by evenly
splitting the larger factor for more balanced multiplication calculations.

@cuviper (Member) left a comment:
Thanks! This does seem like a very straightforward improvement -- almost too obvious (like you said), but I can't find any fault with it. 😄

    // Add temp shifted by m2 to the accumulator.
    // This simulates the effect of multiplying temp by b^m2:
    // add directly starting at index m2 in the accumulator.
    add2(&mut acc[m2..], &p.data);
@cuviper (Member) commented on the code above:

Since each product is immediately added to acc, I think we don't even need the p buffer at all:

        mac3(acc, x, low2);
        mac3(&mut acc[m2..], x, high2);

It doesn't change the benchmark times for me, but still, it's an easy allocation to avoid.
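As a sanity check on this in-place accumulation, here is a self-contained toy version; the names `mac_sketch`/`mac_schoolbook` and the base-2^32 digit representation are hypothetical stand-ins, not the crate's actual `mac3`. The half step adds x·low2 and x·high2 directly into overlapping regions of the accumulator, with no temporary product buffer, and agrees with plain schoolbook multiplication:

```rust
// Schoolbook multiply-accumulate: acc += x * y, least-significant digit first.
// Digits are base-2^32 values stored in u64 slots.
fn mac_schoolbook(acc: &mut [u64], x: &[u32], y: &[u32]) {
    for (i, &xi) in x.iter().enumerate() {
        let mut carry = 0u64;
        for (j, &yj) in y.iter().enumerate() {
            let t = acc[i + j] + xi as u64 * yj as u64 + carry;
            acc[i + j] = t & 0xFFFF_FFFF;
            carry = t >> 32;
        }
        // Propagate the remaining carry so every digit stays below 2^32.
        let mut k = i + y.len();
        while carry > 0 {
            let t = acc[k] + carry;
            acc[k] = t & 0xFFFF_FFFF;
            carry = t >> 32;
            k += 1;
        }
    }
}

// Half-Karatsuba step: split the longer y at m2 and add both products
// straight into the accumulator, the second one shifted by m2 digits.
fn mac_sketch(acc: &mut [u64], x: &[u32], y: &[u32]) {
    debug_assert!(x.len() <= y.len());
    if x.len() * 2 <= y.len() && y.len() > 4 {
        let m2 = y.len() / 2;
        let (low2, high2) = y.split_at(m2);
        mac_sketch(acc, x, low2);             // acc += x * low2
        mac_sketch(&mut acc[m2..], x, high2); // acc += x * high2 * b^m2
    } else {
        mac_schoolbook(acc, x, y);
    }
}

fn main() {
    let x: Vec<u32> = (1..=3u32).map(|i| 0x1357_9BDFu32.wrapping_mul(i)).collect();
    let y: Vec<u32> = (1..=8u32).map(|i| 0xDEAD_BEEFu32.wrapping_mul(i)).collect();
    let (mut a, mut b) = (vec![0u64; 11], vec![0u64; 11]);
    mac_sketch(&mut a, &x, &y);
    mac_schoolbook(&mut b, &x, &y);
    assert_eq!(a, b);
    println!("half step matches schoolbook");
}
```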

@joelonsql (Contributor, Author) replied:

Right! Very nice. Thanks, I've pushed the simplification.

@joelonsql (Contributor, Author) added:

Sorry about the format error due to trailing newline, fixed and pushed.

@cuviper (Member) commented Apr 17, 2024

I suspect the reason this isn't normally considered is that it seems to add a layer of recursive multiplications. That is, this "half" step recurses to two "full" Karatsuba steps with the same x.len(), though halved y.len() in each. Maybe the reduced y still works in our favor in further recursion, or perhaps it has less obvious algorithmic effects like cache locality. Your new step would also intercept larger sizes that were headed for the Toom-3 branch, so maybe it's helping more there.

Whatever it is, the benchmarks do favor this change. I'm just trying to think if there are untested scenarios that this will make worse.

@joelonsql (Contributor, Author) commented Apr 18, 2024

The pseudo-code in the Wikipedia article on Karatsuba for some reason splits on the longer factor:

    /* Calculates the size of the numbers. */
    m = max(size_base10(num1), size_base10(num2))
    m2 = floor(m / 2) 

I noticed this code:

        // When x is smaller than y, it's significantly faster to pick b such that x is split in
        // half, not y:
        let b = x.len() / 2;

It seems that, with the Half-Karatsuba step in place, it doesn't matter whether the full Karatsuba step splits x or y in half: I get almost identical benchmark results, to the point that it's hard to tell which is faster.

The reason full Karatsuba without the half-step is slower when splitting on y than on x is that one of the multiplications becomes meaningless, since x0 is zero.

I think splitting the longer y in half-step(s) leads to fewer and more balanced smaller multiplications than splitting the shorter x and doing a full Karatsuba step.

I've done quite a lot of benchmarking on my PostgreSQL numeric mul_var() patch, but not for num-bigint, so it would be good to have a benchmark that covers a large range of smaller factors times a large range of larger factors, to better understand the effect of different factor lengths and length ratios on different architectures.

In my PostgreSQL patch benchmark, I've measured the performance_ratio, defined as the execution time of the current implementation divided by the execution time of the patched version, by exposing both implementations in the same runtime; this allows increasing the number of executions exponentially until the performance_ratio converges.
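A minimal sketch of that convergence loop, under stated assumptions: `old_impl` and `new_impl` are hypothetical stand-ins for the two exposed implementations, and the doubling/tolerance constants are illustrative, not the actual PostgreSQL harness.

```rust
use std::hint::black_box;
use std::time::Instant;

// Time `iters` runs of a closure, in seconds.
fn time_it(f: &mut dyn FnMut(), iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed().as_secs_f64()
}

// Double the iteration count until two successive estimates of
// performance_ratio = t_current / t_patched agree within 5%.
fn performance_ratio(old_impl: &mut dyn FnMut(), new_impl: &mut dyn FnMut()) -> f64 {
    let mut iters = 16u32;
    let mut prev = f64::NAN; // NaN never compares close, so we loop at least twice
    loop {
        let ratio = time_it(old_impl, iters) / time_it(new_impl, iters);
        if (ratio - prev).abs() <= 0.05 * ratio || iters >= 1 << 20 {
            return ratio;
        }
        prev = ratio;
        iters = iters.saturating_mul(2);
    }
}

fn main() {
    // Dummy workloads standing in for the two multiplication implementations.
    let mut old_impl = || {
        black_box((0..2000u64).fold(0, |a, b| a ^ b.wrapping_mul(a | 1)));
    };
    let mut new_impl = || {
        black_box((0..1000u64).fold(0, |a, b| a ^ b.wrapping_mul(a | 1)));
    };
    let r = performance_ratio(&mut old_impl, &mut new_impl);
    assert!(r.is_finite() && r > 0.0);
    println!("converged performance_ratio = {r:.2}");
}
```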

I think it would be fun to work together on developing a very capable benchmark suite for num-bigint, if you're up for it. For inspiration, I've attached a plot I created for my PostgreSQL patch, using dynamic programming to find the optimal threshold function, which as you can see is not trivial. The threshold line runs between the blue and magenta areas. The black line segment is a manually chosen threshold line, which I think captures the interesting performance-gain area without causing any significant regressions on the most prioritised architectures.

[image: threshold plot from the PostgreSQL mul_var() patch]

Important: The above plot is not for this patch, it's from my separate PostgreSQL patch, included here just for inspiration.

@joelonsql (Contributor, Author) commented Apr 22, 2024

Hi @cuviper,

I've now done benchmarks across varying sizes of BigUint. Here's what I found: when the factor sizes are identical, there's no noticeable difference in the performance ratio. It's when we introduce a length disparity between factors that the performance ratio starts to climb, highlighting Half-Karatsuba's efficiency in handling size asymmetry.

Interestingly, keeping the larger factor constant and varying the smaller factor shows that the performance_ratio increases with the length disparity, but only to a certain point. Beyond this point, it begins to taper off, until it approaches 1.0 when factor sizes are equal.

This seems encouraging. Eager to hear your thoughts.

The benchmark below was produced using https://github.com/joelonsql/rust-timeit/blob/master/examples/num_bigint_half_karatsuba.rs

[image: performance-ratio surface over b_exp and c_exp]

As shown in the plot, b_exp and c_exp have been varied from 1 to 17.

@joelonsql (Contributor, Author) added:

Another benchmark with factors up to 1 << 24 bit_size

[image: performance-ratio plot, factors up to 1 << 24 bits]

@joelonsql (Contributor, Author) added:
Here is another benchmark with 256 x 256 measurements, on a linear scale, and with the number of big_digits as the axes.

[image: 256 x 256 performance-ratio heatmap, linear scale, axes in big_digits]

@joelonsql (Contributor, Author) replied to @cuviper:

> I'm just trying to think if there are untested scenarios that this will make worse.

To address this, I've examined the data points, with the lowest performance ratios from the last 49152 measurements:

b_big_digits | c_big_digits | performance_ratio
------------ | ------------ | -----------------
432          | 688          | 0.813122
16           | 16           | 0.956026
32           | 1616         | 0.972652
176          | 160          | 0.976072
32           | 96           | 0.977745

I then reran these tests with higher precision and additional runs to ensure reliability.

Here are the refined results:

b_big_digits | c_big_digits |   performance_ratio
------------ | ------------ | ---------------------
432          | 688          | 0.997860
16           | 16           | 1.021872
32           | 1616         | 1.000112
176          | 160          | 0.993211
32           | 96           | 1.016820

The average of these is approximately 1.006, which suggests that the initial observations were statistical anomalies.

My conclusion from all of this is that I'm now pretty convinced the Half-Karatsuba step doesn't cause any measurable performance regression for any combination of factor sizes, while giving significant performance gains over a substantial part of the total space of factor size combinations.

@cuviper (Member) commented Apr 24, 2024

That's incredibly thorough, thank you! I am quite satisfied. 😄

Another thing about the anomalies: one of the first things mac3 does is swap its arguments to ensure x.len() <= y.len(). So the results really should be mirrored along the y = x diagonal, and the few points that look otherwise are almost certainly noise. For example, it looks like that purple point is your (432, 688), but if it were real, it should have been similar at (688, 432). I'm glad your reruns also support this, or else it would seem rather strange...

@cuviper cuviper added this pull request to the merge queue Apr 24, 2024
Merged via the queue into rust-num:master with commit 06b61c8 Apr 24, 2024
4 checks passed