
Optimise multiplication #295

Merged
4 commits merged on Apr 24, 2024
Conversation

@joelonsql (Contributor) commented Apr 16, 2024

Hi maintainers of rust-num/num-bigint,

I've been hacking on PostgreSQL's numeric.c lately, implementing the Karatsuba algorithm [1].

During this work, I realised that when splitting at half the larger factor's size, if the smaller factor is shorter than that, its high part is zero. A variable that is always zero usually means a formula can be simplified, so I plugged the zero high1 into the Karatsuba formula, and voilà: that eliminated a multiplication and most of the additions/subtractions, so the middle term needs only one multiplication and an addition. This simplification felt too obvious to be novel; I've looked around a bit but haven't found it so far. It should be mentioned in the Karatsuba Wikipedia article or somewhere similar, I think.
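The algebra can be sketched with a toy model (single machine words and a hypothetical base `b`; the names `half_karatsuba`, `low1`, `high2`, `low2` are illustrative, not num-bigint's actual code). With high1 = 0, Karatsuba's z2 = high1·high2 vanishes and the middle term collapses to low1·high2, so the whole product is low1·high2·b + low1·low2:

```rust
// Toy sketch of the Half-Karatsuba identity. With high1 == 0:
//   x * y = (low1 * high2) * b + low1 * low2
// i.e. two smaller multiplications and one shifted addition instead of
// the three multiplications of a full Karatsuba step.
fn half_karatsuba(low1: u64, high2: u64, low2: u64, b: u64) -> u64 {
    low1 * high2 * b + low1 * low2
}

fn main() {
    let b = 10_000u64;                // illustrative base
    let low1 = 123;                   // small factor x (its high part is 0)
    let (high2, low2) = (5678, 4321); // large factor y = high2 * b + low2
    let y = high2 * b + low2;
    assert_eq!(half_karatsuba(low1, high2, low2, b), low1 * y);
    println!("identity holds");
}
```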

I of course also checked Rust's implementation, and it seems to benefit from this trick as well, hence this pull request.

It would be fun to hear what you think.

Benchmark:

    -test multiply_0           ... bench:          42 ns/iter (+/- 0)
    +test multiply_0           ... bench:          42 ns/iter (+/- 0)
    -test multiply_1           ... bench:       4,271 ns/iter (+/- 68)
    +test multiply_1           ... bench:       4,242 ns/iter (+/- 164)
    -test multiply_2           ... bench:     289,339 ns/iter (+/- 10,532)
    +test multiply_2           ... bench:     288,857 ns/iter (+/- 8,715)
    -test multiply_3           ... bench:     663,387 ns/iter (+/- 17,381)
    +test multiply_3           ... bench:     583,635 ns/iter (+/- 14,997)
    -test multiply_4           ... bench:       7,474 ns/iter (+/- 384)
    +test multiply_4           ... bench:       6,649 ns/iter (+/- 164)
    -test multiply_5           ... bench:      16,376 ns/iter (+/- 254)
    +test multiply_5           ... bench:      13,922 ns/iter (+/- 323)

Thanks for the great work!

/Joel

[1] https://www.postgresql.org/message-id/flat/7f95163f-2019-4416-a042-6e2141619e5d@app.fastmail.com

Introduces Half-Karatsuba in multiplication module for cases where there is a
significant size disparity between factors. Optimizes performance by evenly
splitting the larger factor for more balanced multiplication calculations.

@cuviper (Member) left a comment:
Thanks! This does seem like a very straightforward improvement -- almost too obvious (like you said), but I can't find any fault with it. 😄

    // Add temp shifted by m2 to the accumulator.
    // This simulates the effect of multiplying temp by b^m2:
    // add directly starting at index m2 in the accumulator.
    add2(&mut acc[m2..], &p.data);
@cuviper (Member) commented on the code above:

Since each product is immediately added to acc, I think we don't even need the p buffer at all:

        mac3(acc, x, low2);
        mac3(&mut acc[m2..], x, high2);

It doesn't change the benchmark times for me, but still, it's an easy allocation to avoid.
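As a sanity check on this in-place accumulation, here is a self-contained toy version; the names `mac_sketch`/`mac_schoolbook` and the base-2^32 digit representation are hypothetical stand-ins, not the crate's actual `mac3`. The half step adds x·low2 and x·high2 directly into overlapping regions of the accumulator, with no temporary product buffer, and agrees with plain schoolbook multiplication:

```rust
// Schoolbook multiply-accumulate: acc += x * y, least-significant digit first.
// Digits are base-2^32 values stored in u64 slots.
fn mac_schoolbook(acc: &mut [u64], x: &[u32], y: &[u32]) {
    for (i, &xi) in x.iter().enumerate() {
        let mut carry = 0u64;
        for (j, &yj) in y.iter().enumerate() {
            let t = acc[i + j] + xi as u64 * yj as u64 + carry;
            acc[i + j] = t & 0xFFFF_FFFF;
            carry = t >> 32;
        }
        // Propagate the remaining carry so every digit stays below 2^32.
        let mut k = i + y.len();
        while carry > 0 {
            let t = acc[k] + carry;
            acc[k] = t & 0xFFFF_FFFF;
            carry = t >> 32;
            k += 1;
        }
    }
}

// Half-Karatsuba step: split the longer y at m2 and add both products
// straight into the accumulator, the second one shifted by m2 digits.
fn mac_sketch(acc: &mut [u64], x: &[u32], y: &[u32]) {
    debug_assert!(x.len() <= y.len());
    if x.len() * 2 <= y.len() && y.len() > 4 {
        let m2 = y.len() / 2;
        let (low2, high2) = y.split_at(m2);
        mac_sketch(acc, x, low2);             // acc += x * low2
        mac_sketch(&mut acc[m2..], x, high2); // acc += x * high2 * b^m2
    } else {
        mac_schoolbook(acc, x, y);
    }
}

fn main() {
    let x: Vec<u32> = (1..=3u32).map(|i| 0x1357_9BDFu32.wrapping_mul(i)).collect();
    let y: Vec<u32> = (1..=8u32).map(|i| 0xDEAD_BEEFu32.wrapping_mul(i)).collect();
    let (mut a, mut b) = (vec![0u64; 11], vec![0u64; 11]);
    mac_sketch(&mut a, &x, &y);
    mac_schoolbook(&mut b, &x, &y);
    assert_eq!(a, b);
    println!("half step matches schoolbook");
}
```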

@joelonsql (Contributor, Author) replied:

Right! Very nice. Thanks, I've pushed the simplification.

@joelonsql (Contributor, Author) added:

Sorry about the format error due to trailing newline, fixed and pushed.

@cuviper (Member) commented Apr 17, 2024

I suspect the reason this isn't normally considered is that it seems to add a layer of recursive multiplications. That is, this "half" step recurses to two "full" Karatsuba steps with the same x.len(), though halved y.len() in each. Maybe the reduced y still works in our favor in further recursion, or perhaps it has less obvious algorithmic effects like cache locality. Your new step would also intercept larger sizes that were headed for the Toom-3 branch, so maybe it's helping more there.

Whatever it is, the benchmarks do favor this change. I'm just trying to think if there are untested scenarios that this will make worse.

@joelonsql (Contributor, Author) commented Apr 18, 2024

The pseudo-code in the Wikipedia article on Karatsuba for some reason splits on the longer factor:

    /* Calculates the size of the numbers. */
    m = max(size_base10(num1), size_base10(num2))
    m2 = floor(m / 2) 

I noticed this code:

        // When x is smaller than y, it's significantly faster to pick b such that x is split in
        // half, not y:
        let b = x.len() / 2;

It seems that, with the Half-Karatsuba step in place, it doesn't matter whether the full Karatsuba step splits x or y in half: I get almost identical benchmark results, to the point that it's hard to tell which is faster.

The reason full Karatsuba without the half-step is slower when splitting on y than on x is that one of the multiplications becomes meaningless, since x0 is zero.

I think splitting the longer y in half-step(s) leads to fewer and more balanced smaller multiplications than splitting the shorter x and doing a full Karatsuba step.

I've done quite a lot of benchmarking on my PostgreSQL numeric mul_var() patch, but not for num-bigint, so it would be good to have a benchmark that covers a large range of smaller factors times a large range of larger factors, to better understand the effect of different factor lengths and length ratios on different architectures.

In my PostgreSQL patch benchmark, I've measured the performance_ratio, defined as the execution time of the current implementation divided by the execution time of the patched version, by exposing both implementations in the same runtime; this allows increasing the number of executions exponentially until the performance_ratio converges.
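A minimal sketch of that convergence loop, under stated assumptions: `old_impl` and `new_impl` are hypothetical stand-ins for the two exposed implementations, and the doubling/tolerance constants are illustrative, not the actual PostgreSQL harness.

```rust
use std::hint::black_box;
use std::time::Instant;

// Time `iters` runs of a closure, in seconds.
fn time_it(f: &mut dyn FnMut(), iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed().as_secs_f64()
}

// Double the iteration count until two successive estimates of
// performance_ratio = t_current / t_patched agree within 5%.
fn performance_ratio(old_impl: &mut dyn FnMut(), new_impl: &mut dyn FnMut()) -> f64 {
    let mut iters = 16u32;
    let mut prev = f64::NAN; // NaN never compares close, so we loop at least twice
    loop {
        let ratio = time_it(old_impl, iters) / time_it(new_impl, iters);
        if (ratio - prev).abs() <= 0.05 * ratio || iters >= 1 << 20 {
            return ratio;
        }
        prev = ratio;
        iters = iters.saturating_mul(2);
    }
}

fn main() {
    // Dummy workloads standing in for the two multiplication implementations.
    let mut old_impl = || {
        black_box((0..2000u64).fold(0, |a, b| a ^ b.wrapping_mul(a | 1)));
    };
    let mut new_impl = || {
        black_box((0..1000u64).fold(0, |a, b| a ^ b.wrapping_mul(a | 1)));
    };
    let r = performance_ratio(&mut old_impl, &mut new_impl);
    assert!(r.is_finite() && r > 0.0);
    println!("converged performance_ratio = {r:.2}");
}
```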

I think it would be fun to work together on developing a very capable benchmark suite for num-bigint, if you're up for it. For inspiration, I've attached a plot I created for my PostgreSQL patch, using dynamic programming to find the optimal threshold function, which as you can see is not trivial. The threshold line runs between the blue and magenta areas. The black line segment is a manually chosen threshold line, which I think captures the interesting performance-gain area without causing any significant regressions on the most prioritised architectures.

[image: threshold plot from the PostgreSQL mul_var() patch]

Important: The above plot is not for this patch, it's from my separate PostgreSQL patch, included here just for inspiration.

@joelonsql (Contributor, Author) commented Apr 22, 2024

Hi @cuviper,

I've now done benchmarks across varying sizes of BigUint. Here's what I found: when the factor sizes are identical, there's no noticeable difference in the performance ratio. It's when we introduce a length disparity between factors that the performance ratio starts to climb, highlighting Half-Karatsuba's efficiency in handling size asymmetry.

Interestingly, keeping the larger factor constant and varying the smaller factor shows that the performance_ratio increases with the length disparity, but only to a certain point. Beyond this point, it begins to taper off, until it approaches 1.0 when factor sizes are equal.

This seems encouraging. Eager to hear your thoughts.

The benchmark below was produced using https://github.com/joelonsql/rust-timeit/blob/master/examples/num_bigint_half_karatsuba.rs

[image: performance-ratio surface over b_exp and c_exp]

As shown in the plot, b_exp and c_exp have been varied from 1 to 17.

@joelonsql (Contributor, Author) added:

Another benchmark with factors up to 1 << 24 bit_size

[image: performance-ratio plot, factors up to 1 << 24 bits]

@joelonsql (Contributor, Author) added:
Here is another benchmark with 256 x 256 measurements, on a linear scale, and with the number of big_digits as the axes.

[image: 256 x 256 performance-ratio heatmap, linear scale, axes in big_digits]

@joelonsql (Contributor, Author) replied to @cuviper:

> I'm just trying to think if there are untested scenarios that this will make worse.

To address this, I've examined the data points, with the lowest performance ratios from the last 49152 measurements:

b_big_digits | c_big_digits | performance_ratio
------------ | ------------ | -----------------
432          | 688          | 0.813122
16           | 16           | 0.956026
32           | 1616         | 0.972652
176          | 160          | 0.976072
32           | 96           | 0.977745

I then reran these tests with higher precision and additional runs to ensure reliability.

Here are the refined results:

b_big_digits | c_big_digits |   performance_ratio
------------ | ------------ | ---------------------
432          | 688          | 0.997860
16           | 16           | 1.021872
32           | 1616         | 1.000112
176          | 160          | 0.993211
32           | 96           | 1.016820

The average of these is approximately 1.006, which suggests that the initial observations were statistical anomalies.

My conclusion from all of this is that I'm now pretty convinced the Half-Karatsuba step doesn't cause any measurable performance regression for any combination of factor sizes, while giving significant performance gains over a substantial part of the total space of factor size combinations.

@cuviper (Member) commented Apr 24, 2024

That's incredibly thorough, thank you! I am quite satisfied. 😄

Another thing about the anomalies: one of the first things mac3 does is swap its arguments to ensure x.len() <= y.len(). So the results really should be mirrored along the y = x diagonal, and the few points that look otherwise are almost certainly noise. For example, it looks like that purple point is your (432, 688), but if it were real, it should have been similar at (688, 432). I'm glad your reruns also support this, or else it would seem rather strange...

@cuviper cuviper added this pull request to the merge queue Apr 24, 2024
Merged via the queue into rust-num:master with commit 06b61c8 Apr 24, 2024
4 checks passed