Use SmallVec to optimize for small integer sizes. #210

lemmih · 2021-07-28T01:56:18Z

This is a proof of concept. I'd like to discuss possible designs and pathways to getting something like this merged.

Motivation

Using BigInt with mostly small (<= word-sized) integers is prohibitively expensive. One numerically intense algorithm from rgeometry is 1000 times slower when using BigRational instead of a fixed-precision number.

Performance

num-bigint performs quite well for large input and is competitive with the GMP. Detailed benchmarks can be found here: https://github.com/Lemmih/num-criterion

So why do small integers perform so poorly? Mostly because allocating new memory is significantly more expensive than doing a single addition or multiplication. There are also a few cases where we can use specialized algorithms. For example, there's a particularly fast gcd algorithm that only works on word-sized integers.

SmallVec with two inlined BigDigits

BigUint is a Vec<BigDigit>. As such, it has a size of 3 words in addition to the actual digits. Switching to a SmallVec<[BigDigit; 2]> will not increase the size of a BigUint but allows for two BigDigits to be inlined and thus not require a heap allocation.

Summary of benchmarks:

I've run benchmarks on an M1 Apple MacBook Air and a 3950X AMD desktop. The results I get from these platforms are quite different and I'd love it if other people could help out by running the benchmark on their machines. For instructions, see: https://github.com/Lemmih/num-criterion

GCD: Slightly faster across all bit sizes. This is mainly because I wrote a more efficient version of shr_assign. The previous version interacted poorly with SmallVec.
Mul: Significantly faster for bit sizes <=64. Slightly slower for 128-bit integers. No significant difference for larger integers.
Div: Significantly faster for bit sizes <=128. No significant difference for larger integers.
Add: Significantly faster for bit sizes <=128. No significant difference for larger integers.
Clone: Significantly faster for bit sizes <=128. No significant difference for larger integers.

Other optimizations

Specialize for integers that fit in 128 bits.

Ideally, addition, multiplication, and division for integers that fit in 128 bits would be as fast as cloning. Copying the bytes is significantly slower than doing the arithmetic.

Signed integer addition/subtraction is quite naive. It is comparing the absolute numbers before subtracting/adding. This has to be improved before specialization will have any noticeable effect. Surely we can get within 2x of GMP.

Use better GCD implementation.

The GCD implementation in num-integer is fairly slow and can be significantly improved by changing a line or two. On my system, this leads to a 10x speed improvement for BigInts when the integer fits in 64 bits.

Other oddities

Rug and Ramp are really good at cloning data. How can they clone large integers at more than twice the speed of num-bigint? Aren't we all just calling 'memcpy'?

lemmih · 2021-07-28T02:07:30Z

This PR is the same as the previous one with the exception of a few optimisations. The optimisations weren't necessary; they just proved that the SmallVec backend could be as efficient as Vec.

lemmih · 2021-07-28T02:09:13Z

Oh, and I added back the 'shr' optimisation because the regression was particularly egregious.

lemmih · 2021-08-10T05:41:31Z

@cuviper Thoughts?

cuviper · 2021-08-27T23:47:05Z

How much of a difference does the union make? I prefer that conceptually because we don't increase the size that way, but in past experiments ISTR it had mixed impact on performance overall.

If the union performance is good, then I think I'd prefer having a feature that toggles between Vec or SmallVec (w/ union), where the latter requires MSRV 1.49, but we'd keep the same struct size either way. I think the global type alias would make toggling easy, as long as we're not using any API that happens to be different between them. What do you think?

lemmih · 2021-08-28T07:03:23Z

The performance difference between union and no-union is minor and is dwarfed by the performance boost from avoiding heap allocations.

I'll update the PR to include a feature switch for SmallVec.

lemmih added 3 commits July 28, 2021 09:52

SmallVec backend for BigUint.

51a7148

Fix bad performance with SmallVec in 'shr_assign'.

8a0dcc2

Abstract the SmallVec vector backend.

a0a92a9

Make 'smallvec' an optional dependency with 'union' enabled.

a0523f5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use SmallVec to optimize for small integer sizes. #210

Use SmallVec to optimize for small integer sizes. #210

lemmih commented Jul 28, 2021

lemmih commented Jul 28, 2021

lemmih commented Jul 28, 2021

lemmih commented Aug 10, 2021

cuviper commented Aug 27, 2021

lemmih commented Aug 28, 2021 •

edited

Use SmallVec to optimize for small integer sizes. #210

Are you sure you want to change the base?

Use SmallVec to optimize for small integer sizes. #210

Conversation

lemmih commented Jul 28, 2021

Motivation

Performance

SmallVec with two inlined BigDigits

Summary of benchmarks:

Other optimizations

Specialize for integers that fit in 128 bits.

Use better GCD implementation.

Other oddities

lemmih commented Jul 28, 2021

lemmih commented Jul 28, 2021

lemmih commented Aug 10, 2021

cuviper commented Aug 27, 2021

lemmih commented Aug 28, 2021 • edited

lemmih commented Aug 28, 2021 •

edited