Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use SmallVec to optimize for small integer sizes. #210

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

lemmih
Copy link

@lemmih lemmih commented Jul 28, 2021

This is a proof of concept. I'd like to discuss possible designs and pathways to getting something like this merged.

Motivation

Using BigInt with mostly small (<= word-sized) integers is prohibitively expensive. One numerically intense algorithm from rgeometry is 1000 times slower when using BigRational instead of a fixed-precision number.

Performance

num-bigint performs quite well for large input and is competitive with the GMP. Detailed benchmarks can be found here: https://github.com/Lemmih/num-criterion

So why do small integers perform so poorly? Mostly because allocating new memory is significantly more expensive than doing a single addition or multiplication. There are also a few cases where we can use specialized algorithms. For example, there's a particularly fast gcd algorithm that only works on word-sized integers.

SmallVec with two inlined BigDigits

BigUint is a Vec<BigDigit>. As such, it has a size of 3 words in addition to the actual digits. Switching to a SmallVec<[BigDigit; 2]> will not increase the size of a BigUint but allows for two BigDigits to be inlined and thus not require a heap allocation.

Summary of benchmarks:

I've run benchmarks on an M1 Apple MacBook Air and a 3950X AMD desktop. The results I get from these platforms are quite different and I'd love it if other people could help out by running the benchmark on their machines. For instructions, see: https://github.com/Lemmih/num-criterion

  • GCD: Slightly faster across all bit sizes. This is mainly because I wrote a more efficient version of shr_assign. The previous version interacted poorly with SmallVec.
  • Mul: Significantly faster for bit sizes <=64. Slightly slower for 128-bit integers. No significant difference for larger integers.
  • Div: Significantly faster for bit sizes <=128. No significant difference for larger integers.
  • Add: Significantly faster for bit sizes <=128. No significant difference for larger integers.
  • Clone: Significantly faster for bit sizes <=128. No significant difference for larger integers.

Other optimizations

Specialize for integers that fit in 128 bits.

Ideally, addition, multiplication, and division for integers that fit in 128 bits would be as fast as cloning. Copying the bytes is significantly slower than doing the arithmetic.

  • Signed integer addition/subtraction is quite naive. It is comparing the absolute numbers before subtracting/adding. This has to be improved before specialization will have any noticeable effect. Surely we can get within 2x of GMP.

Use better GCD implementation.

The GCD implementation in num-integer is fairly slow and can be significantly improved by changing a line or two. On my system, this leads to a 10x speed improvement for BigInts when the integer fits in 64 bits.

Other oddities

  • Rug and Ramp are really good at cloning data. How can they clone large integers at more than twice the speed of num-bigint? Aren't we all just calling 'memcpy'?

@lemmih
Copy link
Author

lemmih commented Jul 28, 2021

This PR is the same as the previous one with the exception of a few optimisations. The optimisations weren't necessary; they just proved that the SmallVec backend could be as efficient as Vec.

@lemmih
Copy link
Author

lemmih commented Jul 28, 2021

Oh, and I added back the 'shr' optimisation because the regression was particularly egregious.

@lemmih
Copy link
Author

lemmih commented Aug 10, 2021

@cuviper Thoughts?

@cuviper
Copy link
Member

cuviper commented Aug 27, 2021

How much of a difference does the union make? I prefer that conceptually because we don't increase the size that way, but in past experiments ISTR it had mixed impact on performance overall.

If the union performance is good, then I think I'd prefer having a feature that toggles between Vec or SmallVec (w/ union), where the latter requires MSRV 1.49, but we'd keep the same struct size either way. I think the global type alias would make toggling easy, as long as we're not using any API that happens to be different between them. What do you think?

@lemmih
Copy link
Author

lemmih commented Aug 28, 2021

The performance difference between union and no-union is minor and is dwarfed by the performance boost from avoiding heap allocations.

I'll update the PR to include a feature switch for SmallVec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants