Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement number theoretic transform for large integer multiplication #282

Open
wants to merge 73 commits into
base: master
Choose a base branch
from

Conversation

byeongkeunahn
Copy link

@byeongkeunahn byeongkeunahn commented Aug 28, 2023

This commit implements number theoretic transform (NTT) for large integer multiplication (issue #169).

  • To simplify implementation the Schönhage–Strassen algorithm was not used. Instead, three distinct 64-bit primes were carefully chosen to enable NTT up to ~10^18 64bit integers, which allows multiplication up to ~5 x 10^17 64bit integers. Depending on the input length either two or three primes are used, with the latter only used when the inputs consist of at least 2^40 64bit integers. The convolution results modulo primes are merged using the Chinese Remainder Theorem.
  • To reduce padding and the number of cycles for NTT, multiple radices are used (radix-2, radix-3, radix-4, radix-5, radix-6). Radix-8 is desirable but actually slower, presumably due to register spill. Although manual SIMD coding may alleviate this issue, it was not used (i) for maximum portability and (ii) since 64bit SIMD multiply is not widely available even on x86/x64 platforms (AVX512). The prime moduli are carefully chosen to support these radices.
    • The NTT length is selected by exhaustively evaluating cost estimates for all allowed lengths within a factor of two.
  • Single-word (u64) Montgomery reduction is used for fast modular multiplication.
  • 32bit digits are supported by repacking the digits into u64, running the u64 algorithm, and converting back to u32. This results in 32bit builds being about 3-5x slower compared to 64bit builds, which, however, still is an improvement upon the existing algorithms.
  • Unbalanced multiplication is enabled when the cost estimates are favorable.
  • Based on experimentation, the following thresholds are chosen:
    • For u64 digits, switch to NTT if the shorter integer has at least 512 digits.
    • For u32 digits, switch to NTT if the shorter integer has at least 2,048 digits.
  • The three primes are as follows:
    • P1 = 10_237_243_632_176_332_801, Max NTT length = 2^24 * 3^20 * 5^2 = 1_462_463_376_025_190_400
    • P2 = 13_649_658_176_235_110_401, Max NTT length = 2^26 * 3^19 * 5^2 = 1_949_951_168_033_587_200
    • P3 = 14_259_017_916_245_606_401, Max NTT length = 2^22 * 3^21 * 5^2 = 1_096_847_532_018_892_800
    • P1 and P2 are used for the two-prime NTT, whereas the three-prime NTT uses all three.
  • The following MIT-licensed projects are used as reference:

On Ryzen 7 2700X, 64bit, it takes about 15ms for 2.7Mbits x 2.7Mbits and 170ms for 27Mbits x 27Mbits multiplication. This seems comparable to GMP 6.2.1.

benchmark-20230829

Previously 3 primes were used, which was suboptimal in terms of speed. Currently, the threshold for switching from 2 to 3 primes is 2^38.
Despite the simple implementation with obvious inefficiencies (e.g., not reusing the NTT of the shorter array), this leads to speed gains in multiple benchmarks, although there is a small regression in others.
@byeongkeunahn byeongkeunahn changed the title Implement number theroetic transform for large integer multiplication Implement number theoretic transform for large integer multiplication Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants