Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SIMD implementations if/when std::simd is stable #83

Open
akhilles opened this issue Feb 4, 2023 · 4 comments
Open

Add SIMD implementations if/when std::simd is stable #83

akhilles opened this issue Feb 4, 2023 · 4 comments

Comments

@akhilles
Copy link
Collaborator

akhilles commented Feb 4, 2023

Blocked by rust-lang/rust#86656.

@snakehand
Copy link

If you want to get significant speedup with SIMD, (especially on x86) you should implement the algorithm using carry-less multiplications.

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

I can help giving pointers how to do this practically.

@KillingSpark
Copy link
Contributor

KillingSpark commented Dec 6, 2023

Hi @snakehand I've been thinking about this a bit, sadly the paper you linked isn't available anymore (at least not under the link and a quick google only turns up a whitepaper about carry less mult for galois counter mode) but am I right if I think the carry-less multiplication would be used somewhat like this? (Ignoring reflection etc etc for conciseness)

fn crc(poly: u64, crc: u64, bytes: &[u8]) -> u64 {
    let mut idx = 0;
    while bytes.len() - idx >= 8 {
        let next_data = load_u64(bytes, idx);
        let multiplicated: u128 = carry_less_mult(crc ^ next_data, poly);
        // Question: Are the higher bits of any significance anymore? 
        // They are equivalent to what we shift out / throw away in the "normal" implementations right?
        crc = lower_bits(multiplicated);
        idx += 8;
    }
    // deal with remainder
    crc
}

@snakehand
Copy link

snakehand commented Dec 6, 2023

The document is available here :

https://github.com/tpn/pdfs/blob/master/Fast%20CRC%20Computation%20for%20Generic%20Polynomials%20Using%20PCLMULQDQ%20Instruction%20-%20Intel%20(December%2C%202009).pdf

The speedup comes from using the carryless multiplication in bigger data units, and using Barett reduction to compute the final smaller CRC.

@KillingSpark
Copy link
Contributor

Started doing preliminary work here, no simd yet just understanding the algorithm: https://github.com/KillingSpark/crc-rs/tree/clmul

Interestingly enough this is ~2x faster than the current table-less implementation even without any real thought on optimization and especially with the lack of any simd. Might be worth using this even if the simd instructions aren't available for a specific target.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants