
Add unpack8, unpack16, unpack64 (#2276) ~10-50% faster #2278

Merged
6 commits merged into apache:master on Aug 3, 2022

Conversation

@tustvold (Contributor) commented Aug 2, 2022

Which issue does this PR close?

Closes #2276

Rationale for this change

Eliminates a load of unsafe code, and allows unpacking into the native type instead of always having to go via u32.

I'm not confident the generated code is necessarily optimal, but godbolt would suggest that LLVM at least takes a decent stab at using the bit-shuffling instructions - https://rust.godbolt.org/z/1aMdKT8qc
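
For illustration, a minimal sketch of what "unpacking into the native type" means (this is just the concept for a bit width of 1, not the code in this PR; the function name is a placeholder):

// Eight 1-bit values are unpacked straight into a [u8; 8], with no
// intermediate u32 buffer and no unsafe.
fn unpack8_width1(input: u8, output: &mut [u8; 8]) {
    for i in 0..8 {
        // Parquet bit-packing is LSB-first within each byte.
        output[i] = (input >> i) & 1;
    }
}

fn main() {
    let mut out = [0u8; 8];
    unpack8_width1(0b1010_1010, &mut out);
    assert_eq!(out, [0, 1, 0, 1, 0, 1, 0, 1]);
}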

What changes are included in this PR?

Are there any user-facing changes?

@github-actions bot added the parquet (Changes to the parquet crate) label on Aug 2, 2022
@tustvold marked this pull request as ready for review on August 2, 2022 at 16:45
@tustvold (Contributor, Author) commented Aug 2, 2022

Roughly a ~10% performance improvement, which isn't bad given that half the objective was cleaning up the code, with an eventual view to implementing #2257. The major return for DeltaBitPackDecoder would likely only be realizable with #2282 and larger miniblock sizes.

Performance vs master

arrow_array_reader/Int32Array/plain encoded, mandatory, no NULLs                                                                             
                        time:   [4.8459 us 4.8589 us 4.8732 us]
                        change: [+0.3572% +0.8566% +1.3081%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/Int32Array/plain encoded, optional, no NULLs                                                                             
                        time:   [22.337 us 22.347 us 22.358 us]
                        change: [-1.0612% -0.7802% -0.5283%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int32Array/plain encoded, optional, half NULLs                                                                             
                        time:   [28.798 us 28.815 us 28.833 us]
                        change: [-13.045% -12.928% -12.803%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int32Array/binary packed, mandatory, no NULLs                                                                             
                        time:   [33.001 us 33.015 us 33.031 us]
                        change: [-18.056% -18.023% -17.989%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int32Array/binary packed, optional, no NULLs                                                                             
                        time:   [50.711 us 50.729 us 50.747 us]
                        change: [-12.294% -12.252% -12.205%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int32Array/binary packed, optional, half NULLs                                                                             
                        time:   [43.718 us 43.753 us 43.792 us]
                        change: [-15.172% -15.098% -15.016%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/Int32Array/dictionary encoded, mandatory, no NULLs                                                                             
                        time:   [33.403 us 33.410 us 33.419 us]
                        change: [-7.3073% -7.2057% -7.0230%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  5 (5.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/Int32Array/dictionary encoded, optional, no NULLs                                                                             
                        time:   [51.100 us 51.116 us 51.133 us]
                        change: [-4.6051% -4.5621% -4.5166%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int32Array/dictionary encoded, optional, half NULLs                                                                             
                        time:   [44.215 us 44.235 us 44.255 us]
                        change: [-9.9963% -9.9351% -9.8587%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

arrow_array_reader/Int64Array/plain encoded, mandatory, no NULLs                                                                             
                        time:   [8.0967 us 8.1230 us 8.1534 us]
                        change: [+0.7474% +1.1695% +1.5384%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
arrow_array_reader/Int64Array/plain encoded, optional, no NULLs                                                                             
                        time:   [25.602 us 25.631 us 25.665 us]
                        change: [-1.4309% -1.0332% -0.6523%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int64Array/plain encoded, optional, half NULLs                                                                             
                        time:   [31.643 us 31.672 us 31.700 us]
                        change: [+14.966% +15.048% +15.132%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  10 (10.00%) high mild
arrow_array_reader/Int64Array/binary packed, mandatory, no NULLs                                                                            
                        time:   [62.860 us 62.874 us 62.891 us]
                        change: [-1.3340% -1.2969% -1.2569%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
arrow_array_reader/Int64Array/binary packed, optional, no NULLs                                                                            
                        time:   [80.933 us 80.982 us 81.059 us]
                        change: [-0.8099% -0.7311% -0.6512%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
arrow_array_reader/Int64Array/binary packed, optional, half NULLs                                                                            
                        time:   [58.719 us 58.774 us 58.832 us]
                        change: [+6.0864% +6.1605% +6.2268%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int64Array/dictionary encoded, mandatory, no NULLs                                                                             
                        time:   [35.035 us 35.049 us 35.065 us]
                        change: [-12.831% -12.790% -12.750%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
arrow_array_reader/Int64Array/dictionary encoded, optional, no NULLs                                                                            
                        time:   [52.687 us 52.703 us 52.720 us]
                        change: [-8.9727% -8.9393% -8.9011%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int64Array/dictionary encoded, optional, half NULLs                                                                             
                        time:   [46.217 us 46.235 us 46.255 us]
                        change: [+4.6787% +4.7709% +4.8563%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs                                                                            
                        time:   [175.32 us 175.44 us 175.56 us]
                        change: [-3.4189% -3.2578% -3.0783%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/StringArray/plain encoded, optional, no NULLs                                                                            
                        time:   [194.47 us 194.59 us 194.74 us]
                        change: [-3.1158% -3.0118% -2.8869%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
arrow_array_reader/StringArray/plain encoded, optional, half NULLs                                                                            
                        time:   [205.21 us 205.27 us 205.34 us]
                        change: [-6.1702% -6.0441% -5.9480%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
arrow_array_reader/StringArray/dictionary encoded, mandatory, no NULLs                                                                            
                        time:   [133.67 us 133.89 us 134.13 us]
                        change: [-4.3816% -4.2028% -4.0310%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringArray/dictionary encoded, optional, no NULLs                                                                            
                        time:   [153.30 us 153.54 us 153.79 us]
                        change: [-1.8329% -1.6595% -1.5029%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
arrow_array_reader/StringArray/dictionary encoded, optional, half NULLs                                                                            
                        time:   [183.34 us 183.49 us 183.67 us]
                        change: [-6.1044% -6.0061% -5.8991%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

arrow_array_reader/StringDictionary/dictionary encoded, mandatory, no NULLs                                                                             
                        time:   [26.463 us 26.471 us 26.480 us]
                        change: [-0.4894% -0.4345% -0.3755%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/StringDictionary/dictionary encoded, optional, no NULLs                                                                             
                        time:   [43.910 us 43.924 us 43.941 us]
                        change: [-0.0280% +0.0426% +0.1125%] (p = 0.24 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/StringDictionary/dictionary encoded, optional, half NULLs                                                                             
                        time:   [49.736 us 49.766 us 49.803 us]
                        change: [+3.9807% +4.0557% +4.1406%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

For completeness, performance vs master with -C target-cpu=native. So only ~5% but not a regression which is the important thing.

arrow_array_reader/Int32Array/plain encoded, mandatory, no NULLs                                                                             
                        time:   [4.7553 us 4.7598 us 4.7661 us]
                        change: [-0.0821% +0.7986% +1.5839%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
arrow_array_reader/Int32Array/plain encoded, optional, no NULLs                                                                             
                        time:   [21.556 us 21.620 us 21.690 us]
                        change: [-0.9754% -0.5451% -0.0853%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe
arrow_array_reader/Int32Array/plain encoded, optional, half NULLs                                                                             
                        time:   [28.930 us 28.942 us 28.957 us]
                        change: [-12.993% -12.935% -12.879%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int32Array/binary packed, mandatory, no NULLs                                                                             
                        time:   [30.420 us 30.429 us 30.440 us]
                        change: [+2.5445% +2.7910% +2.9347%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe
arrow_array_reader/Int32Array/binary packed, optional, no NULLs                                                                             
                        time:   [47.153 us 47.164 us 47.175 us]
                        change: [+1.0496% +1.4538% +1.8614%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe
arrow_array_reader/Int32Array/binary packed, optional, half NULLs                                                                             
                        time:   [42.822 us 42.856 us 42.897 us]
                        change: [-7.1747% -6.9019% -6.5972%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int32Array/dictionary encoded, mandatory, no NULLs                                                                             
                        time:   [30.892 us 30.899 us 30.905 us]
                        change: [-9.6399% -9.4767% -9.2311%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
arrow_array_reader/Int32Array/dictionary encoded, optional, no NULLs                                                                             
                        time:   [47.705 us 47.719 us 47.733 us]
                        change: [-6.2662% -6.0597% -5.9385%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int32Array/dictionary encoded, optional, half NULLs                                                                             
                        time:   [43.416 us 43.432 us 43.449 us]
                        change: [-9.5642% -9.3380% -9.2034%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

arrow_array_reader/Int64Array/plain encoded, mandatory, no NULLs                                                                             
                        time:   [7.9014 us 7.9765 us 8.0715 us]
                        change: [-2.1869% -1.2440% -0.1997%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  17 (17.00%) high severe
arrow_array_reader/Int64Array/plain encoded, optional, no NULLs                                                                             
                        time:   [25.072 us 25.171 us 25.303 us]
                        change: [-3.4804% -2.0406% -0.7844%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe
arrow_array_reader/Int64Array/plain encoded, optional, half NULLs                                                                             
                        time:   [30.868 us 30.886 us 30.907 us]
                        change: [-10.822% -10.526% -10.232%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int64Array/binary packed, mandatory, no NULLs                                                                            
                        time:   [62.747 us 62.765 us 62.787 us]
                        change: [+0.1408% +0.4562% +0.7639%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int64Array/binary packed, optional, no NULLs                                                                            
                        time:   [79.580 us 79.602 us 79.628 us]
                        change: [+0.4857% +0.6106% +0.8242%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  9 (9.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int64Array/binary packed, optional, half NULLs                                                                            
                        time:   [57.406 us 57.415 us 57.424 us]
                        change: [-8.3594% -8.0473% -7.7786%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int64Array/dictionary encoded, mandatory, no NULLs                                                                             
                        time:   [32.319 us 32.323 us 32.328 us]
                        change: [-9.0431% -9.0036% -8.9690%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
arrow_array_reader/Int64Array/dictionary encoded, optional, no NULLs                                                                             
                        time:   [49.277 us 49.289 us 49.305 us]
                        change: [-5.9239% -5.6357% -5.3767%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int64Array/dictionary encoded, optional, half NULLs                                                                             
                        time:   [44.122 us 44.141 us 44.163 us]
                        change: [-10.843% -10.591% -10.341%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs                                                                            
                        time:   [176.52 us 176.59 us 176.65 us]
                        change: [-7.9276% -7.8375% -7.7034%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/StringArray/plain encoded, optional, no NULLs                                                                            
                        time:   [194.73 us 194.83 us 194.94 us]
                        change: [-8.9202% -8.6621% -8.4009%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  12 (12.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/StringArray/plain encoded, optional, half NULLs                                                                            
                        time:   [218.35 us 218.44 us 218.55 us]
                        change: [-3.4216% -3.1171% -2.7915%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/StringArray/dictionary encoded, mandatory, no NULLs                                                                            
                        time:   [135.36 us 135.61 us 135.90 us]
                        change: [-2.4626% -1.9884% -1.5822%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
arrow_array_reader/StringArray/dictionary encoded, optional, no NULLs                                                                            
                        time:   [153.32 us 153.48 us 153.67 us]
                        change: [-1.2992% -1.0836% -0.7954%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
arrow_array_reader/StringArray/dictionary encoded, optional, half NULLs                                                                            
                        time:   [197.32 us 197.47 us 197.64 us]
                        change: [+1.4295% +1.6590% +1.8209%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

arrow_array_reader/StringDictionary/dictionary encoded, mandatory, no NULLs                                                                             
                        time:   [25.246 us 25.256 us 25.265 us]
                        change: [-3.9833% -3.8835% -3.7605%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/StringDictionary/dictionary encoded, optional, no NULLs                                                                             
                        time:   [42.021 us 42.028 us 42.037 us]
                        change: [-2.7292% -2.4821% -2.3167%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
arrow_array_reader/StringDictionary/dictionary encoded, optional, half NULLs                                                                             
                        time:   [48.923 us 48.940 us 48.959 us]
                        change: [+4.7617% +5.0899% +5.3886%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

@tustvold changed the title from "Add unpack8, unpack16, unpack64 (#2276)" to "Add unpack8, unpack16, unpack64 (#2276) ~10% faster" on Aug 2, 2022
@tustvold (Contributor, Author) commented Aug 2, 2022

Running the benchmarks with -C target-cpu=native and with the block size of DeltaBitPackEncoder bumped up from 128 to 1024, we get a ~50% performance improvement for 64-bit integers:

arrow_array_reader/Int32Array/binary packed, mandatory, no NULLs                                                                             
                        time:   [23.948 us 23.964 us 23.981 us]
                        change: [-5.5829% -5.2823% -4.9443%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int32Array/binary packed, optional, no NULLs                                                                             
                        time:   [40.763 us 40.771 us 40.781 us]
                        change: [-3.8441% -3.5619% -3.2927%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
arrow_array_reader/Int32Array/binary packed, optional, half NULLs                                                                             
                        time:   [48.342 us 48.366 us 48.391 us]
                        change: [+9.3416% +9.6807% +10.033%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

arrow_array_reader/Int64Array/binary packed, mandatory, no NULLs                                                                             
                        time:   [23.154 us 23.162 us 23.172 us]
                        change: [-56.875% -56.752% -56.626%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe
arrow_array_reader/Int64Array/binary packed, optional, no NULLs                                                                             
                        time:   [39.882 us 39.890 us 39.898 us]
                        change: [-43.358% -43.288% -43.164%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  9 (9.00%) high mild
  5 (5.00%) high severe
arrow_array_reader/Int64Array/binary packed, optional, half NULLs                                                                             
                        time:   [48.638 us 48.670 us 48.704 us]
                        change: [-16.193% -15.929% -15.647%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

I wonder if we should just bump the default block size to 256 for 64-bit integers 🤔

@viirya (Member) left a comment

cc @sunchao These new bit-pack utils look useful to us.

)
};

unroll!(for i in 0..$bits {
@tustvold (Contributor, Author):

LLVM was very reluctant to unroll this loop; this is not all that surprising, given it isn't immediately obvious how much it collapses down. We therefore give it a helping hand 😄


//! Vectorised bit-packing utilities

macro_rules! unroll_impl {
@tustvold (Contributor, Author):

This is kind of gross, but macros can't actually do maths, so we "emulate" it by building an expression tree of "1 + 1 + ...". This will all get compiled away.

@alamb (Contributor):

Can we get some comments in this macro so that future readers can have a hope of understanding what it does without hours of study ;)

Starting with inlining the comments from the PR is probably a good start :)

@tustvold changed the title from "Add unpack8, unpack16, unpack64 (#2276) ~10% faster" to "Add unpack8, unpack16, unpack64 (#2276) ~10%-50% faster" on Aug 2, 2022
@tustvold changed the title from "Add unpack8, unpack16, unpack64 (#2276) ~10%-50% faster" to "Add unpack8, unpack16, unpack64 (#2276) ~10-50% faster" on Aug 2, 2022
unroll_impl!(1, $offset + 1, $v, $c);
}};
(4, $offset:expr, $v:ident, $c:block) => {{
unroll_impl!(2, 0, __v4, { unroll_impl!(2, __v4 * 2 + $offset, $v, $c) });
Member:

Hmm, where does __v4 come from? It doesn't seem to be a macro parameter?

@tustvold (Contributor, Author):

It's an identifier created by this macro instantiation and then passed into the child. This was the only way I could work out to do arithmetic in a macro; it's kind of wild 😅

@tustvold (Contributor, Author):

For reference, this is what it expands to:

unroll!(for i in 0..4 {
  vec.push(i);
});
#[allow(non_upper_case_globals)]
{
    {
        {
            {
                const __v4: usize = 0;
                {
                    {
                        {
                            const i: usize = (__v4 * 2 + 0);
                            {
                                vec.push(i);
                            }
                        }
                        {
                            const i: usize = ((__v4 * 2 + 0) + 1);
                            {
                                vec.push(i);
                            }
                        }
                    }
                }
            }
            {
                const __v4: usize = (0 + 1);
                {
                    {
                        {
                            const i: usize = (__v4 * 2 + 0);
                            {
                                vec.push(i);
                            }
                        }
                        {
                            const i: usize = ((__v4 * 2 + 0) + 1);
                            {
                                vec.push(i);
                            }
                        }
                    }
                }
            }
        }
    }
}

Member:

Probably worth adding some comments for future reference :)

@alamb (Contributor) left a comment

I think this looks great. Really nice work @tustvold. I reviewed the changes and test coverage 👍

I especially like that this deletes code and makes it faster:

[Screenshot: "Screen Shot 2022-08-02 at 2 01 15 PM"]

I think we should wait a few days before merging this to let @sunchao and/or @nevi-me review, if they would like.



use super::*;
use rand::{thread_rng, Rng};

#[inline(never)]
Contributor:

Why this annotation?

@tustvold (Contributor, Author):

Oops, a holdover from some godbolt shenanigans; will remove.

///
/// - force unrolling of a loop so that LLVM optimises it properly
/// - call a const generic function with the loop value
macro_rules! unroll {
Contributor:

An example might help to document this -- btw, the use of matching patterns for unrolling i in 0..end is quite neat.

let val = r(end_byte);
let b = val << (NUM_BITS - end_bit_offset);

output[i] = a | (b & mask);
Contributor:

I wonder if we could somehow remove the output[i] bounds check too? Perhaps with an iterator or append or something 🤔

@tustvold (Contributor, Author):

As output is a fixed-length slice, the bounds check is automatically elided.

Contributor:

Ah, good call; I was confusing that with input: &[u8] 🤦
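
A minimal sketch of the elision being discussed (illustrative only, not the PR's code): because output is a fixed-size array reference and the loop bound is a compile-time constant, the compiler can prove every index is in range and drop the panic branch, unlike indexing into an arbitrary &[u8] slice.

// Indexing a &mut [T; N] with an index provably < N needs no runtime bounds check.
fn fill(output: &mut [u32; 32]) {
    for i in 0..32 {
        // i < 32 is statically known, so no bounds-check branch is emitted here.
        output[i] = i as u32;
    }
}

fn main() {
    let mut buf = [0u32; 32];
    fill(&mut buf);
    assert_eq!(buf[31], 31);
}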

_ => unreachable!(),
}

// Try to read smaller batches if possible
Contributor:

Suggested change:
- // Try to read smaller batches if possible
+ // Try to read remainder, if any, in batches if possible

@@ -476,17 +477,6 @@ impl BitReader {

let mut i = 0;

Contributor:

It isn't clear to me what the num_bits input parameter means (is it the number of bits to read after batch.len() whole T values?) -- can you possibly update the comments too?

@tustvold (Contributor, Author):

I originally renamed this to bit_width, but we seem to use num_bits as the name for this concept in a lot of places. I've updated the doc comment.

match i {
0..=8 => test_get_batch_helper::<u8>(*s, i),
9..=16 => test_get_batch_helper::<u16>(*s, i),
_ => test_get_batch_helper::<u32>(*s, i),
17..=32 => test_get_batch_helper::<u32>(*s, i),
Contributor:

👍

@viirya (Member) left a comment

It's great that this reduces the code a lot. Thanks for the work. It would be nicer if we could add some comments, as @alamb suggested, since at first look the macros look complicated.

@sunchao (Member) commented Aug 2, 2022

Thanks, this looks very nice. I'll take a look too today. BTW I think it's possible to apply SIMD on this code path to further improve the performance. Arrow C++ already did it, see https://issues.apache.org/jira/browse/ARROW-9702 and related.

@tustvold (Contributor, Author) commented Aug 2, 2022

> BTW I think it's possible to apply SIMD on this code path to further improve the performance

The motivation for the manual unrolling of the loops is so that LLVM does this for us. If you check the godbolt output linked in the PR you'll see it is using AVX2 instructions. This fits with what we've seen in general with the arrow compute kernels, where only AVX512 seems to need manual implementation.

Comment on lines 35 to 39
(16, $offset:expr, $v:ident, $c:block) => {{
unroll_impl!(4, 0, __v16, {
unroll_impl!(4, __v16 * 4 + $offset, $v, $c)
});
}};
Member:

Can we use unroll_impl!(8, ...) here?

/// Macro that generates an unpack function taking the number of bits as a const generic
macro_rules! unpack_impl {
($t:ty, $bytes:literal, $bits:tt) => {
pub fn unpack<const NUM_BITS: usize>(input: &[u8], output: &mut [$t; $bits]) {
Member:

👍 When the original version was written, const generics were not stabilized yet.


/// Unpack packed `input` into `output` with a bit width of `num_bits`
pub fn $name(input: &[u8], output: &mut [$t; $bits], num_bits: usize) {
// This will get optimised into a jump table
unroll!(for i in 0..$bits {
Member:

I feel it might be worth checking out https://docs.rs/seq-macro/latest/seq_macro/; the generated output is easier to understand.

For example the output of unpack8 becomes:

        pub fn unpack8(input: &[u8], output: &mut [u8; 8], num_bits: usize) {
            if 0 == num_bits { return unpack8::unpack::<0>(input, output); }
            if 1 == num_bits { return unpack8::unpack::<1>(input, output); }
            if 2 == num_bits { return unpack8::unpack::<2>(input, output); }
            if 3 == num_bits { return unpack8::unpack::<3>(input, output); }
            if 4 == num_bits { return unpack8::unpack::<4>(input, output); }
            if 5 == num_bits { return unpack8::unpack::<5>(input, output); }
            if 6 == num_bits { return unpack8::unpack::<6>(input, output); }
            if 7 == num_bits { return unpack8::unpack::<7>(input, output); };
            if num_bits == 8 { return unpack8::unpack::<8>(input, output); }
            ::core::panicking::panic_fmt(::core::fmt::Arguments::new_v1(&["internal error: entered unreachable code: "],
                    &[::core::fmt::ArgumentV1::new_display(&::core::fmt::Arguments::new_v1(&["invalid num_bits "],
                                            &[::core::fmt::ArgumentV1::new_display(&num_bits)]))]));
        }

instead of

        /// Unpack packed `input` into `output` with a bit width of `num_bits`
        pub fn unpack8(input: &[u8], output: &mut [u8; 8], num_bits: usize) {

            #[allow(non_upper_case_globals)]
            {
                {
                    {
                        {
                            const __v8: usize = 0;
                            {
                                {
                                    {
                                        {
                                            const __v4: usize = 0;
                                            {
                                                {
                                                    {
                                                        const i: usize = __v4 * 2 + (__v8 * 4 + 0);
                                                        {
                                                            if i == num_bits {
                                                                    return unpack8::unpack::<i>(input, output);
                                                                }
                                                        }
                                                    };
                                                    {
                                                        const i: usize = __v4 * 2 + (__v8 * 4 + 0) + 1;
                                                        {
                                                            if i == num_bits {
                                                                    return unpack8::unpack::<i>(input, output);
                                                                }
                                                        }
                                                    };
                                                }
                                            }
                                        };
                                        {
                                            const __v4: usize = 0 + 1;
                                            {
                                                {
                                                    {
                                                        const i: usize = __v4 * 2 + (__v8 * 4 + 0);
                                                        {
                                                            if i == num_bits {
                                                                    return unpack8::unpack::<i>(input, output);
                                                                }
                                                        }
                                                    };
                                                    {
                                                        const i: usize = __v4 * 2 + (__v8 * 4 + 0) + 1;
                                                        {
                                                            if i == num_bits {
                                                                    return unpack8::unpack::<i>(input, output);
                                                                }
                                                        }
                                                    };
                                                }
                                            }
                                        };
                                    };
                                }
                            }
                        };
                        {
                            const __v8: usize = 0 + 1;
                            {
                                {
                                    {
                                        {
                                            const __v4: usize = 0;
                                            {
                                                {
                                                    {
                                                        const i: usize = __v4 * 2 + (__v8 * 4 + 0);
                                                        {
                                                            if i == num_bits {
                                                                    return unpack8::unpack::<i>(input, output);
                                                                }
                                                        }
                                                    };
                                                    {
                                                        const i: usize = __v4 * 2 + (__v8 * 4 + 0) + 1;
                                                        {
                                                            if i == num_bits {
                                                                    return unpack8::unpack::<i>(input, output);
                                                                }
                                                        }
                                                    };
                                                }
                                            }
                                        };
                                        {
                                            const __v4: usize = 0 + 1;
                                            {
                                                {
                                                    {
                                                        const i: usize = __v4 * 2 + (__v8 * 4 + 0);
                                                        {
                                                            if i == num_bits {
                                                                    return unpack8::unpack::<i>(input, output);
                                                                }
                                                        }
                                                    };
                                                    {
                                                        const i: usize = __v4 * 2 + (__v8 * 4 + 0) + 1;
                                                        {
                                                            if i == num_bits {
                                                                    return unpack8::unpack::<i>(input, output);
                                                                }
                                                        }
                                                    };
                                                }
                                            }
                                        };
                                    };
                                }
                            }
                        };
                    };
                }
            };
            if num_bits == 8 { return unpack8::unpack::<8>(input, output); }
            ::core::panicking::panic_fmt(::core::fmt::Arguments::new_v1(&["internal error: entered unreachable code: "],
                    &[::core::fmt::ArgumentV1::new_display(&::core::fmt::Arguments::new_v1(&["invalid num_bits "],
                                            &[::core::fmt::ArgumentV1::new_display(&num_bits)]))]));
        }

Member:

And you may no longer need the hack in unroll_impl.

@tustvold (Contributor, Author):

I wanted to avoid introducing a new dependency, as that has historically been controversial.

@jhorstmann (Contributor):

I think in this case using a dependency could be preferable and lead to even faster code. A macro relying on implementation details of how identifiers are generated is indeed a bit ugly. I know that parquet2 uses the bitpacking crate, which uses optimized vectorized implementations without relying too much on LLVM.

@tustvold (Contributor, Author) commented Aug 3, 2022

I looked at the bitpacking crate, but it only supports the equivalent of unpack32, which it achieves in much the same way by relying on LLVM to optimise it. I'll switch this to using seq_macro, as unroll appears to be the major sticking point.
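
For reference, a sketch of what the num_bits dispatch could look like with seq-macro (module and function names here are placeholders, not necessarily the merged code; assumes the seq-macro crate as a dependency):

use seq_macro::seq;

mod unpack8_impl {
    // Stand-in for the const-generic unpacker generated by unpack_impl!.
    pub fn unpack<const NUM_BITS: usize>(_input: &[u8], output: &mut [u8; 8]) {
        // The real implementation shifts and masks NUM_BITS-wide values here.
        output.fill(0);
    }
}

/// Unpack packed `input` into `output` with a bit width of `num_bits`.
pub fn unpack8(input: &[u8], output: &mut [u8; 8], num_bits: usize) {
    // seq! repeats the body once per value of N, substituting N as a literal,
    // so each branch calls the const-generic unpack with a compile-time width.
    seq!(N in 0..=8 {
        if num_bits == N {
            return unpack8_impl::unpack::<N>(input, output);
        }
    });
    unreachable!("invalid num_bits {}", num_bits);
}

Callers are unchanged from the hand-rolled unroll! version: they pass the runtime num_bits, and the chain of equality checks gets optimised into a jump table.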

@Dandandan (Contributor) left a comment

🎉

@alamb (Contributor) left a comment

I say 🚢 🇮🇹 and I will keep my 🤐 about adding new dependencies -- I think seq-macro is ok given what I see of https://crates.io/crates/seq-macro

@sunchao (Member) left a comment

LGTM too. seq-macro doesn't carry any transitive dependency and it seems well maintained, so I think it's fine to add it.

@tustvold merged commit 8a092e3 into apache:master on Aug 3, 2022
@tustvold (Contributor, Author) commented Aug 3, 2022

Thank you all for reviewing and for the feedback 😃

@ursabot commented Aug 3, 2022

Benchmark runs are scheduled for baseline = 2cf4cd8 and contender = 8a092e3. 8a092e3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@tustvold (Contributor, Author) commented Aug 4, 2022

#2319 bumps the defaults to make the most of this change, PTAL 😄

jorgecarleitao added a commit to jorgecarleitao/parquet2 that referenced this pull request on Aug 15, 2022:

This PR ports apache/arrow-rs#2278 to parquet2. Credit for the design and implementation of the unpacking path goes to @tustvold - it is 5-10% faster than the bitpacking crate 🚀
Additionally, it adds the corresponding packing code path, thereby completely replacing the dependency on bitpacking.
It also adds some traits that allow code to be written via generics.
A curious observation is that, with this PR, parquet2 no longer executes unsafe code (bitpacking had some) 🎉
Backward-incompatible changes:

renamed parquet2::encoding::bitpacking to parquet2::encoding::bitpacked
parquet2::encoding::bitpacked::Decoder now has a generic parameter (output type)
parquet2::encoding::bitpacked::Decoder::new's second parameter is now a usize