internal/lz4block: Copy literals of <=48 bytes through XMM registers in amd64 decoder #161

greatroar · 2022-01-30T19:25:23Z

Another optimization for the amd64 decoder, inspired by one of its comments:

name                old speed      new speed      delta
UncompressPg1661-8  1.15GB/s ± 1%  1.19GB/s ± 1%   +3.39%  (p=0.000 n=10+10)
UncompressDigits-8  1.89GB/s ± 0%  2.33GB/s ± 1%  +23.46%  (p=0.000 n=9+10)
UncompressTwain-8   1.19GB/s ± 1%  1.23GB/s ± 0%   +3.43%  (p=0.000 n=10+10)
UncompressRand-8    3.93GB/s ± 2%  3.96GB/s ± 1%     ~     (p=0.105 n=10+10)

The effect is most pronounced on Digits because 37.4% of its literals have lengths 17-48. In Twain and Pg1661, this is <4.1%.

This is faster than copying 32 bytes. At 64 bytes, Digits gets faster still, while Twain and Pg1661 get slightly slower.
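For readers who don't want to read the assembly, the idea can be sketched roughly in Go. This is an illustration only, not the code in this PR: `copyLiteral` and `maxFast` are hypothetical names, and the real decoder does the fixed-width copy with XMM loads and stores rather than `copy`. The sketch assumes that at least 48 bytes are available in both buffers before taking the fast path.

```go
package main

import "fmt"

// maxFast is the fixed copy width discussed above: 48 bytes, which is
// three 16-byte XMM registers. The name is hypothetical.
const maxFast = 48

// copyLiteral copies the first n bytes of src into dst and reports whether
// the fixed-width fast path was taken. It is a hypothetical helper that
// assumes n <= len(src) and n <= len(dst).
func copyLiteral(dst, src []byte, n int) bool {
	if n <= maxFast && len(src) >= maxFast && len(dst) >= maxFast {
		// Fast path: always move 48 bytes, so no loop over n is needed.
		// Bytes n..47 of dst are scratch; a decoder would overwrite them
		// when it writes the next sequence.
		copy(dst[:maxFast], src[:maxFast])
		return true
	}
	// Fallback: exact-length copy for long literals or short buffers.
	copy(dst[:n], src[:n])
	return false
}

func main() {
	src := []byte("0123456789012345678901234567890123456789012345678901234567890123")
	dst := make([]byte, 64)
	fast := copyLiteral(dst, src, 20)
	fmt.Println(fast, string(dst[:20]))
}
```

The trade-off behind the 32/48/64 comparison is visible here: a wider fixed copy covers more literal lengths without looping, but wastes more work on short literals.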

greatroar force-pushed the amd64-match-copy-48 branch from 04f2583 to 257c664 Compare January 30, 2022 19:44

greatroar changed the title ~~internal/lz4block: Copy literals of <=48 bytes through XMM~~ internal/lz4block: Copy literals of <=48 bytes through XMM registers in amd64 decoder Jan 30, 2022

pierrec merged commit 6bd757c into pierrec:v4 Jan 31, 2022

greatroar deleted the amd64-match-copy-48 branch January 31, 2022 19:48
