
Branchless varint decoding #14050

Merged · 3 commits merged into netty:4.1 on May 14, 2024

Conversation

franz1981 (Contributor)

Motivation:

varint decoding falls short when the input length is not predictable

Modification:

Implements branchless and batched varint decoding

Result:

Better performance when the input length is not predictable
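For orientation, here is a minimal sketch of the branchless fast path discussed in this thread, reconstructed from the snippets quoted below; it is not the exact code that was merged, and the byte-by-byte slow path merely stands in for the PR's readRawVarint40:

import io.netty.buffer.ByteBuf;
import io.netty.handler.codec.CorruptedFrameException;

final class BranchlessVarintSketch {

    // Fast path: assumes at least 4 readable bytes, as in the MEDIUM benchmark case.
    static int readRawVarint32(ByteBuf buffer) {
        // peek 4 bytes at once, little-endian, without moving readerIndex yet
        int wholeOrMore = buffer.getIntLE(buffer.readerIndex());
        // a byte with the continuation bit (0x80) cleared terminates the varint;
        // this isolates the first such "stop" byte
        int firstOneOnStop = ~wholeOrMore & 0x80808080;
        if (firstOneOnStop == 0) {
            // the varint spans more than 4 bytes: the PR delegates to readRawVarint40,
            // here a plain loop stands in for it
            return readRawVarint32SlowPath(buffer);
        }
        int bitsToKeep = Integer.numberOfTrailingZeros(firstOneOnStop) + 1;
        buffer.skipBytes(bitsToKeep >> 3);
        // BLSMSK-style mask: 1s at and below the first stop bit, 0s above it
        int thisVarintMask = firstOneOnStop ^ (firstOneOnStop - 1);
        int wholeWithContinuations = wholeOrMore & thisVarintMask;
        // fold the four 7-bit payloads together while dropping the continuation bits
        int w = (wholeWithContinuations & 0x7F007F) | ((wholeWithContinuations & 0x7F007F00) >> 1);
        return (w & 0x3FFF) | ((w & 0x3FFF0000) >> 2);
    }

    private static int readRawVarint32SlowPath(ByteBuf buffer) {
        int result = 0;
        for (int shift = 0; shift < 32; shift += 7) {
            byte b = buffer.readByte();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new CorruptedFrameException("malformed varint");
    }
}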

@franz1981 (Contributor, Author) commented May 12, 2024

These are the results of the benchmark on my Ryzen box

Benchmark                                   (inputDistribution)  (inputs)  Mode  Cnt  Score   Error  Units
VarintDecodingBenchmark.oldReadRawVarint32                SMALL         1  avgt   10  1.547 ± 0.021  ns/op
VarintDecodingBenchmark.oldReadRawVarint32                SMALL       128  avgt   10  2.460 ± 0.023  ns/op
VarintDecodingBenchmark.oldReadRawVarint32                SMALL    128000  avgt   10  6.519 ± 0.092  ns/op
VarintDecodingBenchmark.oldReadRawVarint32                LARGE         1  avgt   10  3.936 ± 0.014  ns/op
VarintDecodingBenchmark.oldReadRawVarint32               MEDIUM         1  avgt   10  2.823 ± 0.022  ns/op
VarintDecodingBenchmark.oldReadRawVarint32               MEDIUM       128  avgt   10  2.791 ± 0.029  ns/op
VarintDecodingBenchmark.oldReadRawVarint32               MEDIUM    128000  avgt   10  7.799 ± 0.026  ns/op
VarintDecodingBenchmark.oldReadRawVarint32                  ALL         1  avgt   10  2.812 ± 0.043  ns/op
VarintDecodingBenchmark.oldReadRawVarint32                  ALL       128  avgt   10  2.847 ± 0.032  ns/op
VarintDecodingBenchmark.oldReadRawVarint32                  ALL    128000  avgt   10  8.094 ± 0.015  ns/op
VarintDecodingBenchmark.readRawVarint32                   SMALL         1  avgt   10  1.672 ± 0.040  ns/op
VarintDecodingBenchmark.readRawVarint32                   SMALL       128  avgt   10  2.547 ± 0.047  ns/op
VarintDecodingBenchmark.readRawVarint32                   SMALL    128000  avgt   10  6.572 ± 0.063  ns/op
VarintDecodingBenchmark.readRawVarint32                   LARGE         1  avgt   10  3.724 ± 0.016  ns/op
VarintDecodingBenchmark.readRawVarint32                  MEDIUM         1  avgt   10  3.348 ± 0.024  ns/op
VarintDecodingBenchmark.readRawVarint32                  MEDIUM       128  avgt   10  3.472 ± 0.035  ns/op
VarintDecodingBenchmark.readRawVarint32                  MEDIUM    128000  avgt   10  3.457 ± 0.051  ns/op
VarintDecodingBenchmark.readRawVarint32                     ALL         1  avgt   10  3.378 ± 0.041  ns/op
VarintDecodingBenchmark.readRawVarint32                     ALL       128  avgt   10  3.877 ± 0.041  ns/op
VarintDecodingBenchmark.readRawVarint32                     ALL    128000  avgt   10  8.413 ± 0.092  ns/op

The interesting ones are the MEDIUM cases, which assume there are >= 4 bytes to read (which is very likely), but with a fair distribution of variable-length varints (1, 2, 3, 4 bytes).

Benchmark                                   (inputDistribution)  (inputs)  Mode  Cnt  Score   Error  Units
VarintDecodingBenchmark.oldReadRawVarint32               MEDIUM         1  avgt   10  2.823 ± 0.022  ns/op
VarintDecodingBenchmark.oldReadRawVarint32               MEDIUM       128  avgt   10  2.791 ± 0.029  ns/op
VarintDecodingBenchmark.oldReadRawVarint32               MEDIUM    128000  avgt   10  7.799 ± 0.026  ns/op
VarintDecodingBenchmark.readRawVarint32                  MEDIUM         1  avgt   10  3.348 ± 0.024  ns/op
VarintDecodingBenchmark.readRawVarint32                  MEDIUM       128  avgt   10  3.472 ± 0.035  ns/op
VarintDecodingBenchmark.readRawVarint32                  MEDIUM    128000  avgt   10  3.457 ± 0.051  ns/op

This last point about varint lengths may not be very realistic if most protobuf messages have "similar" lengths.

NOTE: Due to the highly accurate branch predictor of my Ryzen box, the number of inputs at which branch misses become relevant is ~128K, but as usual, YMMV.

NOTE 2: I'm not very proud of this benchmark because, due to the variable nature of the input(s), average times can be less meaningful - and the same applies to perf counters (if -prof perfnorm is used...), since the number of instructions/branches etc. for one of the algorithms depends on the stop bit's position.

franz1981 force-pushed the 4.1_branchless_varint branch 2 times, most recently from 6117214 to d029cb3, May 12, 2024 20:29
return readRawVarint40(buffer, wholeOrMore);
}
int bitsToKeep = Integer.numberOfTrailingZeros(firstOneOnStop) + 1;
buffer.skipBytes(bitsToKeep >> 3);
@franz1981 (Contributor, Author) commented on the diff lines above:
I've noticed something very fun here: if I use readerIndex(int) it gets slightly slower :(
which is very weird, because (FYI @chrisvest) skipBytes checks accessibility, while readerIndex(int) does not!

@chrisvest (Contributor)

How much use is the protobuf codec seeing? I thought grpc-java implemented their own.

@franz1981 (Contributor, Author)

IDK, this was a weekend fun PR, so I haven't verified that yet :P

If so, should I send this PR elsewhere?

@franz1981 (Contributor, Author) commented May 13, 2024

The interesting bit, as proposed by @jasonk000 in a different chat (informal and nerdy), is that this same approach could be reused for UTF-8 decoding too, although it would need special handling of the Latin (single-byte) case.

// this is not brilliant, we have a data dependency here
int wholeWithContinuations = (wholeOrMore << shiftToHighlightTheWhole) >> shiftToHighlightTheWhole;
// mix them up as per varint spec while dropping the continuation bits
int result = wholeWithContinuations & 0x7F |
A reviewer commented on the diff lines above:

This can be done in two steps instead of four.

wwc = (wwc & 0x7F007F) | ((wwc & 0x7F007F00) >> 1);
wwc = (wwc & 0x3FFF) | ((wwc & 0x3FFF0000) >> 2);

Probably no difference, just for the kicks.
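A quick standalone check (illustration only, not part of the PR) that the two-step fold matches the straightforward four-way OR for any 32-bit input:

// The two-step fold gives the same result as extracting each 7-bit payload
// separately: continuation bits at positions 7, 15, 23 and 31 are dropped either way.
final class FoldCheck {
    static int foldTwoSteps(int wwc) {
        wwc = (wwc & 0x7F007F) | ((wwc & 0x7F007F00) >> 1);
        return (wwc & 0x3FFF) | ((wwc & 0x3FFF0000) >> 2);
    }

    static int foldFourSteps(int wwc) {
        return (wwc & 0x7F)
                | ((wwc >> 8) & 0x7F) << 7
                | ((wwc >> 16) & 0x7F) << 14
                | ((wwc >> 24) & 0x7F) << 21;
    }

    public static void main(String[] args) {
        java.util.Random random = new java.util.Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int wwc = random.nextInt();
            if (foldTwoSteps(wwc) != foldFourSteps(wwc)) {
                throw new AssertionError(Integer.toHexString(wwc));
            }
        }
        System.out.println("two-step fold matches the four-way OR");
    }
}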

@franz1981 (Contributor, Author), May 13, 2024:
There's a "small" but real improvement; worth using this, IMO, many thanks!

Let me know if the tons of comments make sense!

Another review comment:

In theory this would also be a fit for the PEXT instruction exposed via Integer.compress since Java 19. But even when using Java 19+, that's risky because e.g. Zen 2 doesn't have proper hardware for that instruction, and other architectures fall back to a more expensive implementation.
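For reference, a sketch of what that would look like (assumes Java 19+; wholeOrMore and thisVarintMask as in the snippets above; illustration only, not part of the PR):

// Integer.compress gathers the bits selected by the mask into the low bits,
// so selecting the 7 payload bits of each byte does the whole fold in one call.
// It can compile down to PEXT on x86 with BMI2, but PEXT is microcoded (slow)
// on Zen 1/2 and the fallback is more expensive on other architectures.
static int foldWithCompress(int wholeOrMore, int thisVarintMask) {
    int wholeWithContinuations = wholeOrMore & thisVarintMask;
    return Integer.compress(wholeWithContinuations, 0x7F7F7F7F);
}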

@franz1981 (Contributor, Author):
This is very nice, actually - let me check on the JDK main

@franz1981 (Contributor, Author) commented May 13, 2024

These are the results after @bonzini's smart suggestion!

Benchmark                                (inputDistribution)  (inputs)  Mode  Cnt  Score   Error  Units
VarintDecodingBenchmark.readRawVarint32                SMALL         1  avgt   10  1.507 ± 0.009  ns/op
VarintDecodingBenchmark.readRawVarint32                SMALL       128  avgt   10  2.268 ± 0.019  ns/op
VarintDecodingBenchmark.readRawVarint32                SMALL    128000  avgt   10  6.259 ± 0.056  ns/op
VarintDecodingBenchmark.readRawVarint32                LARGE         1  avgt   10  3.055 ± 0.022  ns/op
VarintDecodingBenchmark.readRawVarint32               MEDIUM         1  avgt   10  3.094 ± 0.033  ns/op
VarintDecodingBenchmark.readRawVarint32               MEDIUM       128  avgt   10  3.090 ± 0.021  ns/op
VarintDecodingBenchmark.readRawVarint32               MEDIUM    128000  avgt   10  3.066 ± 0.019  ns/op
VarintDecodingBenchmark.readRawVarint32                  ALL         1  avgt   10  3.010 ± 0.021  ns/op
VarintDecodingBenchmark.readRawVarint32                  ALL       128  avgt   10  3.480 ± 0.025  ns/op
VarintDecodingBenchmark.readRawVarint32                  ALL    128000  avgt   10  8.041 ± 0.085  ns/op

which shows a relevant improvement instead; it is likely that although I relied on there being no dependency between the 4 byte masks (between each other), I had introduced a dependency on how to obtain a clean version (the absolute value) of each, while this version doesn't (the IPC is now approaching 4.9/5 on my Ryzen, while it was at 4.7). Well done!

franz1981 marked this pull request as ready for review May 13, 2024 11:09
@franz1981 (Contributor, Author)

It would be great if any Apple users could give it a shot.

@bonzini commented May 13, 2024

Ah, the same bit packing trick can also be applied to readRawVarint40.

Also, I think this:

        int shiftToHighlightTheWhole = 32 - bitsToKeep;
        // this is not brilliant, we have a data dependency here
        int wholeWithContinuations = (wholeOrMore << shiftToHighlightTheWhole) >> shiftToHighlightTheWhole;

can indeed be written with a lot fewer data dependencies:

    int thisVarintMask = firstOneOnStop ^ (firstOneOnStop - 1);
    int wholeWithContinuations = wholeOrMore & thisVarintMask;

The idea is that thisVarintMask has 0s above the first one of firstOneOnStop, and 1s at and below it. For example if firstOneOnStop is 0x800080 (where the last 0x80 is the only one that matters), then thisVarintMask is 0xFF.

On x86, computing thisVarintMask is even a single BLSMSK instruction! (And indeed my train of thought started with "there must be a BMI instruction for that and maybe you can write it in Java")
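A tiny standalone illustration of that mask (not from the PR):

// x ^ (x - 1) keeps the lowest set bit of x and sets every bit below it;
// on x86 this is the single BLSMSK instruction.
final class BlsmskDemo {
    public static void main(String[] args) {
        int firstOneOnStop = 0x800080;  // stop bits at byte 0 and byte 2; only the lowest matters
        int thisVarintMask = firstOneOnStop ^ (firstOneOnStop - 1);
        System.out.println(Integer.toHexString(thisVarintMask));  // prints "ff"
    }
}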

@franz1981 (Contributor, Author) commented May 13, 2024

Thanks @bonzini, I didn't know of the BLSMSK instruction (!!!): I had this previous comment before d029cb3#diff-1550582c1bcfe50538698be3c39964078cca007dc5acb1503d54416a659a80faR86-R91

which should be similar (because I still need the bitsToKeep) - but I have to double-check the asm produced...
I didn't measure any improvement at that time; worth giving it a shot again!

@bonzini commented May 13, 2024

> but I didn't measure any improvement at that time, worth giving it a shot again!

Another possibility is this small optimization on 32-x.

        int bitsToStop = Integer.numberOfTrailingZeros(firstOneOnStop);
        buffer.skipBytes((bitsToStop + 1) >> 3);
        /* x <= 24, so: 32 - (x + 1) = 31 - x = x ^ 31 */
        int shiftToHighlightTheWhole = bitsToStop ^ 31;

@franz1981 (Contributor, Author)

And @bonzini was very right; this new version improves perf a bit further:

Benchmark                                (inputDistribution)  (inputs)  Mode  Cnt  Score   Error  Units
VarintDecodingBenchmark.readRawVarint32               MEDIUM         1  avgt   10  2.938 ± 0.036  ns/op
VarintDecodingBenchmark.readRawVarint32               MEDIUM       128  avgt   10  2.975 ± 0.024  ns/op
VarintDecodingBenchmark.readRawVarint32               MEDIUM    128000  avgt   10  2.973 ± 0.039  ns/op

making this very similar in perf to the original method, for predictable inputs as well

@franz1981 (Contributor, Author)

> Another possibility is this small optimization on 32-x.

This one doesn't seem to improve things (rather the opposite), but as usual it's my Ryzen story and I have to read the ASM first. Thanks, though, for #14050 (comment): my previous branch at d029cb3#diff-1550582c1bcfe50538698be3c39964078cca007dc5acb1503d54416a659a80faR86-R91 was broken, because I didn't use the ^ to null out the garbage to the left of the important bits.

chrisvest merged commit 7ad2b91 into netty:4.1 on May 14, 2024
17 checks passed
chrisvest pushed a commit that referenced this pull request May 14, 2024
@chrisvest (Contributor)

Thanks, @franz1981 !
