
Branchless varint decoding #14050

Merged · 3 commits merged into netty:4.1 on May 14, 2024

Conversation

franz1981 (Contributor)

Motivation:

varint decoding falls short when the input length is not predictable

Modification:

Implements branchless and batched varint decoding

Result:

Better performance when the input length is not predictable
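For orientation, here is a minimal sketch of the branchless fast path discussed in this thread, reconstructed from the snippets quoted below; it is not the exact code that was merged, and the byte-by-byte slow path merely stands in for the PR's readRawVarint40:

import io.netty.buffer.ByteBuf;
import io.netty.handler.codec.CorruptedFrameException;

final class BranchlessVarintSketch {

    // Fast path: assumes at least 4 readable bytes, as in the MEDIUM benchmark case.
    static int readRawVarint32(ByteBuf buffer) {
        // peek 4 bytes at once, little-endian, without moving readerIndex yet
        int wholeOrMore = buffer.getIntLE(buffer.readerIndex());
        // a byte with the continuation bit (0x80) cleared terminates the varint;
        // this isolates the first such "stop" byte
        int firstOneOnStop = ~wholeOrMore & 0x80808080;
        if (firstOneOnStop == 0) {
            // the varint spans more than 4 bytes: the PR delegates to readRawVarint40,
            // here a plain loop stands in for it
            return readRawVarint32SlowPath(buffer);
        }
        int bitsToKeep = Integer.numberOfTrailingZeros(firstOneOnStop) + 1;
        buffer.skipBytes(bitsToKeep >> 3);
        // BLSMSK-style mask: 1s at and below the first stop bit, 0s above it
        int thisVarintMask = firstOneOnStop ^ (firstOneOnStop - 1);
        int wholeWithContinuations = wholeOrMore & thisVarintMask;
        // fold the four 7-bit payloads together while dropping the continuation bits
        int w = (wholeWithContinuations & 0x7F007F) | ((wholeWithContinuations & 0x7F007F00) >> 1);
        return (w & 0x3FFF) | ((w & 0x3FFF0000) >> 2);
    }

    private static int readRawVarint32SlowPath(ByteBuf buffer) {
        int result = 0;
        for (int shift = 0; shift < 32; shift += 7) {
            byte b = buffer.readByte();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new CorruptedFrameException("malformed varint");
    }
}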

@franz1981 (Contributor, Author) commented May 12, 2024

These are the results of the benchmark on my Ryzen box

Benchmark                                   (inputDistribution)  (inputs)  Mode  Cnt  Score   Error  Units
VarintDecodingBenchmark.oldReadRawVarint32                SMALL         1  avgt   10  1.547 ± 0.021  ns/op
VarintDecodingBenchmark.oldReadRawVarint32                SMALL       128  avgt   10  2.460 ± 0.023  ns/op
VarintDecodingBenchmark.oldReadRawVarint32                SMALL    128000  avgt   10  6.519 ± 0.092  ns/op
VarintDecodingBenchmark.oldReadRawVarint32                LARGE         1  avgt   10  3.936 ± 0.014  ns/op
VarintDecodingBenchmark.oldReadRawVarint32               MEDIUM         1  avgt   10  2.823 ± 0.022  ns/op
VarintDecodingBenchmark.oldReadRawVarint32               MEDIUM       128  avgt   10  2.791 ± 0.029  ns/op
VarintDecodingBenchmark.oldReadRawVarint32               MEDIUM    128000  avgt   10  7.799 ± 0.026  ns/op
VarintDecodingBenchmark.oldReadRawVarint32                  ALL         1  avgt   10  2.812 ± 0.043  ns/op
VarintDecodingBenchmark.oldReadRawVarint32                  ALL       128  avgt   10  2.847 ± 0.032  ns/op
VarintDecodingBenchmark.oldReadRawVarint32                  ALL    128000  avgt   10  8.094 ± 0.015  ns/op
VarintDecodingBenchmark.readRawVarint32                   SMALL         1  avgt   10  1.672 ± 0.040  ns/op
VarintDecodingBenchmark.readRawVarint32                   SMALL       128  avgt   10  2.547 ± 0.047  ns/op
VarintDecodingBenchmark.readRawVarint32                   SMALL    128000  avgt   10  6.572 ± 0.063  ns/op
VarintDecodingBenchmark.readRawVarint32                   LARGE         1  avgt   10  3.724 ± 0.016  ns/op
VarintDecodingBenchmark.readRawVarint32                  MEDIUM         1  avgt   10  3.348 ± 0.024  ns/op
VarintDecodingBenchmark.readRawVarint32                  MEDIUM       128  avgt   10  3.472 ± 0.035  ns/op
VarintDecodingBenchmark.readRawVarint32                  MEDIUM    128000  avgt   10  3.457 ± 0.051  ns/op
VarintDecodingBenchmark.readRawVarint32                     ALL         1  avgt   10  3.378 ± 0.041  ns/op
VarintDecodingBenchmark.readRawVarint32                     ALL       128  avgt   10  3.877 ± 0.041  ns/op
VarintDecodingBenchmark.readRawVarint32                     ALL    128000  avgt   10  8.413 ± 0.092  ns/op

The interesting ones are the MEDIUM cases, which assume there are >= 4 bytes to read (which is very likely), but with a fair distribution of variable-length varints (1, 2, 3, 4 bytes).

Benchmark                                   (inputDistribution)  (inputs)  Mode  Cnt  Score   Error  Units
VarintDecodingBenchmark.oldReadRawVarint32               MEDIUM         1  avgt   10  2.823 ± 0.022  ns/op
VarintDecodingBenchmark.oldReadRawVarint32               MEDIUM       128  avgt   10  2.791 ± 0.029  ns/op
VarintDecodingBenchmark.oldReadRawVarint32               MEDIUM    128000  avgt   10  7.799 ± 0.026  ns/op
VarintDecodingBenchmark.readRawVarint32                  MEDIUM         1  avgt   10  3.348 ± 0.024  ns/op
VarintDecodingBenchmark.readRawVarint32                  MEDIUM       128  avgt   10  3.472 ± 0.035  ns/op
VarintDecodingBenchmark.readRawVarint32                  MEDIUM    128000  avgt   10  3.457 ± 0.051  ns/op

This last point about varint lengths may not be very realistic if most protobuf messages have "similar" lengths.

NOTE: Due to the highly accurate branch predictor of my Ryzen box, the number of inputs at which branch misses become relevant is ~128K, but as usual, YMMV.

NOTE 2: I'm not very proud of this benchmark because, due to the variable nature of the input(s), average times can be less meaningful - and the same applies to perf counters (if -prof perfnorm is used...), since the number of instructions/branches etc. for one of the algorithms depends on the stop bit's position.

franz1981 force-pushed the 4.1_branchless_varint branch 2 times, most recently from 6117214 to d029cb3, May 12, 2024 20:29
return readRawVarint40(buffer, wholeOrMore);
}
int bitsToKeep = Integer.numberOfTrailingZeros(firstOneOnStop) + 1;
buffer.skipBytes(bitsToKeep >> 3);
@franz1981 (Contributor, Author) commented on the diff lines above:
I've noticed something very fun here: if I use readerIndex(int) it gets slightly slower :(
which is very weird, because (FYI @chrisvest) skipBytes checks accessibility, while readerIndex(int) does not!

@chrisvest (Contributor)

How much use is the protobuf codec seeing? I thought grpc-java implemented their own.

@franz1981 (Contributor, Author)

IDK, this was a weekend fun PR, so I haven't verified that yet :P

If so, should I send this PR elsewhere?

@franz1981 (Contributor, Author) commented May 13, 2024

The interesting bit, as proposed by @jasonk000 in a different chat (informal and nerdy), is that this same approach could be reused for UTF-8 decoding too, although it would need special handling of the Latin (single-byte) case.

// this is not brilliant, we have a data dependency here
int wholeWithContinuations = (wholeOrMore << shiftToHighlightTheWhole) >> shiftToHighlightTheWhole;
// mix them up as per varint spec while dropping the continuation bits
int result = wholeWithContinuations & 0x7F |
A reviewer commented on the diff lines above:

This can be done in two steps instead of four.

wwc = (wwc & 0x7F007F) | ((wwc & 0x7F007F00) >> 1);
wwc = (wwc & 0x3FFF) | ((wwc & 0x3FFF0000) >> 2);

Probably no difference, just for the kicks.
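A quick standalone check (illustration only, not part of the PR) that the two-step fold matches the straightforward four-way OR for any 32-bit input:

// The two-step fold gives the same result as extracting each 7-bit payload
// separately: continuation bits at positions 7, 15, 23 and 31 are dropped either way.
final class FoldCheck {
    static int foldTwoSteps(int wwc) {
        wwc = (wwc & 0x7F007F) | ((wwc & 0x7F007F00) >> 1);
        return (wwc & 0x3FFF) | ((wwc & 0x3FFF0000) >> 2);
    }

    static int foldFourSteps(int wwc) {
        return (wwc & 0x7F)
                | ((wwc >> 8) & 0x7F) << 7
                | ((wwc >> 16) & 0x7F) << 14
                | ((wwc >> 24) & 0x7F) << 21;
    }

    public static void main(String[] args) {
        java.util.Random random = new java.util.Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int wwc = random.nextInt();
            if (foldTwoSteps(wwc) != foldFourSteps(wwc)) {
                throw new AssertionError(Integer.toHexString(wwc));
            }
        }
        System.out.println("two-step fold matches the four-way OR");
    }
}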

@franz1981 (Contributor, Author), May 13, 2024:
There's a "small" but real improvement; worth using this, IMO, many thanks!

Let me know if the tons of comments make sense!

Another review comment:

In theory this would also be a fit for the PEXT instruction exposed via Integer.compress since Java 19. But even when using Java 19+, that's risky because e.g. Zen 2 doesn't have proper hardware for that instruction, and other architectures fall back to a more expensive implementation.
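For reference, a sketch of what that would look like (assumes Java 19+; wholeOrMore and thisVarintMask as in the snippets above; illustration only, not part of the PR):

// Integer.compress gathers the bits selected by the mask into the low bits,
// so selecting the 7 payload bits of each byte does the whole fold in one call.
// It can compile down to PEXT on x86 with BMI2, but PEXT is microcoded (slow)
// on Zen 1/2 and the fallback is more expensive on other architectures.
static int foldWithCompress(int wholeOrMore, int thisVarintMask) {
    int wholeWithContinuations = wholeOrMore & thisVarintMask;
    return Integer.compress(wholeWithContinuations, 0x7F7F7F7F);
}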

@franz1981 (Contributor, Author):
This is very nice, actually - let me check on the JDK main

@franz1981 (Contributor, Author) commented May 13, 2024

These are the results after @bonzini's smart suggestion!

Benchmark                                (inputDistribution)  (inputs)  Mode  Cnt  Score   Error  Units
VarintDecodingBenchmark.readRawVarint32                SMALL         1  avgt   10  1.507 ± 0.009  ns/op
VarintDecodingBenchmark.readRawVarint32                SMALL       128  avgt   10  2.268 ± 0.019  ns/op
VarintDecodingBenchmark.readRawVarint32                SMALL    128000  avgt   10  6.259 ± 0.056  ns/op
VarintDecodingBenchmark.readRawVarint32                LARGE         1  avgt   10  3.055 ± 0.022  ns/op
VarintDecodingBenchmark.readRawVarint32               MEDIUM         1  avgt   10  3.094 ± 0.033  ns/op
VarintDecodingBenchmark.readRawVarint32               MEDIUM       128  avgt   10  3.090 ± 0.021  ns/op
VarintDecodingBenchmark.readRawVarint32               MEDIUM    128000  avgt   10  3.066 ± 0.019  ns/op
VarintDecodingBenchmark.readRawVarint32                  ALL         1  avgt   10  3.010 ± 0.021  ns/op
VarintDecodingBenchmark.readRawVarint32                  ALL       128  avgt   10  3.480 ± 0.025  ns/op
VarintDecodingBenchmark.readRawVarint32                  ALL    128000  avgt   10  8.041 ± 0.085  ns/op

which shows a relevant improvement instead; it is likely that although I relied on there being no dependency between the 4 byte masks (between each other), I had introduced a dependency on how to obtain a clean version (the absolute value) of each, while this version doesn't (the IPC is now approaching 4.9/5 on my Ryzen, while it was at 4.7). Well done!

franz1981 marked this pull request as ready for review May 13, 2024 11:09
@franz1981 (Contributor, Author)

It would be great if any Apple users could give it a shot.

@bonzini commented May 13, 2024

Ah, the same bit packing trick can also be applied to readRawVarint40.

Also, I think this:

        int shiftToHighlightTheWhole = 32 - bitsToKeep;
        // this is not brilliant, we have a data dependency here
        int wholeWithContinuations = (wholeOrMore << shiftToHighlightTheWhole) >> shiftToHighlightTheWhole;

can indeed be written with a lot fewer data dependencies:

    int thisVarintMask = firstOneOnStop ^ (firstOneOnStop - 1);
    int wholeWithContinuations = wholeOrMore & thisVarintMask;

The idea is that thisVarintMask has 0s above the first one of firstOneOnStop, and 1s at and below it. For example if firstOneOnStop is 0x800080 (where the last 0x80 is the only one that matters), then thisVarintMask is 0xFF.

On x86, computing thisVarintMask is even a single BLSMSK instruction! (And indeed my train of thought started with "there must be a BMI instruction for that and maybe you can write it in Java")
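A tiny standalone illustration of that mask (not from the PR):

// x ^ (x - 1) keeps the lowest set bit of x and sets every bit below it;
// on x86 this is the single BLSMSK instruction.
final class BlsmskDemo {
    public static void main(String[] args) {
        int firstOneOnStop = 0x800080;  // stop bits at byte 0 and byte 2; only the lowest matters
        int thisVarintMask = firstOneOnStop ^ (firstOneOnStop - 1);
        System.out.println(Integer.toHexString(thisVarintMask));  // prints "ff"
    }
}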

@franz1981 (Contributor, Author) commented May 13, 2024

Thanks @bonzini, I didn't know of the BLSMSK instruction (!!!): I had this previous comment before d029cb3#diff-1550582c1bcfe50538698be3c39964078cca007dc5acb1503d54416a659a80faR86-R91

which should be similar (because I still need the bitsToKeep) - but I have to double-check the asm produced...
I didn't measure any improvement at that time; worth giving it a shot again!

@bonzini commented May 13, 2024

> but I didn't measure any improvement at that time, worth giving it a shot again!

Another possibility is this small optimization on 32-x.

        int bitsToStop = Integer.numberOfTrailingZeros(firstOneOnStop);
        buffer.skipBytes((bitsToStop + 1) >> 3);
        /* x <= 24, so: 32 - (x + 1) = 31 - x = x ^ 31 */
        int shiftToHighlightTheWhole = bitsToStop ^ 31;

@franz1981 (Contributor, Author)

And @bonzini was very right; this new version improves perf a bit further:

Benchmark                                (inputDistribution)  (inputs)  Mode  Cnt  Score   Error  Units
VarintDecodingBenchmark.readRawVarint32               MEDIUM         1  avgt   10  2.938 ± 0.036  ns/op
VarintDecodingBenchmark.readRawVarint32               MEDIUM       128  avgt   10  2.975 ± 0.024  ns/op
VarintDecodingBenchmark.readRawVarint32               MEDIUM    128000  avgt   10  2.973 ± 0.039  ns/op

making this very similar in perf to the original method, for predictable inputs as well

@franz1981 (Contributor, Author)

> Another possibility is this small optimization on 32-x.

This one doesn't seem to improve things (rather the opposite), but as usual it's my Ryzen story and I have to read the ASM first. Thanks, though, for #14050 (comment): my previous branch at d029cb3#diff-1550582c1bcfe50538698be3c39964078cca007dc5acb1503d54416a659a80faR86-R91 was broken, because I didn't use the ^ to null out the garbage to the left of the important bits.

chrisvest merged commit 7ad2b91 into netty:4.1 on May 14, 2024
17 checks passed
chrisvest pushed a commit that referenced this pull request May 14, 2024
@chrisvest (Contributor)

Thanks, @franz1981 !
