-
-
Notifications
You must be signed in to change notification settings - Fork 15.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Branchless varint decoding #14050
Branchless varint decoding #14050
Conversation
These are the results of the benchmark on my Ryzen box
The interesting ones are the the
This last point on varint bytes could be not very realistic if most of the protobuf messages have "similar" length NOTE Due to the highly accurate branch predictor of my Ryzen box, the number of inputs which makes the branch-misses to become relevant is ~128K, but as usual, YMMV. NOTE 2: not very proud of this benchmark, because, due to the variable nature of input(s), avgs times can be less meaningful - and the same applies to perf counters (if |
6117214
to
d029cb3
Compare
Motivation: varint decoding fall shortly when input length is not predictable Modification: Implements branchless and batchy varint decoding Result: Better performance when inputs length is not predictable
d029cb3
to
57defc5
Compare
return readRawVarint40(buffer, wholeOrMore); | ||
} | ||
int bitsToKeep = Integer.numberOfTrailingZeros(firstOneOnStop) + 1; | ||
buffer.skipBytes(bitsToKeep >> 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've noticed something very fun here: if I use readerIndex(int)
it get slightly slower :(
which is very weird, because (FYI @chrisvest ), skipBytes
check accessibility, while readerIndex(int)
, nope!
How much use is the protobuf codec seeing? I thought grpc-java implemented their own. |
IDK, it was a weekend fun PR, this one, so, I didn't verified yet :P In case, I should send this PR elsewhere? |
The interesting bit, as proposed by @jasonk000 in a different chat (informal and nerdy), is that this same approach could be reused on utf8 decoding too, although it needs special handling of the Latin case (single byte) |
// this is not brilliant, we have a data dependency here | ||
int wholeWithContinuations = (wholeOrMore << shiftToHighlightTheWhole) >> shiftToHighlightTheWhole; | ||
// mix them up as per varint spec while dropping the continuation bits | ||
int result = wholeWithContinuations & 0x7F | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be done in two steps instead of four.
wwc = (wwc & 0x7F007F) | ((wwc & 0x7F007F00) >> 1);
wwc = (wwc & 0x3FFF) | ((wwc & 0x3FFF0000) >> 2);
Probably no difference, just for the kicks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a "small" but real improvement, worthy to use this, IMO, many thanks!
Let me know if the tons of comments make sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory this would also be a fit for the PEXT instruction exposed via Integer.compress
since Java 19. But even when using Java 19+, that's risky because e.g. Zen 2 doesn't have proper hardware for that instruction, and other architectures fall back to a more expensive implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very nice, actually - let me check on the JDK main
These are the changes after @bonzini smart suggestion!
which shows a relevant improvement, instead; it is likely that although I relied on not have dependency between the 4 bytes masks (between each others), I've introduced dep on how to obtain a clean version (the absolute value) of each, while this version won't do it (the IPC is now approaching 4.9/5 on my Ryzen, while it was at 4.7), well done! |
would be great if any Apple users could give it a shot |
Ah, the same bit packing trick can also be applied to readRawVarint40. Also, I think this:
can indeed be written with a lot fewer data dependencies:
The idea is that thisVarintMask has 0s above the first one of firstOneOnStop, and 1s at and below it. For example if firstOneOnStop is 0x800080 (where the last 0x80 is the only one that matters), then thisVarintMask is 0xFF. On x86, computing thisVarintMask is even a single BLSMSK instruction! (And indeed my train of thought started with "there must be a BMI instruction for that and maybe you can write it in Java") |
Thanks @bonzini I didn't knew of the BLSMSK instruction (!!!): I had this previous comment before d029cb3#diff-1550582c1bcfe50538698be3c39964078cca007dc5acb1503d54416a659a80faR86-R91 which should be similar (because I still need the |
Another possibility is this small optimization on
|
And @bonzini was very right, this new version is further improving perf a bit:
making this very similar in perf than the original method, for predictable inputs as well |
this one doesn't seem to improve (the opposite), but as usual: my Ryzen story, I have to read the ASM first, but thanks for #14050 (comment): my previous branch at d029cb3#diff-1550582c1bcfe50538698be3c39964078cca007dc5acb1503d54416a659a80faR86-R91 was broken, because I didn't used the |
Motivation: varint decoding fall shortly when input length is not predictable Modification: Implements branchless and batchy varint decoding Result: Better performance when inputs length is not predictable
Thanks, @franz1981 ! |
Motivation:
varint decoding fall shortly when input length is not predictable
Modification:
Implements branchless and batchy varint decoding
Result:
Better performance when inputs length is not predictable