New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize method HttpStatusClass.valueOf(int) #13543
Conversation
183bf9e
to
685ad2f
Compare
685ad2f
to
aa19425
Compare
Integer division is pretty slow, while branches are often predictable. I don't think this will be a beneficial change, unless you have a benchmark showing that it's worthwhile. |
OK,I will do a benchmark to check. |
@@ -56,22 +56,20 @@ public boolean contains(int code) { | |||
* Returns the class of the specified HTTP status code. | |||
*/ | |||
public static HttpStatusClass valueOf(int code) { | |||
if (INFORMATIONAL.contains(code)) { | |||
return INFORMATIONAL; | |||
switch (code / 100) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any JMH number? and tableswitch may slower than hand arranged if else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think 'tableswitch' is faster than 'if else' in most common cases. U may mean the 'lookupswitch'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending the number of cases (below 10), the JIT just use a binary-like cascade of if, regardless the bytecode used.
The real problem is the division, given are pretty heavy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@franz1981 The number of cases threshold may depends on the CPU type, for example, the 'MinJumpTableSize' is 5 in sparc: https://github.com/openjdk/jdk8u/blob/master/hotspot/src/cpu/sparc/vm/c2_globals_sparc.hpp#L60.
Further, in common cases, I think the binary-search is faster than the non-optimized 'if-else'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sorry, I tried remember by heart it, but I am biased toward my poor old X86 laptop :P
Said that, I believe bothering about the bytecode used isn't the key point here....but it's a fun exercise to microbench it, to be sure. And it requires some perfasm analysis.
Further, in common cases, I think the binary-search is faster than the non-optimized 'if-else'.
It really depends by the use cases
See #13473 benchmarks and impl for reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's why the SERVER_ERROR
is the last.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@franz1981 , Appreciate for the benchmarks reference, will check it. As you said, the division may be the real bottleneck. I will run the benchmark to verify it.
@He-Pin , I think put SERVER_ERROR
the last doesn't mean it will be faster than binary-search, it depends by the use cases.
If the CPU load goes heavy and cause more SERVER_ERROR
code, then put SERVER_ERROR
the last will amplify the load and make things going worse.
Done a preliminary benchmark, now with 'switch-case':
Before:
Seems Besides, the newest commit that replace the 'switch-case' with 'array-index-lookup', seems gain more performance:
With Java:
And
This is a very initial benchmark, any suggestion will be welcome. @franz1981 @chrisvest @He-Pin |
Yep @langchristian96 the main things are:
Last but not least:
If using a Numa platform please run the full benchmark with numactl --localalloc -N 0 and disable turbo frequency (you can use the latency profile with tuned from RHEL If you wish). A number isn't meaningful if you don't know how/why has happened, IMHO. Note: given that 100 is a constant from JIT perspective maybe there is some math trick (eg reciprocal multiplication) aiming to replace the division with something more appropriate, given that is stored into an int. The perfasm analysis should be done to verify it |
@franz1981 Great suggestions, I will do the furthering tests to check. |
Been busy recently... Here is the latest update so far:
@franz1981 You are right, the JIT did some constant divisor optimizations. The quoted lines does the optimization of
It can be translated to Java code like:
Which means: And the GCC did the similar optimization: https://gcc.godbolt.org/z/3EqjGdfoW. Will keep updating when further tests done. :) |
@laosijikaichele Is this ready for review, or do you have more changes planned? |
@chrisvest , Sorry for the delay, there are some more changes and tests to be committed, hopefully within 12 hours will be done. :) |
Okay. No need to commit to any deadline. Just ping me when I should take a look. |
Just for reference, the optimization seems to happen at https://github.com/openjdk/jdk/blob/5726d31e56530bbe7dee61ae04b126e20cb3611d/src/hotspot/share/opto/divnode.cpp#L408-L411 which still reference Hacker's Delight |
Thanks for the suggestions from @franz1981 , I updated the benchmark code, added 'equal branch distributed' to the set-up method, added 'perf profile' which included 'branch-misses' info. Base line code: There are 2 optimizations committed for benchmark: The test inputs:
As you can see, the percentages are from big to small, which is based on the 'if-else' sequence of 4.1-branch code. There are 5 different
The hardware info(no NUMA):
The benchmark result details are pretty long, you can directly scroll down to the end, there are summary tables. :) First, benchmark with JDK-17:
Base line(if-else), with JDK-17:
'Switch-case', with JDK-17:
Compare with base line, the throughput of 'switch-case' increase[+]/decrease[-]:
'Array-index', with JDK-17:
Compare with base line, the throughput of 'array-index' increase[+]/decrease[-]:
Secondly, on the same hardware, benchmark with JDK-8:
Base line(if-else), with JDK-8:
'Switch-case', with JDK-8:
Compare with base line, the throughput of 'switch-case' increase[+]/decrease[-]:
'Array-index', with JDK-8:
Compare with base line, the throughput of 'array-index' increase[+]/decrease[-]:
Summary of the benchmark results:
Based on the current benchmark results. I think the 'array-index' optimization could gain more performance improvement, and is more worthy to use. WDYT? @chrisvest @franz1981 |
Lovely analysis, is it clear now (and many thanks for the great job!) that, assuming a fair and sensible distribution of branches/if based on input data:
I'm already happy re the state of this PR: the lookup table is pretty small too, so I won't be too concerned about the cache accesses, and sadly I'm not aware of mechanisms to verify the impact, but let me briefly explain it so maybe you'll find something, if you are curious. Lookup table approaches often wins, in cases like these, where the outcome of the lookup can vary and is not very predictable, but at the same time, the numbers on a microbenchs often deal with an artificial state of the caches in the local core, which doesn't happen in the real world, where you can have already plenty of other data in it, due to the domain logic execution; unless the real code perform some tight encoding loop performing lookups (which would amortize the initial miss, if any). I have no idea how to simulate a different behaviour, actually, but collecting @nitsanw we spoken about this very quickly in the last call some time ago |
Just some additional notes @laosijikaichele I've re read a nice article from @nitsanw to better comment (and apologize too):
In short: right now the nano benchmark still have some "noisy" infra, but numbers suggest some effect which lead to think that one approach is better than the others: I'm still not sure the nano benchmark is noisy-free enough to capture the reality (and I have no idea if it can - a static analysis of assembly can produce probably a more valuable insight in that). |
@franz1981 Thanks a lot for the suggestions, I will do more adjustment and try to reduce the 'noise', will update this when have further progress. |
@franz1981 , thanks for the suggestions, the benchmark code has been optimized to try to reduce 'noise':
The test environments/params are still the same with previous, with the new optimized benchmark code, test it again: Base line(if-else), with JDK-17:
'Switch-case', with JDK-17:
'Array-index', with JDK-17:
Base line(if-else), with JDK-8:
'Switch-case', with JDK-8:
'Array-index', with JDK-8:
In summary on chart, the X-axis is 'input link size', the Y-axis is 'throughput': According to the results, the 'array-index' optimization still gain the most improvement. @franz1981 , WDYT? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work, @laosijikaichele.
And interesting find that the array version is that much faster than the alternatives.
This is all really interesting stuff..Thanks for all the work and education. |
@@ -0,0 +1,215 @@ | |||
/* | |||
* Copyright 2019 The Netty Project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2023
CircularLink cl = polluteDistributeHead; | ||
do { | ||
bh.consume(HttpStatusClass.valueOf(cl.value)); | ||
} while ((cl = cl.next) != polluteDistributeHead); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you have used the CircularList linked-list?
It is even more dangerous than using arrays of inputs: thes nodes will likely be more scattered on the heap (or causing trouble to the GC to compact them) and (very important) introduce a data deps which prevent prefetching of next input data (in short, the CPU cannot speculate the next data(s) to be read.
Said that, I am already happy with the current status of the code...so we can move this on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@franz1981 , the reason we used the 'CircularList' is to try avoiding the 'loop unrolling' you mentioned. I will change the input from 'CircularList' to normal int arrays and test it again, then compare the results. I will add the new results soon.
As @franz1981 concerned about the impact of using the 'CircularList' as inputs, I changed the inputs to normal int arrays(commit), and the input arrays are also being re-used on each invocation to avoid GC. Base line(if-else), with JDK-17:
'Switch-case', with JDK-17:
'Array-index', with JDK-17:
Base line(if-else), with JDK-8:
'Switch-case', with JDK-8:
'Array-index', with JDK-8:
In summary on chart, the X-axis is 'input array size', the Y-axis is 'throughput': Compare with using 'CircularLink': |
return SERVER_ERROR; | ||
} | ||
return UNKNOWN; | ||
return statusArray[code / 100]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So array is super fast:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
As we noticed before, the JIT optimized the division with constant divisor to 'multiplication & bit-shift & subtraction' operations. We can translate the optimized assembly to Java method:
If
Because Java does not have For method
When goes to
With this custom div method committed, test it again: 'Array-with-custom-div' with JDK-17:
'Array-with-custom-div' with JDK-8:
Add this results to previous results, and sumarry on chart, the X-axis is 'input array size', the Y-axis is 'throughput': With JDK-17, the 'Array-with-custom-div' version improves '37% - 54%' than 'Array' version. This 'Array-with-custom-div' optimization may seem a little aggressive, because we introduced a custom division method, but still, we can discuss about this. @franz1981 @chrisvest |
I am all in to make it faster, for sure @laosijikaichele and good finding! |
@franz1981 Did you mean we change the |
…measure single invocation throughput of HttpStatusClass.valueOf()
The benchmark class
Following are the test results: Base line(if-else), with JDK-17:
'Switch-case', with JDK-17:
'Switch-case-no-profiling', with JDK-17:
'Array', with JDK-17:
'Array-with-custom-div' with JDK-17:
Base line(if-else), with JDK-8:
'Switch-case', with JDK-8:
'Array', with JDK-8:
'Array-with-custom-div' with JDK-8:
In summary on chart, the X-axis is 'input array size', the Y-axis is 'throughput': There are two notable things:
PS: When do the benchmark with JDK-17, the switch-profiling-optimization can be disabled by adding |
In case of anyone interested about the switch-profiling with JDK-17, there is a new issue created for OpenJDK, please refer https://bugs.openjdk.org/browse/JDK-8316275. |
@laosijikaichele thanks |
Motivation: The method `HttpStatusClass.valueOf(int)` can be optimized by replacing if-branches with array lookup. The array lookup is faster, because dividing by a fixed constant of 100 well by the JIT. Modification: Replace if-branches with array lookup and optimized division. Result: Faster lookup of HTTP status code class. --------- Co-authored-by: laosijikaichele <laosijikaichele>
Motivation:
The method
HttpStatusClass.valueOf(int)
can be optimized by replacing 'if branches' with 'switch cases'.The 'switch cases' can be optimized by JIT to run faster than 'if branches'.
Refer https://stackoverflow.com/a/98024.
Modification:
Replace 'if-branches' with 'switch case'.
Result:
Remove the 'if-branches'.