Fix scalability issue due to checkcast on context's invoke operations #12806
Conversation
Motivation: ChannelDuplexHandler can implement both ChannelOutboundHandler and ChannelInboundHandler, causing a scalability issue due to checkcast, see https://bugs.openjdk.org/browse/JDK-8180450. Modifications: Peel off the invoke methods, turning the checkcast vs interfaces into an instanceof vs ChannelDuplexHandler, preventing the scalability issue. Sadly, if users manually implement both ChannelOutboundHandler and ChannelInboundHandler without extending ChannelDuplexHandler, the fix won't be enough. Result: Scalable duplex channel handler operations.
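The shape of the fix can be sketched as follows. This is a minimal, self-contained illustration with hypothetical stand-in types, not the real Netty code (which lives in AbstractChannelHandlerContext and uses the actual Netty interfaces):

```java
// Stand-ins mirroring the Netty types involved; names are illustrative only.
interface InboundHandler { String channelRead(String msg); }
interface OutboundHandler { String write(String msg); }

// A concrete class implementing both interfaces, like ChannelDuplexHandler.
class DuplexHandler implements InboundHandler, OutboundHandler {
    public String channelRead(String msg) { return "in:" + msg; }
    public String write(String msg) { return "out:" + msg; }
}

class PeeledInvoke {
    // Before: a checkcast against the *interface* on every call. Interfaces
    // are "secondary supers" in HotSpot, so this goes through the per-class
    // secondary_super_cache, which is shared mutable state and can cause
    // cache-line bouncing under concurrency (JDK-8180450).
    static String invokeSlow(Object handler, String msg) {
        return ((InboundHandler) handler).channelRead(msg);
    }

    // After: peel off the frequent concrete type first. instanceof against a
    // concrete class compiles to a cheap klass-pointer comparison and never
    // touches the shared secondary_super_cache.
    static String invokeFast(Object handler, String msg) {
        if (handler instanceof DuplexHandler) {
            return ((DuplexHandler) handler).channelRead(msg);
        }
        return ((InboundHandler) handler).channelRead(msg);
    }
}
```

Both paths produce the same result; only the type-check machinery the JIT emits differs, which is why the fix is behavior-neutral.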
In theory this is already working @normanmaurer @chrisvest: 4.5 M req/sec -> 5.8 M req/sec. And the flamegraphs look so much better too, and the same for the outbound side. In short, no weird costs (that were costing as much as HTTP decoding/encoding) while firing/traversing the channel pipeline |
FYI @doom369 |
FYI given that the head/tail elements of the pipeline are concrete classes with a high probability of being observed, it seems that adding checks vs them won't help that much, because the JIT is quite capable of guarding for the "frequent" cases without making us change any line of code; but I'm open to changing this and adding explicit equality check(s) too (not using any instanceof), if it makes sense and doesn't rely on internal JDK counters. |
I agree |
Agree. However, from the other side - you never know when and if a certain JIT optimization will kick in. That's, I would say, the main problem with the JIT: you can't predict it and thus you can't rely on it. So the only real option is to profile under the specific conditions that fit your use case. Also, let's not forget that the JIT is not free either, especially at server start. |
Re the head/tail instance checks: I have tried doing it manually, using different probabilities to match what the JIT decides from the bytecode MethodData, and it wasn't bringing any visible improvement. |
Restarted the build because of #12809 |
@normanmaurer Running benchmarks again on this version (shouldn't be a problem TBH, but better be safe...) |
numbers on the CI are strange - I'm investigating if it's due this last commit |
@franz1981 Ping me when it's ready for review again. |
@franz1981 is it also true that we don't have this issue in netty 5 anymore, as there is only ChannelHandler? |
Yep @normanmaurer it shouldn't be a problem anymore but better have a proper reproducer and verify it... |
Yep
|
In the meantime, while we're stabilizing the perf CI with a proper system test, I've created a microbenchmark to help reproduce this;
Now:
while on 4.1:
Please don't compare the same test across versions, because right now we benefit from bimorphic inlining of the duplex invoke calls (for this bench), while on 4.1 we don't. Below are a few metrics. Now, parallel:
And single threaded:
Most numbers are nearly 4 times higher, as expected from the benchmark parallelism, without outliers; while on 4.1:
And single-threaded:
It presents a very different picture, with 4X more "hits" on shared lines, a sign of some form of heavy contention. The load at 0x00007f1065356eee (mov 0x18(%rsi),%r10) is indeed carrying a huge cost (although reported one instruction later due to skidding).
C2, level 4, io.netty.microbench.channel.DefaultChannelPipelineDuplexHandlerBenchmark$2::channelReadComplete, version 877 (483 bytes)
0.11% ↘ 0x00007f1065356edc: mov 0x8(%r11),%r10d
0x00007f1065356ee0: movabs $0x0,%rsi
0.01% 0x00007f1065356eea: lea (%rsi,%r10,8),%rsi
0.03% 0x00007f1065356eee: mov 0x18(%rsi),%r10
12.80% 0x00007f1065356ef2: movabs $0x1001331b8,%rax ; {metadata('io/netty/channel/ChannelInboundHandler')}
0.01% 0x00007f1065356efc: cmp %rax,%r10
╭ 0x00007f1065356eff: jne 0x00007f1065356f24 ;*checkcast
│ ; - io.netty.channel.AbstractChannelHandlerContext::invokeChannelReadComplete@11 (line 410)
│ ; - io.netty.channel.AbstractChannelHandlerContext::invokeChannelReadComplete@15 (line 397)
│ ; - io.netty.channel.AbstractChannelHandlerContext::fireChannelReadComplete@6 (line 390)
│ ; - io.netty.microbench.channel.DefaultChannelPipelineDuplexHandlerBenchmark$2::channelReadComplete@1 (line 50)
0.38% │ ↗ 0x00007f1065356f01: mov %r11,%rsi
0.05% │ │ 0x00007f1065356f04: mov 0x8(%rsp),%rdx
0.38% │ │ 0x00007f1065356f09: movabs $0xffffffffffffffff,%rax
│ │ 0x00007f1065356f13: callq 0x00007f1065046020 ; OopMap{[0]=Oop [8]=Oop off=440}
│ │ ;*invokeinterface channelReadComplete
│ │ ; - io.netty.channel.AbstractChannelHandlerContext::invokeChannelReadComplete@15 (line 410)
│ │ ; - io.netty.channel.AbstractChannelHandlerContext::invokeChannelReadComplete@15 (line 397)
│ │ ; - io.netty.channel.AbstractChannelHandlerContext::fireChannelReadComplete@6 (line 390)
│ │ ; - io.netty.microbench.channel.DefaultChannelPipelineDuplexHandlerBenchmark$2::channelReadComplete@1 (line 50)
│ │ ; {virtual_call}
│ │ 0x00007f1065356f18: add $0x30,%rsp
0.03% │ │ 0x00007f1065356f1c: pop %rbp
0.14% │ │ 0x00007f1065356f1d: test %eax,0x17afc0dd(%rip) # 0x00007f107ce53000
│ │ ; {poll_return}
0.01% │ │ 0x00007f1065356f23: retq
0.10% ↘ │ 0x00007f1065356f24: push %rax
0.02% │ 0x00007f1065356f25: mov %rax,%rax
│ 0x00007f1065356f28: mov 0x20(%rsi),%rdi
0.34% │ 0x00007f1065356f2c: mov (%rdi),%ecx
0.35% │ 0x00007f1065356f2e: add $0x8,%rdi
0.01% │ 0x00007f1065356f32: test %rax,%rax
0.37% │ 0x00007f1065356f35: repnz scas %es:(%rdi),%rax
2.73% │ 0x00007f1065356f38: pop %rax
0.72% ╭│ 0x00007f1065356f39: jne 0x00007f1065356f43
││ 0x00007f1065356f3f: mov %rax,0x18(%rsi)
0.19%  ↘╰  0x00007f1065356f43: je 0x00007f1065356f01
And similarly here:
C2, level 4, io.netty.channel.AbstractChannelHandlerContext::invokeFlush, version 870 (245 bytes)
↘ 0x00007f1065359963: mov 0x8(%r10),%r11d
0.02% 0x00007f1065359967: movabs $0x0,%rsi
0x00007f1065359971: lea (%rsi,%r11,8),%rsi
0.09% 0x00007f1065359975: mov 0x18(%rsi),%r11
12.50% 0x00007f1065359979: cmp %rax,%r11
0x00007f106535997c: jne 0x00007f1065359cbd ;*checkcast
; - io.netty.channel.AbstractChannelHandlerContext::invokeFlush0@4 (line 750)
; - io.netty.channel.AbstractChannelHandlerContext::invokeFlush@8 (line 742)
; - io.netty.channel.AbstractChannelHandlerContext::flush@22 (line 728)
; - io.netty.microbench.channel.DefaultChannelPipelineDuplexHandlerBenchmark$3::flush@1 (line 63)
; - io.netty.channel.AbstractChannelHandlerContext::invokeFlush0@-1 (line 750)
; - io.netty.channel.AbstractChannelHandlerContext::invokeFlush@8 (line 742)
0.44% 0x00007f1065359982: mov 0x8(%r10),%ecx
0.55% 0x00007f1065359986: cmp $0x2002cac3,%ecx ; {metadata('io/netty/microbench/channel/DefaultChannelPipelineDuplexHandlerBenchmark$3')}
0x00007f106535998c: je 0x00007f1065359a81
0.15% 0x00007f1065359992: cmp $0x2002678f,%ecx ; {metadata('io/netty/channel/DefaultChannelPipeline$HeadContext')}
0x00007f1065359998: jne 0x00007f106535a07d ;*invokeinterface flush
; - io.netty.channel.AbstractChannelHandlerContext::invokeFlush0@8 (line 750)
; - io.netty.channel.AbstractChannelHandlerContext::invokeFlush@8 (line 742)
; - io.netty.channel.AbstractChannelHandlerContext::flush@22 (line 728)
; - io.netty.microbench.channel.DefaultChannelPipelineDuplexHandlerBenchmark$3::flush@1 (line 63)
; - io.netty.channel.AbstractChannelHandlerContext::invokeFlush0@-1 (line 750)
              ; - io.netty.channel.AbstractChannelHandlerContext::invokeFlush@8 (line 742)
The same costs are not present in the new version (nor the same instructions because, as said earlier, no slow path is required at all). |
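The contention the perfasm output points at comes from HotSpot's single-entry secondary_super_cache being rewritten on interface type-check misses. A faithful reproducer needs a polluted type profile (several observed classes), otherwise C2 just speculates on the exact type; the JMH benchmark added in this PR does that properly. The core shape, as a rough self-contained sketch with hypothetical types, is:

```java
// Several classes implementing the same two interfaces, so the type profile
// at the instanceof sites stays polymorphic.
interface In {}
interface Out {}
class BothA implements In, Out {}
class BothB implements In, Out {}
class BothC implements In, Out {}

class SuperCacheThrash {
    static volatile boolean sink;

    // Checking objects alternately against In and Out makes each miss rewrite
    // Klass::_secondary_super_cache, a single shared field per class; with
    // multiple threads the containing cache line bounces between cores
    // (JDK-8180450), so the cost shows up as coherence traffic, not CPU work.
    static void spin(Object[] handlers, boolean inbound) {
        for (int i = 0; i < 5_000_000; i++) {
            Object h = handlers[i % handlers.length];
            sink = inbound ? (h instanceof In) : (h instanceof Out);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Object[] hs = { new BothA(), new BothB(), new BothC() };
        Thread t1 = new Thread(() -> spin(hs, true));
        Thread t2 = new Thread(() -> spin(hs, false));
        long t0 = System.nanoTime();
        t1.start(); t2.start(); t1.join(); t2.join();
        System.out.println("elapsed ms: " + (System.nanoTime() - t0) / 1_000_000);
    }
}
```

Note this is only a sketch of the mechanism; measuring it credibly requires a harness like JMH with a perfasm or perf c2c profiler, as done in the PR.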
@chrisvest I don't think the failures are related to this, I've just added a microbench :( |
@franz1981 once happy please merge away :) |
Ran micros to check whether the new changes introduce some slowdown compared to the original code; it seems not (that's a bit weird TBH). Still investigating why |
I've modified the benchmark to add the non-duplex use case too, to evaluate the impact of the changes in a pipeline without duplex handlers. With this PR:
while on 4.1:
|
@normanmaurer @chrisvest I'm happy I've parked this till now: by accident (adding the …) I've noticed something odd. 1 thread:
4 threads:
And this is not using any duplex handler. |
Argh :( found it, @doom369 was right:
final class HeadContext extends AbstractChannelHandlerContext
        implements ChannelOutboundHandler, ChannelInboundHandler {
and indeed my fix isn't able to cover this yet, see my own comment in this same PR.
Usually the JIT is capable of peeling off the call-sites, but sometimes not, as in this benchmark :( |
New results:
to be compared vs #12806 (comment). note: |
@normanmaurer I see that, despite this fix solving most of the problems, it doesn't solve all of them, see ... HttpClientCodecWrapper implements ChannelInboundHandler, ChannelOutboundHandler. These are handlers (and I hope the list is complete) that won't be fixed by this PR and that can have some adverse scalability effect that nobody has noticed yet; some of these are quite widely used too. We then have an additional way to solve this:
Just accept that these specific handlers are not good and fix them with some hack. I'm open to suggestions here @normanmaurer |
@franz1981 I think we should merge this one first and then investigate the others. |
I'm still running tests with type pollution on this PR, but feel free to merge without hurrying the release, so I can still add other fixes |
@normanmaurer I would like to create an issue to track all this effort; it's useful for JDK issues as well :P |
@franz1981 thanks a lot for the amazing work! |
…tions (#13741) Motivation: ChannelDuplexHandler can implement both ChannelOutboundHandler and ChannelInboundHandler, causing a scalability issue due to checkcast (https://bugs.openjdk.org/browse/JDK-8180450). Not only that: there are different classes, e.g. Http2ConnectionHandler, which implement them transitively, by using one of the 2 existing adapters (ChannelInboundHandlerAdapter, ChannelOutboundHandlerAdapter). The existing change at #12806 fixed only the duplex cases, but others like the above were still affected. Modifications: Replace the duplex type checks with broader inbound adapter ones, given that duplex is still based on it. Add outbound adapter type checks in addition to the duplex ones. Result: More scalable adapter-based channel handler operations.
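The broadened dispatch of the follow-up can be sketched like this: check against the adapter base class instead of only ChannelDuplexHandler, so any handler built on the adapters takes the concrete-class fast path. These are hypothetical stand-in types, not the real Netty classes:

```java
// Stand-ins: Inbound ~ ChannelInboundHandler,
// InboundAdapter ~ ChannelInboundHandlerAdapter,
// DuplexAdapter ~ ChannelDuplexHandler (which extends the inbound adapter).
interface Inbound { String read(String m); }

class InboundAdapter implements Inbound {
    public String read(String m) { return "adapter:" + m; }
}

class DuplexAdapter extends InboundAdapter {}

final class Dispatch {
    static String invokeRead(Object handler, String m) {
        // Concrete-class instanceof: a cheap klass-pointer compare that also
        // covers every subclass (duplex included), with no
        // secondary_super_cache traffic.
        if (handler instanceof InboundAdapter) {
            return ((InboundAdapter) handler).read(m);
        }
        // Fallback: interface checkcast for handlers not based on the adapter.
        return ((Inbound) handler).read(m);
    }
}
```

Checking the adapter rather than the duplex class is strictly broader, since DuplexAdapter (like ChannelDuplexHandler) is itself a subclass of the adapter, so the earlier duplex fast path is subsumed.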