Allow controlling time flow for EmbeddedEventLoop #12459

yawkat · 2022-06-10T12:58:27Z

Motivation:

Tests using EmbeddedEventLoop can run faster if they can "advance time" so that scheduled tasks (e.g. timeouts) run earlier. Additionally, "freeze time" functionality can improve reliability of such tests.

Modification:

- Introduce a protected method AbstractScheduledEventExecutor.getCurrentTimeNanos that replaces the previous static nanoTime method (now deprecated). Replace usages of nanoTime with the new method.
- Override getCurrentTimeNanos with the new time control (freeze, unfreeze, advanceBy) features in EmbeddedEventLoop.
- Add a microbenchmark that tests one of the sites that seemed most likely to see negative performance impact by the change (ScheduledFutureTask.delayNanos).
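
To make the shape of the change concrete, here is a minimal sketch of an overridable time source with freeze, unfreeze and advance operations. It is not the Netty code: the classes SketchScheduledExecutor and SketchEmbeddedLoop, their fields, and the exact method names other than getCurrentTimeNanos are hypothetical stand-ins chosen for illustration.

```java
/** Hypothetical stand-in for AbstractScheduledEventExecutor (not the Netty class). */
abstract class SketchScheduledExecutor {
    /** Overridable time source; the default simply reads the JVM's monotonic clock. */
    protected long getCurrentTimeNanos() {
        return System.nanoTime();
    }

    /** Remaining delay until an absolute deadline, clamped at zero. */
    final long delayNanos(long deadlineNanos) {
        return Math.max(0L, deadlineNanos - getCurrentTimeNanos());
    }
}

/** Hypothetical stand-in for EmbeddedEventLoop with controllable time. */
final class SketchEmbeddedLoop extends SketchScheduledExecutor {
    private long startTime = System.nanoTime(); // origin of the running clock
    private long frozenTimestamp;               // value reported while frozen
    private boolean timeFrozen;

    @Override
    protected long getCurrentTimeNanos() {
        return timeFrozen ? frozenTimestamp : System.nanoTime() - startTime;
    }

    /** Stop the clock at its current value. */
    void freezeTime() {
        if (!timeFrozen) {
            frozenTimestamp = getCurrentTimeNanos();
            timeFrozen = true;
        }
    }

    /** Resume the clock, continuing from the frozen value. */
    void unfreezeTime() {
        if (timeFrozen) {
            startTime = System.nanoTime() - frozenTimestamp;
            timeFrozen = false;
        }
    }

    /** Jump the clock forward by the given number of nanoseconds. */
    void advanceTimeBy(long nanos) {
        if (timeFrozen) {
            frozenTimestamp += nanos;
        } else {
            // Moving the origin back makes the running clock read later.
            startTime -= nanos;
        }
    }
}
```

With time frozen, a test can schedule a 5-second timeout, call advanceTimeBy with 5 seconds' worth of nanoseconds, and observe delayNanos drop to zero without sleeping.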

Result:

Fixes #12433.

Local runs of the ScheduleFutureTaskBenchmark microbenchmark show no evidence of performance impact (before and after are within error bounds of each other):

before:
Benchmark                                                   (num)   Mode  Cnt    Score    Error  Units
ScheduleFutureTaskBenchmark.scheduleCancelLotsOutsideLoop  100000  thrpt   20  132.437 ± 15.116  ops/s
ScheduleFutureTaskBenchmark.scheduleLots                   100000  thrpt   20  694.475 ±  8.184  ops/s
ScheduleFutureTaskBenchmark.scheduleLotsOutsideLoop        100000  thrpt   20   88.037 ±  4.013  ops/s
after:
Benchmark                                                   (num)   Mode  Cnt    Score   Error  Units
ScheduleFutureTaskBenchmark.scheduleCancelLotsOutsideLoop  100000  thrpt   20  149.629 ± 7.514  ops/s
ScheduleFutureTaskBenchmark.scheduleLots                   100000  thrpt   20  688.954 ± 7.831  ops/s
ScheduleFutureTaskBenchmark.scheduleLotsOutsideLoop        100000  thrpt   20   85.426 ± 1.104  ops/s
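
One simple way to read "within error bounds of each other" is to treat each Score ± Error as a plain interval and check that the before and after intervals overlap. The sketch below does that for the three rows above; the class names are hypothetical, and interval overlap is only a rough heuristic, not a formal significance test.

```java
/** A benchmark score with its reported error margin (score ± error). */
final class ScoreInterval {
    final double score;
    final double error;

    ScoreInterval(double score, double error) {
        this.score = score;
        this.error = error;
    }

    double low() {
        return score - error;
    }

    double high() {
        return score + error;
    }

    /** True when the two [low, high] ranges share at least one point. */
    boolean overlaps(ScoreInterval other) {
        return low() <= other.high() && other.low() <= high();
    }
}

final class ErrorBoundsCheck {
    public static void main(String[] args) {
        // {before, after} pairs copied from the tables above.
        ScoreInterval[][] pairs = {
            {new ScoreInterval(132.437, 15.116), new ScoreInterval(149.629, 7.514)}, // scheduleCancelLotsOutsideLoop
            {new ScoreInterval(694.475, 8.184), new ScoreInterval(688.954, 7.831)},  // scheduleLots
            {new ScoreInterval(88.037, 4.013), new ScoreInterval(85.426, 1.104)},    // scheduleLotsOutsideLoop
        };
        for (ScoreInterval[] pair : pairs) {
            System.out.println(pair[0].overlaps(pair[1])); // prints true for each row
        }
    }
}
```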

The new ScheduleFutureTaskDeadlineBenchmark shows some performance degradation:

before:
Benchmark                                             Mode  Cnt         Score        Error  Units
ScheduleFutureTaskDeadlineBenchmark.requestDeadline  thrpt   20  60726336.795 ± 280054.533  ops/s
after:
Benchmark                                             Mode  Cnt         Score        Error  Units
ScheduleFutureTaskDeadlineBenchmark.requestDeadline  thrpt   20  56948231.480 ± 188264.092  ops/s
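
For scale, the relative drop between the two requestDeadline scores can be worked out directly from the numbers above. The helper below is a hypothetical convenience, not part of the benchmark code.

```java
final class RelativeDrop {
    /** Fractional decrease from before to after, e.g. 0.05 for a 5% drop. */
    static double relativeDrop(double before, double after) {
        return (before - after) / before;
    }

    public static void main(String[] args) {
        double drop = relativeDrop(60726336.795, 56948231.480);
        // Print the drop as a percentage of the "before" throughput.
        System.out.println(drop * 100.0);
    }
}
```

This works out to roughly 6.2% lower throughput after the change, and the two error ranges do not overlap.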

The difference is small, but it's there, so I investigated this further using jitwatch. Looking at the generated assembly, the call to getCurrentTimeNanos is devirtualized and inlined in the absence of EmbeddedEventLoop, so the code is mostly identical. However there is the added getfield and checkcast for the executor, which probably explains the discrepancy.
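
The extra work at the call site can be pictured with a simplified sketch. It assumes the task holds its executor under a wider type and casts it before asking for the time; CallSiteExecutor and CallSiteTask are hypothetical stand-ins, not the Netty classes, and the real field types and method bodies may differ.

```java
/** Hypothetical stand-in for AbstractScheduledEventExecutor. */
class CallSiteExecutor {
    private static final long START = System.nanoTime();

    /** Old style: one process-wide clock, reachable without an instance. */
    static long nanoTime() {
        return System.nanoTime() - START;
    }

    /** New style: overridable per executor; defaults to the old clock. */
    protected long getCurrentTimeNanos() {
        return nanoTime();
    }
}

/** Hypothetical stand-in for ScheduledFutureTask. */
final class CallSiteTask {
    private final Object executor;   // assumption: stored under a wider type
    private final long deadlineNanos;

    CallSiteTask(Object executor, long deadlineNanos) {
        this.executor = executor;
        this.deadlineNanos = deadlineNanos;
    }

    /** Before: a static call, no executor involved. */
    long delayNanosBefore() {
        return Math.max(0L, deadlineNanos - CallSiteExecutor.nanoTime());
    }

    /** After: field load (getfield), cast (checkcast), then an instance call. */
    long delayNanosAfter() {
        CallSiteExecutor e = (CallSiteExecutor) executor;
        return Math.max(0L, deadlineNanos - e.getCurrentTimeNanos());
    }
}
```

If the JIT can prove that only one implementation of getCurrentTimeNanos is reachable, the instance call itself can be inlined, leaving the field load and cast as the remaining difference between the two variants.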

In my opinion this is acceptable, because the performance impact is not severe, this use is likely the worst case (virtual call through scheduledExecutorService()), and it is never as hot as in this benchmark.

Note that if an EmbeddedEventLoop is present in the application, the performance impact is likely substantially higher, because this would necessitate a virtual call. However this is not an issue for production applications, and the affected code is still not very hot.


chrisvest · 2022-06-10T20:17:39Z

> Note that if an EmbeddedEventLoop is present in the application, the performance impact is likely substantially higher, because this would necessitate a virtual call. However this is not an issue for production applications, and the affected code is still not very hot.

The EmbeddedChannel is used in the compression codecs. However, the morphism is a property of the individual call-sites, so if every call-site ends up inlining and monomorphising this call, then it still won't add any virtual call overhead.

chrisvest

One small comment, but this is good work.

common/src/main/java/io/netty/util/concurrent/ScheduledFutureTask.java

yawkat · 2022-06-10T20:32:51Z

Hmm, I think I read somewhere once that the JVM can inline a method as-is if there is no subclass that overrides it. If such a subclass is present, but the call site is monomorphic, it can still inline but needs to add a trap (I assume the compression codecs don't actually schedule anything on the embedded loop?). I can take another look on Monday.

chrisvest · 2022-06-10T20:40:20Z

Yes, that matches my understanding as well.

normanmaurer · 2022-06-13T06:45:09Z

@yawkat thanks a lot !

yawkat · 2022-06-13T08:37:08Z

I've run the benchmark again with a new EmbeddedChannel(); call before the trials on the same JVM, and the difference is about 1%. I assume any trap required for the monomorphic call inlining is merged with the cast check from scheduledExecutorService() or something similar. But since the difference is so tiny, I won't investigate the assembly.


yawkat added 3 commits June 10, 2022 12:40

embedded time changes

ece7f82

deadline benchmark

98277fc

checkstyle

ff5aefb

normanmaurer requested changes Jun 10, 2022

transport/src/main/java/io/netty/channel/embedded/EmbeddedChannel.java

transport/src/main/java/io/netty/channel/embedded/EmbeddedEventLoop.java (outdated)

yawkat and others added 3 commits June 10, 2022 15:49

javadoc

bd3d6fd

Update transport/src/main/java/io/netty/channel/embedded/EmbeddedEven…

5003827

…tLoop.java Co-authored-by: Norman Maurer <norman_maurer@apple.com>

missing brace from web merge

c33fdf1

normanmaurer requested review from chrisvest and trustin June 10, 2022 16:07

chrisvest approved these changes Jun 10, 2022

common/src/main/java/io/netty/util/concurrent/ScheduledFutureTask.java (outdated)

yawkat added 2 commits June 10, 2022 22:33

move deadlineNanos

2080c99

don't use static qualifier

cf17549

Kvicii approved these changes Jun 11, 2022


normanmaurer added this to the 4.1.78.Final milestone Jun 13, 2022

normanmaurer merged commit c18fc2b into netty:4.1 Jun 13, 2022

yawkat mentioned this pull request Jan 30, 2023

Expose AbstractScheduledEventExecutor.getCurrentTimeNanos #12827

Open

mpeddada1 mentioned this pull request Feb 1, 2023

fix(java): initialize netty-shaded at run-time and add reflection configurations for netty classes googleapis/sdk-platform-java#1290

Merged
