Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow controlling time flow for EmbeddedEventLoop #12459

Merged
merged 8 commits into from Jun 13, 2022

Conversation

yawkat
Copy link
Contributor

@yawkat yawkat commented Jun 10, 2022

Motivation:

Tests using EmbeddedEventLoop can run faster if they can "advance time" so that scheduled tasks (e.g. timeouts) run earlier. Additionally, "freeze time" functionality can improve reliability of such tests.

Modification:

  • Introduce a protected method AbstractScheduledEventExecutor.getCurrentTimeNanos that replaces the previous static nanoTime method (now deprecated). Replace usages of nanoTime with the new method.
  • Override getCurrentTimeNanos with the new time control (freeze, unfreeze, advanceBy) features in EmbeddedEventLoop.
  • Add a microbenchmark that tests one of the sites that seemed most likely to see negative performance impact by the change (ScheduledFutureTask.delayNanos).

Result:

Fixes #12433.

Local runs of the ScheduleFutureTaskBenchmark microbenchmark shows no evidence for performance impact (within error bounds of each other):

before:
Benchmark                                                   (num)   Mode  Cnt    Score    Error  Units
ScheduleFutureTaskBenchmark.scheduleCancelLotsOutsideLoop  100000  thrpt   20  132.437 ± 15.116  ops/s
ScheduleFutureTaskBenchmark.scheduleLots                   100000  thrpt   20  694.475 ±  8.184  ops/s
ScheduleFutureTaskBenchmark.scheduleLotsOutsideLoop        100000  thrpt   20   88.037 ±  4.013  ops/s
after:
Benchmark                                                   (num)   Mode  Cnt    Score   Error  Units
ScheduleFutureTaskBenchmark.scheduleCancelLotsOutsideLoop  100000  thrpt   20  149.629 ± 7.514  ops/s
ScheduleFutureTaskBenchmark.scheduleLots                   100000  thrpt   20  688.954 ± 7.831  ops/s
ScheduleFutureTaskBenchmark.scheduleLotsOutsideLoop        100000  thrpt   20   85.426 ± 1.104  ops/s

The new ScheduleFutureTaskDeadlineBenchmark shows some performance degradation:

before:
Benchmark                                             Mode  Cnt         Score        Error  Units
ScheduleFutureTaskDeadlineBenchmark.requestDeadline  thrpt   20  60726336.795 ± 280054.533  ops/s
after:
Benchmark                                             Mode  Cnt         Score        Error  Units
ScheduleFutureTaskDeadlineBenchmark.requestDeadline  thrpt   20  56948231.480 ± 188264.092  ops/s

The difference is small, but it's there, so I investigated this further using jitwatch. Looking at the generated assembly, the call to getCurrentTimeNanos is devirtualized and inlined in the absence of EmbeddedEventLoop, so the code is mostly identical. However there is the added getfield and checkcast for the executor, which probably explains the discrepancy.

In my opinion this is acceptable, because the performance impact is not severe, this use is likely the worst case (virtual call through scheduledExecutorService()), and it is never as hot as in this benchmark.

Note that if an EmbeddedEventLoop is present in the application, the performance impact is likely substantially higher, because this would necessitate a virtual call. However this is not an issue for production applications, and the affected code is still not very hot.

@chrisvest
Copy link
Contributor

Note that if an EmbeddedEventLoop is present in the application, the performance impact is likely substantially higher, because this would necessitate a virtual call. However this is not an issue for production applications, and the affected code is still not very hot.

The EmbeddedChannel is used in the compression codecs. However, the morphism is a property of the individual call-sites, so if every call-site ends up inlining and monomorphise this call, then it still won't add any virtual call overhead.

Copy link
Contributor

@chrisvest chrisvest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment, but this is good work.

@yawkat
Copy link
Contributor Author

yawkat commented Jun 10, 2022

Hmm, I think I read somewhere once that the JVM can inline a method as-is if there is no subclass that overrides it. If such a subclass is present, but the call site is monomorphic, it can still inline but needs to add a trap (I assume the compression codecs don't actually schedule anything on the embedded loop?). I can take another look on Monday.

@chrisvest
Copy link
Contributor

Yes, that matches my understanding as well.

@normanmaurer normanmaurer added this to the 4.1.78.Final milestone Jun 13, 2022
@normanmaurer normanmaurer merged commit c18fc2b into netty:4.1 Jun 13, 2022
@normanmaurer
Copy link
Member

@yawkat thanks a lot !

@yawkat
Copy link
Contributor Author

yawkat commented Jun 13, 2022

ive run the benchmark again with a new EmbeddedChannel(); before the trials on the same jvm, and the difference is about 1%. I assume any trap required for the monomorphic call inlining is merged with the cast check from scheduledExecutorService() or something. But since the difference is so tiny I won't investigate the assembly.

normanmaurer added a commit that referenced this pull request Jun 13, 2022
Motivation:

Tests using EmbeddedEventLoop can run faster if they can "advance time" so that scheduled tasks (e.g. timeouts) run earlier. Additionally, "freeze time" functionality can improve reliability of such tests.

Modification:

- Introduce a protected method `AbstractScheduledEventExecutor.getCurrentTimeNanos` that replaces the previous static `nanoTime` method (now deprecated). Replace usages of `nanoTime` with the new method.
- Override `getCurrentTimeNanos` with the new time control (freeze, unfreeze, advanceBy) features in `EmbeddedEventLoop`.
- Add a microbenchmark that tests one of the sites that seemed most likely to see negative performance impact by the change (`ScheduledFutureTask.delayNanos`).

Result:

Fixes  #12433.

Local runs of the `ScheduleFutureTaskBenchmark` microbenchmark shows no evidence for performance impact (within error bounds of each other):

```
before:
Benchmark                                                   (num)   Mode  Cnt    Score    Error  Units
ScheduleFutureTaskBenchmark.scheduleCancelLotsOutsideLoop  100000  thrpt   20  132.437 ± 15.116  ops/s
ScheduleFutureTaskBenchmark.scheduleLots                   100000  thrpt   20  694.475 ±  8.184  ops/s
ScheduleFutureTaskBenchmark.scheduleLotsOutsideLoop        100000  thrpt   20   88.037 ±  4.013  ops/s
after:
Benchmark                                                   (num)   Mode  Cnt    Score   Error  Units
ScheduleFutureTaskBenchmark.scheduleCancelLotsOutsideLoop  100000  thrpt   20  149.629 ± 7.514  ops/s
ScheduleFutureTaskBenchmark.scheduleLots                   100000  thrpt   20  688.954 ± 7.831  ops/s
ScheduleFutureTaskBenchmark.scheduleLotsOutsideLoop        100000  thrpt   20   85.426 ± 1.104  ops/s
```

The new `ScheduleFutureTaskDeadlineBenchmark` shows some performance degradation:

```
before:
Benchmark                                             Mode  Cnt         Score        Error  Units
ScheduleFutureTaskDeadlineBenchmark.requestDeadline  thrpt   20  60726336.795 ± 280054.533  ops/s
after:
Benchmark                                             Mode  Cnt         Score        Error  Units
ScheduleFutureTaskDeadlineBenchmark.requestDeadline  thrpt   20  56948231.480 ± 188264.092  ops/s
```

The difference is small, but it's there, so I investigated this further using jitwatch. Looking at the generated assembly, the call to `getCurrentTimeNanos` is devirtualized and inlined in the absence of `EmbeddedEventLoop`, so the code is mostly identical. However there is the added getfield and checkcast for the executor, which probably explains the discrepancy.

In my opinion this is acceptable, because the performance impact is not severe, this use is likely the worst case (virtual call through `scheduledExecutorService()`), and it is never as hot as in this benchmark.

Note that if an `EmbeddedEventLoop` is present in the application, the performance impact is likely substantially higher, because this would necessitate a virtual call. However this is not an issue for production applications, and the affected code is still not very hot.

Co-authored-by: Norman Maurer <norman_maurer@apple.com>
normanmaurer added a commit that referenced this pull request Jun 13, 2022
Motivation:

Tests using EmbeddedEventLoop can run faster if they can "advance time" so that scheduled tasks (e.g. timeouts) run earlier. Additionally, "freeze time" functionality can improve reliability of such tests.

Modification:

- Introduce a protected method `AbstractScheduledEventExecutor.getCurrentTimeNanos` that replaces the previous static `nanoTime` method (now deprecated). Replace usages of `nanoTime` with the new method.
- Override `getCurrentTimeNanos` with the new time control (freeze, unfreeze, advanceBy) features in `EmbeddedEventLoop`.
- Add a microbenchmark that tests one of the sites that seemed most likely to see negative performance impact by the change (`ScheduledFutureTask.delayNanos`).

Result:

Fixes  #12433.

Local runs of the `ScheduleFutureTaskBenchmark` microbenchmark shows no evidence for performance impact (within error bounds of each other):

```
before:
Benchmark                                                   (num)   Mode  Cnt    Score    Error  Units
ScheduleFutureTaskBenchmark.scheduleCancelLotsOutsideLoop  100000  thrpt   20  132.437 ± 15.116  ops/s
ScheduleFutureTaskBenchmark.scheduleLots                   100000  thrpt   20  694.475 ±  8.184  ops/s
ScheduleFutureTaskBenchmark.scheduleLotsOutsideLoop        100000  thrpt   20   88.037 ±  4.013  ops/s
after:
Benchmark                                                   (num)   Mode  Cnt    Score   Error  Units
ScheduleFutureTaskBenchmark.scheduleCancelLotsOutsideLoop  100000  thrpt   20  149.629 ± 7.514  ops/s
ScheduleFutureTaskBenchmark.scheduleLots                   100000  thrpt   20  688.954 ± 7.831  ops/s
ScheduleFutureTaskBenchmark.scheduleLotsOutsideLoop        100000  thrpt   20   85.426 ± 1.104  ops/s
```

The new `ScheduleFutureTaskDeadlineBenchmark` shows some performance degradation:

```
before:
Benchmark                                             Mode  Cnt         Score        Error  Units
ScheduleFutureTaskDeadlineBenchmark.requestDeadline  thrpt   20  60726336.795 ± 280054.533  ops/s
after:
Benchmark                                             Mode  Cnt         Score        Error  Units
ScheduleFutureTaskDeadlineBenchmark.requestDeadline  thrpt   20  56948231.480 ± 188264.092  ops/s
```

The difference is small, but it's there, so I investigated this further using jitwatch. Looking at the generated assembly, the call to `getCurrentTimeNanos` is devirtualized and inlined in the absence of `EmbeddedEventLoop`, so the code is mostly identical. However there is the added getfield and checkcast for the executor, which probably explains the discrepancy.

In my opinion this is acceptable, because the performance impact is not severe, this use is likely the worst case (virtual call through `scheduledExecutorService()`), and it is never as hot as in this benchmark.

Note that if an `EmbeddedEventLoop` is present in the application, the performance impact is likely substantially higher, because this would necessitate a virtual call. However this is not an issue for production applications, and the affected code is still not very hot.

Co-authored-by: Norman Maurer <norman_maurer@apple.com>
Co-authored-by: Jonas Konrad <me@yawk.at>
raidyue pushed a commit to raidyue/netty that referenced this pull request Jul 8, 2022
Motivation:

Tests using EmbeddedEventLoop can run faster if they can "advance time" so that scheduled tasks (e.g. timeouts) run earlier. Additionally, "freeze time" functionality can improve reliability of such tests.

Modification:

- Introduce a protected method `AbstractScheduledEventExecutor.getCurrentTimeNanos` that replaces the previous static `nanoTime` method (now deprecated). Replace usages of `nanoTime` with the new method.
- Override `getCurrentTimeNanos` with the new time control (freeze, unfreeze, advanceBy) features in `EmbeddedEventLoop`.
- Add a microbenchmark that tests one of the sites that seemed most likely to see negative performance impact by the change (`ScheduledFutureTask.delayNanos`).

Result:

Fixes  netty#12433. 

Local runs of the `ScheduleFutureTaskBenchmark` microbenchmark shows no evidence for performance impact (within error bounds of each other):

```
before:
Benchmark                                                   (num)   Mode  Cnt    Score    Error  Units
ScheduleFutureTaskBenchmark.scheduleCancelLotsOutsideLoop  100000  thrpt   20  132.437 ± 15.116  ops/s
ScheduleFutureTaskBenchmark.scheduleLots                   100000  thrpt   20  694.475 ±  8.184  ops/s
ScheduleFutureTaskBenchmark.scheduleLotsOutsideLoop        100000  thrpt   20   88.037 ±  4.013  ops/s
after:
Benchmark                                                   (num)   Mode  Cnt    Score   Error  Units
ScheduleFutureTaskBenchmark.scheduleCancelLotsOutsideLoop  100000  thrpt   20  149.629 ± 7.514  ops/s
ScheduleFutureTaskBenchmark.scheduleLots                   100000  thrpt   20  688.954 ± 7.831  ops/s
ScheduleFutureTaskBenchmark.scheduleLotsOutsideLoop        100000  thrpt   20   85.426 ± 1.104  ops/s
```

The new `ScheduleFutureTaskDeadlineBenchmark` shows some performance degradation:

```
before:
Benchmark                                             Mode  Cnt         Score        Error  Units
ScheduleFutureTaskDeadlineBenchmark.requestDeadline  thrpt   20  60726336.795 ± 280054.533  ops/s
after:
Benchmark                                             Mode  Cnt         Score        Error  Units
ScheduleFutureTaskDeadlineBenchmark.requestDeadline  thrpt   20  56948231.480 ± 188264.092  ops/s
```

The difference is small, but it's there, so I investigated this further using jitwatch. Looking at the generated assembly, the call to `getCurrentTimeNanos` is devirtualized and inlined in the absence of `EmbeddedEventLoop`, so the code is mostly identical. However there is the added getfield and checkcast for the executor, which probably explains the discrepancy.

In my opinion this is acceptable, because the performance impact is not severe, this use is likely the worst case (virtual call through `scheduledExecutorService()`), and it is never as hot as in this benchmark.

Note that if an `EmbeddedEventLoop` is present in the application, the performance impact is likely substantially higher, because this would necessitate a virtual call. However this is not an issue for production applications, and the affected code is still not very hot.

Co-authored-by: Norman Maurer <norman_maurer@apple.com>
franz1981 pushed a commit to franz1981/netty that referenced this pull request Aug 22, 2022
Motivation:

Tests using EmbeddedEventLoop can run faster if they can "advance time" so that scheduled tasks (e.g. timeouts) run earlier. Additionally, "freeze time" functionality can improve reliability of such tests.

Modification:

- Introduce a protected method `AbstractScheduledEventExecutor.getCurrentTimeNanos` that replaces the previous static `nanoTime` method (now deprecated). Replace usages of `nanoTime` with the new method.
- Override `getCurrentTimeNanos` with the new time control (freeze, unfreeze, advanceBy) features in `EmbeddedEventLoop`.
- Add a microbenchmark that tests one of the sites that seemed most likely to see negative performance impact by the change (`ScheduledFutureTask.delayNanos`).

Result:

Fixes  netty#12433. 

Local runs of the `ScheduleFutureTaskBenchmark` microbenchmark shows no evidence for performance impact (within error bounds of each other):

```
before:
Benchmark                                                   (num)   Mode  Cnt    Score    Error  Units
ScheduleFutureTaskBenchmark.scheduleCancelLotsOutsideLoop  100000  thrpt   20  132.437 ± 15.116  ops/s
ScheduleFutureTaskBenchmark.scheduleLots                   100000  thrpt   20  694.475 ±  8.184  ops/s
ScheduleFutureTaskBenchmark.scheduleLotsOutsideLoop        100000  thrpt   20   88.037 ±  4.013  ops/s
after:
Benchmark                                                   (num)   Mode  Cnt    Score   Error  Units
ScheduleFutureTaskBenchmark.scheduleCancelLotsOutsideLoop  100000  thrpt   20  149.629 ± 7.514  ops/s
ScheduleFutureTaskBenchmark.scheduleLots                   100000  thrpt   20  688.954 ± 7.831  ops/s
ScheduleFutureTaskBenchmark.scheduleLotsOutsideLoop        100000  thrpt   20   85.426 ± 1.104  ops/s
```

The new `ScheduleFutureTaskDeadlineBenchmark` shows some performance degradation:

```
before:
Benchmark                                             Mode  Cnt         Score        Error  Units
ScheduleFutureTaskDeadlineBenchmark.requestDeadline  thrpt   20  60726336.795 ± 280054.533  ops/s
after:
Benchmark                                             Mode  Cnt         Score        Error  Units
ScheduleFutureTaskDeadlineBenchmark.requestDeadline  thrpt   20  56948231.480 ± 188264.092  ops/s
```

The difference is small, but it's there, so I investigated this further using jitwatch. Looking at the generated assembly, the call to `getCurrentTimeNanos` is devirtualized and inlined in the absence of `EmbeddedEventLoop`, so the code is mostly identical. However there is the added getfield and checkcast for the executor, which probably explains the discrepancy.

In my opinion this is acceptable, because the performance impact is not severe, this use is likely the worst case (virtual call through `scheduledExecutorService()`), and it is never as hot as in this benchmark.

Note that if an `EmbeddedEventLoop` is present in the application, the performance impact is likely substantially higher, because this would necessitate a virtual call. However this is not an issue for production applications, and the affected code is still not very hot.

Co-authored-by: Norman Maurer <norman_maurer@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow EmbeddedChannel users to control the scheduler time
4 participants