
Performance degradation due to PR #4383 to reduce no-op wakeups in multi-threaded scheduler #4873

Open
manaswini05 opened this issue Jul 28, 2022 · 7 comments
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-runtime Module: tokio/runtime

Comments

@manaswini05

Version
List the versions of all tokio crates you are using. The easiest way to get this information is with the cargo tree subcommand:

cargo tree | grep tokio

Platform
The output of uname -a (UNIX), or version and 32 or 64-bit (Windows)

Description

Our production service was using tokio v1.13.1 and we decided to upgrade it. Upon upgrading to v1.16.1, we observed a significant performance degradation: our response times increased and CPU usage dropped. For example, while running stability tests, our baseline CPU usage (with v1.13.1) was around 92%, while the CPU usage with v1.16.1 dropped to 68%. Note that this service is accessed from multiple other services and the same traffic is sent to both instances.

Looking at the changes in v1.16.1, we created a custom branch that reverts PR #4383 (#4383), which was introduced in this version, and compared CPU usage with and without the PR. CPU usage on the instance with PR #4383 reverted returned to normal, while the other instance showed the CPU usage drop described above. The traffic sent to both instances is the same; the only difference is whether the PR is reverted.

For now, we have forked the tokio repo and are using a custom branch with PR #4383 reverted.

Please let me know if any more details need to be provided.

@manaswini05 manaswini05 added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Jul 28, 2022
@Darksonn Darksonn added the M-runtime Module: tokio/runtime label Jul 28, 2022
@carllerche
Member

Thanks for the report. To date, all reports of performance degradation after the referenced patch landed have been due to bugs in the application code. If your application (or a library) somehow incorrectly relied on those wakeups, it would translate to a slowdown. On my end, I have not observed any real-world application impacted negatively by this patch.

That said, it is obviously possible that a bug was introduced into Tokio, but for us to investigate we would need some sort of reproduction, as we have not witnessed this behavior ourselves.

@carllerche
Member

More specifically, what I have seen is performance degradation when the application blocks the runtime too much. I would try using something like tokio-metrics, follow the steps there, and report back.
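
For reference, a minimal sketch of that kind of instrumentation with tokio-metrics' TaskMonitor (do_work here is a hypothetical stand-in for one of the application's tasks); if the slow-poll counters keep growing, the instrumented tasks are likely blocking the runtime:

```rust
use std::time::Duration;
use tokio_metrics::TaskMonitor;

#[tokio::main]
async fn main() {
    let monitor = TaskMonitor::new();

    // Periodically print task metrics; the slow-poll counters
    // (total_slow_poll_count / total_slow_poll_duration) grow when
    // instrumented tasks hold a worker thread for too long per poll.
    {
        let monitor = monitor.clone();
        tokio::spawn(async move {
            for metrics in monitor.intervals() {
                println!("{:?}", metrics);
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        });
    }

    // Instrument the futures under suspicion.
    monitor.instrument(do_work()).await;
}

// Hypothetical placeholder for the application's real workload.
async fn do_work() {
    tokio::time::sleep(Duration::from_millis(100)).await;
}
```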

@Noah-Kennedy
Contributor

@manaswini05 Can you get a benchmark program hacked up that reproduces this perf regression? That might make analysis easier.

@fabioberger

fabioberger commented Aug 4, 2022

Here is a case where, instead of performance degradation, including #4383 causes the RPC service to eventually stall completely. It could be an application logic bug, but it's not clear how they might be using wakeups incorrectly: solana-labs/solana#24644 (relevant source code: https://github.com/solana-labs/solana/blob/master/rpc/src/rpc_service.rs#L385).

@manaswini05
Author

@carllerche @Noah-Kennedy Thanks for getting back, and sorry for the late reply! I am trying to integrate tokio-metrics into our system and will report back here once I have more stats.

@Noah-Kennedy
Contributor

@manaswini05 If you find that this isn't an application bug, it would also be great if you could make a small, minimal demo program that shows the issue in the same manner as your real codebase. It would give us something concrete to investigate.

@manaswini05
Author

Hey guys! I was unable to recreate this in a smaller demo program, but I integrated tokio-metrics into our real codebase to collect runtime metrics. This was done using the RuntimeMonitor, with the metrics printed at 60-minute intervals. During testing, the traffic sent to both versions was the same.
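
Roughly, the integration looks like the sketch below (simplified; run_service is a placeholder for our actual workload, and collecting these runtime metrics requires tokio-metrics' rt feature plus building with --cfg tokio_unstable):

```rust
use std::time::Duration;
use tokio_metrics::RuntimeMonitor;

// Requires the tokio-metrics `rt` feature and building with
// RUSTFLAGS="--cfg tokio_unstable".
#[tokio::main]
async fn main() {
    let handle = tokio::runtime::Handle::current();
    let runtime_monitor = RuntimeMonitor::new(&handle);

    // Dump one RuntimeMetrics sample every 60 minutes while the
    // service handles its normal traffic.
    tokio::spawn(async move {
        for metrics in runtime_monitor.intervals() {
            println!("{:?}", metrics);
            tokio::time::sleep(Duration::from_secs(60 * 60)).await;
        }
    });

    run_service().await;
}

// Placeholder for the real service workload.
async fn run_service() {
    std::future::pending::<()>().await;
}
```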

Runtime Metrics with Tokio 1.20.1 - PR #4383 reverted:

RuntimeMetrics { workers_count: 16, total_park_count: 1124623, max_park_count: 153165, min_park_count: 0, total_noop_count: 563934, max_noop_count: 76738, min_noop_count: 0, total_steal_count: 2615, max_steal_count: 362, min_steal_count: 0, num_remote_schedules: 112, total_local_schedule_count: 1684244, max_local_schedule_count: 229817, min_local_schedule_count: 0, total_overflow_count: 0, max_overflow_count: 0, min_overflow_count: 0, total_polls_count: 1684349, max_polls_count: 229840, min_polls_count: 0, total_busy_duration: 3062.260306526s, max_busy_duration: 413.120997135s, min_busy_duration: 0ns, injection_queue_depth: 0, total_local_queue_depth: 0, max_local_queue_depth: 0, min_local_queue_depth: 0, elapsed: 3600.001551072s }

RuntimeMetrics { workers_count: 16, total_park_count: 719978, max_park_count: 97825, min_park_count: 0, total_noop_count: 361242, max_noop_count: 49046, min_noop_count: 0, total_steal_count: 1613, max_steal_count: 226, min_steal_count: 0, num_remote_schedules: 120, total_local_schedule_count: 1075970, max_local_schedule_count: 146751, min_local_schedule_count: 0, total_overflow_count: 0, max_overflow_count: 0, min_overflow_count: 0, total_polls_count: 1076091, max_polls_count: 146746, min_polls_count: 0, total_busy_duration: 1881.134028985s, max_busy_duration: 259.592728045s, min_busy_duration: 0ns, injection_queue_depth: 0, total_local_queue_depth: 0, max_local_queue_depth: 0, min_local_queue_depth: 0, elapsed: 3600.001619761s }

Runtime Metrics with the official Tokio 1.20.1:

RuntimeMetrics { workers_count: 16, total_park_count: 475329, max_park_count: 61497, min_park_count: 0, total_noop_count: 59113, max_noop_count: 7671, min_noop_count: 0, total_steal_count: 57711, max_steal_count: 7503, min_steal_count: 0, num_remote_schedules: 111, total_local_schedule_count: 1255088, max_local_schedule_count: 162645, min_local_schedule_count: 0, total_overflow_count: 0, max_overflow_count: 0, min_overflow_count: 0, total_polls_count: 1255193, max_polls_count: 162741, min_polls_count: 0, total_busy_duration: 2045.430793874s, max_busy_duration: 262.234093234s, min_busy_duration: 0ns, injection_queue_depth: 0, total_local_queue_depth: 0, max_local_queue_depth: 0, min_local_queue_depth: 0, elapsed: 3600.002078951s }

RuntimeMetrics { workers_count: 16, total_park_count: 325214, max_park_count: 44887, min_park_count: 0, total_noop_count: 38368, max_noop_count: 5292, min_noop_count: 0, total_steal_count: 36853, max_steal_count: 4984, min_steal_count: 0, num_remote_schedules: 120, total_local_schedule_count: 863549, max_local_schedule_count: 119858, min_local_schedule_count: 0, total_overflow_count: 0, max_overflow_count: 0, min_overflow_count: 0, total_polls_count: 863667, max_polls_count: 119659, min_polls_count: 0, total_busy_duration: 1314.888131315s, max_busy_duration: 189.353310723s, min_busy_duration: 0ns, injection_queue_depth: 0, total_local_queue_depth: 0, max_local_queue_depth: 0, min_local_queue_depth: 0, elapsed: 3600.003734898s }
