
Performance degradation due to PR #4383 to reduce no-op wakeups in multi-threaded scheduler #4873

Open
manaswini05 opened this issue Jul 28, 2022 · 7 comments
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-runtime Module: tokio/runtime

Comments

@manaswini05

Version
List the versions of all tokio crates you are using. The easiest way to get this information is with the cargo tree subcommand:

cargo tree | grep tokio

Platform
The output of uname -a (UNIX), or version and 32 or 64-bit (Windows)

Description

Our production service was using tokio v1.13.1 and we decided to upgrade it. Upon upgrading to v1.16.1, we observed a significant performance degradation: our response times increased and CPU usage dropped. For example, while running stability tests, our baseline CPU usage (with v1.13.1) was around 92%, while the CPU usage with v1.16.1 dropped to 68%. Note that this service is accessed from multiple other services and the same traffic is sent to both instances.

Looking at the changes in v1.16.1, we created a custom branch that reverts PR #4383 (#4383), which was introduced in this version, and compared CPU usage with and without the PR. CPU usage on the instance with PR #4383 reverted returned to normal, while the other instance showed the CPU usage drop described above. The traffic sent to both instances is the same; the only difference is whether the PR is reverted.

For now, we have forked the tokio repo and are using a custom branch with PR #4383 reverted.

Please let me know if any more details need to be provided.

@manaswini05 manaswini05 added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Jul 28, 2022
@Darksonn Darksonn added the M-runtime Module: tokio/runtime label Jul 28, 2022
@carllerche
Member

Thanks for the report. To date, all reports of performance degradation after the referenced patch landed have been due to bugs in the application code. If your application (or a library) somehow incorrectly relied on those wakeups, it would translate to a slowdown. On my end, I have not observed any real-world application impacted negatively by this patch.

That said, it is obviously possible that a bug was introduced into Tokio, but for us to investigate we would need some sort of reproduction, as we have not witnessed this behavior ourselves.

@carllerche
Member

More specifically, what I have seen is performance degradation when the application blocks the runtime too much. I would try using something like tokio-metrics, follow the steps there, and report back.
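
For reference, a minimal sketch of that kind of instrumentation with tokio-metrics' TaskMonitor (do_work here is a hypothetical stand-in for one of the application's tasks); if the slow-poll counters keep growing, the instrumented tasks are likely blocking the runtime:

```rust
use std::time::Duration;
use tokio_metrics::TaskMonitor;

#[tokio::main]
async fn main() {
    let monitor = TaskMonitor::new();

    // Periodically print task metrics; the slow-poll counters
    // (total_slow_poll_count / total_slow_poll_duration) grow when
    // instrumented tasks hold a worker thread for too long per poll.
    {
        let monitor = monitor.clone();
        tokio::spawn(async move {
            for metrics in monitor.intervals() {
                println!("{:?}", metrics);
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        });
    }

    // Instrument the futures under suspicion.
    monitor.instrument(do_work()).await;
}

// Hypothetical placeholder for the application's real workload.
async fn do_work() {
    tokio::time::sleep(Duration::from_millis(100)).await;
}
```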

@Noah-Kennedy
Contributor

@manaswini05 Can you get a benchmark program hacked up that reproduces this perf regression? That might make analysis easier.

@fabioberger

fabioberger commented Aug 4, 2022

Here is a case where, instead of performance degradation, including #4383 causes the RPC service to eventually stall completely. It could be an application logic bug, but it's not clear how they might be using wakeups incorrectly: solana-labs/solana#24644 (relevant source code: https://github.com/solana-labs/solana/blob/master/rpc/src/rpc_service.rs#L385).

@manaswini05
Author

@carllerche @Noah-Kennedy Thanks for getting back, and sorry for the late reply! I am trying to integrate tokio-metrics into our system and will report back here once I have more stats.

@Noah-Kennedy
Contributor

@manaswini05 If you find that this isn't an application bug, it would also be great if you could make a small, minimal demo program that shows the issue in the same manner as your real codebase. It would give us something concrete to investigate.

@manaswini05
Author

Hey guys! I was unable to recreate this in a smaller demo program, but I integrated tokio-metrics into our real codebase to collect runtime metrics. This was done using the RuntimeMonitor, with the metrics printed at 60-minute intervals. During testing, the traffic sent to both versions was the same.
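
Roughly, the integration looks like the sketch below (simplified; run_service is a placeholder for our actual workload, and collecting these runtime metrics requires tokio-metrics' rt feature plus building with --cfg tokio_unstable):

```rust
use std::time::Duration;
use tokio_metrics::RuntimeMonitor;

// Requires the tokio-metrics `rt` feature and building with
// RUSTFLAGS="--cfg tokio_unstable".
#[tokio::main]
async fn main() {
    let handle = tokio::runtime::Handle::current();
    let runtime_monitor = RuntimeMonitor::new(&handle);

    // Dump one RuntimeMetrics sample every 60 minutes while the
    // service handles its normal traffic.
    tokio::spawn(async move {
        for metrics in runtime_monitor.intervals() {
            println!("{:?}", metrics);
            tokio::time::sleep(Duration::from_secs(60 * 60)).await;
        }
    });

    run_service().await;
}

// Placeholder for the real service workload.
async fn run_service() {
    std::future::pending::<()>().await;
}
```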

Runtime Metrics with Tokio 1.20.1 - PR #4383 reverted:

RuntimeMetrics { workers_count: 16, total_park_count: 1124623, max_park_count: 153165, min_park_count: 0, total_noop_count: 563934, max_noop_count: 76738, min_noop_count: 0, total_steal_count: 2615, max_steal_count: 362, min_steal_count: 0, num_remote_schedules: 112, total_local_schedule_count: 1684244, max_local_schedule_count: 229817, min_local_schedule_count: 0, total_overflow_count: 0, max_overflow_count: 0, min_overflow_count: 0, total_polls_count: 1684349, max_polls_count: 229840, min_polls_count: 0, total_busy_duration: 3062.260306526s, max_busy_duration: 413.120997135s, min_busy_duration: 0ns, injection_queue_depth: 0, total_local_queue_depth: 0, max_local_queue_depth: 0, min_local_queue_depth: 0, elapsed: 3600.001551072s }

RuntimeMetrics { workers_count: 16, total_park_count: 719978, max_park_count: 97825, min_park_count: 0, total_noop_count: 361242, max_noop_count: 49046, min_noop_count: 0, total_steal_count: 1613, max_steal_count: 226, min_steal_count: 0, num_remote_schedules: 120, total_local_schedule_count: 1075970, max_local_schedule_count: 146751, min_local_schedule_count: 0, total_overflow_count: 0, max_overflow_count: 0, min_overflow_count: 0, total_polls_count: 1076091, max_polls_count: 146746, min_polls_count: 0, total_busy_duration: 1881.134028985s, max_busy_duration: 259.592728045s, min_busy_duration: 0ns, injection_queue_depth: 0, total_local_queue_depth: 0, max_local_queue_depth: 0, min_local_queue_depth: 0, elapsed: 3600.001619761s }

Runtime Metrics with the official Tokio 1.20.1:

RuntimeMetrics { workers_count: 16, total_park_count: 475329, max_park_count: 61497, min_park_count: 0, total_noop_count: 59113, max_noop_count: 7671, min_noop_count: 0, total_steal_count: 57711, max_steal_count: 7503, min_steal_count: 0, num_remote_schedules: 111, total_local_schedule_count: 1255088, max_local_schedule_count: 162645, min_local_schedule_count: 0, total_overflow_count: 0, max_overflow_count: 0, min_overflow_count: 0, total_polls_count: 1255193, max_polls_count: 162741, min_polls_count: 0, total_busy_duration: 2045.430793874s, max_busy_duration: 262.234093234s, min_busy_duration: 0ns, injection_queue_depth: 0, total_local_queue_depth: 0, max_local_queue_depth: 0, min_local_queue_depth: 0, elapsed: 3600.002078951s }

RuntimeMetrics { workers_count: 16, total_park_count: 325214, max_park_count: 44887, min_park_count: 0, total_noop_count: 38368, max_noop_count: 5292, min_noop_count: 0, total_steal_count: 36853, max_steal_count: 4984, min_steal_count: 0, num_remote_schedules: 120, total_local_schedule_count: 863549, max_local_schedule_count: 119858, min_local_schedule_count: 0, total_overflow_count: 0, max_overflow_count: 0, min_overflow_count: 0, total_polls_count: 863667, max_polls_count: 119659, min_polls_count: 0, total_busy_duration: 1314.888131315s, max_busy_duration: 189.353310723s, min_busy_duration: 0ns, injection_queue_depth: 0, total_local_queue_depth: 0, max_local_queue_depth: 0, min_local_queue_depth: 0, elapsed: 3600.003734898s }
