Performance degradation due to PR #4383 to reduce no-op wakeups in multi-threaded scheduler #4873
Comments
Thanks for the report. To date, all reports of performance degradation after the referenced patch landed have been due to bugs in the application code. If your application (or library) somehow incorrectly relied on those wakeups, that would translate to a slowdown. On my end, I have not observed any real-world application impacted negatively by this patch. That said, it is obviously possible that a bug was introduced to Tokio, but for us to investigate, we would need some sort of reproduction, as we have not witnessed the behavior you describe.
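To illustrate the kind of application bug being described: a future that returns `Poll::Pending` without arranging for its waker to be called only makes progress if the executor polls it spuriously, so removing no-op wakeups exposes the bug as a stall. The sketch below is a hypothetical, std-only illustration (the `TwoStep` future and `block_on` executor are not from Tokio):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread;

// Hypothetical future that completes on its second poll. The
// `wake_properly` flag controls whether it schedules its own wakeup.
struct TwoStep {
    polled: bool,
    wake_properly: bool,
}

impl Future for TwoStep {
    type Output = &'static str;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.polled {
            return Poll::Ready("done");
        }
        self.polled = true;
        if self.wake_properly {
            // Correct: arrange to be polled again.
            cx.waker().wake_by_ref();
        }
        // The buggy variant (wake_properly: false) returns Pending
        // without a wakeup scheduled, relying on a spurious re-poll
        // that a scheduler is never obligated to provide.
        Poll::Pending
    }
}

// Minimal single-future executor: park the thread until woken, re-poll.
struct ThreadWaker(thread::Thread);
impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

fn block_on<F: Future>(mut fut: F) -> F::Output {
    // Safe in practice here: `fut` is shadowed and never moved again.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(v) => return v,
            Poll::Pending => thread::park(),
        }
    }
}

fn main() {
    // With `wake_properly: false` this would park forever, just as a
    // buggy application stalls once no-op wakeups are removed.
    let out = block_on(TwoStep { polled: false, wake_properly: true });
    println!("{out}");
}
```

An executor that happened to re-poll pending futures (as pre-#4383 Tokio sometimes did via no-op wakeups) would mask the `wake_properly: false` bug, which is why such code can appear to "regress" when the wakeups are removed.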
More specifically, what I have seen is performance degradation when the application blocks the runtime too much. I would try using something like tokio-metrics, following some of the steps there, and report back.
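For anyone following along, a minimal sketch of what instrumenting a suspect task with tokio-metrics might look like (the workload and sampling cadence here are placeholder assumptions, not from the reporter's codebase):

```rust
use std::time::Duration;
use tokio_metrics::TaskMonitor;

#[tokio::main]
async fn main() {
    let monitor = TaskMonitor::new();

    // Instrument the workload whose latency regressed.
    tokio::spawn(monitor.instrument(async {
        loop {
            // ... application work here ...
            tokio::time::sleep(Duration::from_millis(10)).await;
        }
    }));

    // Sample metrics periodically; a high slow-poll count or long mean
    // poll duration suggests the task is blocking the runtime instead
    // of yielding.
    let mut intervals = monitor.intervals();
    for _ in 0..5 {
        tokio::time::sleep(Duration::from_secs(1)).await;
        if let Some(metrics) = intervals.next() {
            println!(
                "polls: {}, mean poll: {:?}, slow polls: {}",
                metrics.total_poll_count,
                metrics.mean_poll_duration(),
                metrics.total_slow_poll_count,
            );
        }
    }
}
```

If the slow-poll numbers are high, moving the blocking work onto `tokio::task::spawn_blocking` is the usual fix.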
@manaswini05 can you get a benchmark program hacked up that reproduces this perf regression? That might make analysis easier.
Here is a case where, instead of performance degradation, including #4383 causes the RPC service to eventually stall completely. It could be an application logic bug, but it's not clear how they might be using wakeups incorrectly: solana-labs/solana#24644 (relevant source code: https://github.com/solana-labs/solana/blob/master/rpc/src/rpc_service.rs#L385).
@carllerche @Noah-Kennedy Thanks for getting back, and sorry for the late reply! I am trying to integrate tokio-metrics into our system and will report back here once I have more stats on this.
@manaswini05 It would also help if you could put together a small, minimal demo program that reproduces this issue in the same manner as your real codebase, should it turn out this isn't an application bug. That would give us something concrete to investigate.
Hey folks! I was unable to recreate the issue in a smaller demo program, but I integrated tokio-metrics into our real codebase to collect runtime metrics. Runtime metrics with Tokio 1.20.1 and PR #4383 reverted:
Runtime metrics with the official Tokio 1.20.1:
Version
List the versions of all `tokio` crates you are using. The easiest way to get this information is with the `cargo tree` subcommand: `cargo tree | grep tokio`

Platform
The output of `uname -a` (UNIX), or version and 32- or 64-bit (Windows)

Description
Our real-life service was using tokio v1.13.1 and we decided to upgrade it. Upon upgrading to v1.16.1, we observed a significant performance degradation: our response times increased and CPU usage dropped. For example, while running stability tests, baseline CPU usage for our service (with v1.13.1) was around 92%, while CPU usage with v1.16.1 dropped to 68%. Note that we access this service from multiple other services, and the same traffic is sent to both instances.
Looking at the changes in v1.16.1, we decided to revert PR #4383, which was introduced in that version, by creating a custom branch. We then compared CPU usage with and without the PR. CPU usage on the instance with PR #4383 reverted returned to normal, while the other instance showed the same drop described above. The traffic sent to both instances is identical; the only difference is whether the PR is reverted.
For now, we have forked the tokio repo and are using a custom branch with PR #4383 reverted.
Please let me know if any more details need to be provided.