Dynamically determine the amount of work that can be performed in an endpoint iteration #1133
Conversation
Force-pushed from c52eb0e to 14237c5.
The idea here makes sense to me!
Please squash these commits together, since we usually introduce new code together with the changes that use it; and please consider shortening the first line of the commit message a bit.
quinn/src/work_limiter.rs
Outdated
```rust
#[derive(Debug)]
pub struct WorkLimiter {
    /// Whether to measure the required work time, or to use the previous estimates
    mode: LimiterMode,
```
Let's make these field names a bit more concise, similar to most of the code. `LimiterMode` can just be called `Mode` (it's private anyway), and we can cut `work_item` from all field names. I think we can also do without the `_nanos` suffix, keeping it as documentation only.
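To make the suggestion concrete, here is a hypothetical before/after. Only the `mode: LimiterMode` field appears in the excerpt above; the remaining field names are invented for illustration:

```rust
// Before: long names with an embedded unit suffix.
mod before {
    pub enum LimiterMode { Measure, HistoricData }
    pub struct WorkLimiter {
        /// Whether to measure the required work time, or to use the previous estimates
        pub mode: LimiterMode,
        pub allowed_work_items: usize,
        pub completed_work_items: usize,
        /// Smoothed estimate of the time one work item takes
        pub smoothed_work_item_time_nanos: u64,
    }
}

// After: `Mode` instead of `LimiterMode`, `work_item` cut from field names,
// and the unit kept in the doc comment instead of a `_nanos` suffix.
mod after {
    pub enum Mode { Measure, HistoricData }
    pub struct WorkLimiter {
        /// Whether to measure the required work time, or to use the previous estimates
        pub mode: Mode,
        pub allowed: usize,
        pub completed: usize,
        /// Smoothed estimate of the time one work item takes, in nanoseconds
        pub smoothed_time: u64,
    }
}
```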
I made most of them shorter. I kept `_nanos`, since that's something I find immensely useful when looking at foreign code where the unit isn't clear, and something I constantly have to point out in reviews.
Also, please keep in mind that there is little downside to longer names as long as they don't span the full line. Code readability usually wins by having them, and code is read more often than it is written.
Force-pushed from 14237c5 to e32fa0d.
The CI failure is on the flaky h3 test. Can someone restart it?
Haven't carefully reviewed the implementation yet, but overall this looks reasonable to me. I'd be happier if we had empirical evidence of an environment where the hardcoded values are significantly bad, but that's somewhat mitigated by the good encapsulation here and the infrequent sampling. Gave CI a kick.
It seems that even though I increased the timer and gave it a lot of grace time, the test is still flaky on CI. Maybe the macOS CI runner is rather overscheduled. I will look into a way to mock time for the test to de-flake it.
Ultimately we want to get rid of these tests since they're a maintenance burden; we just need to investigate making up for any coverage losses that would entail.
Force-pushed from e32fa0d to 07d5371.
The last failure was about the newly added test, not the H3 one. I've now made it deterministic (at the cost of some other ugliness in the implementation).
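For context, a common way to make such a test deterministic is to inject the clock instead of calling `Instant::now()` directly, so the test can advance a fake clock by exact amounts. The sketch below is a guess at the general shape, not the PR's actual code; names like `now_nanos` and `elapsed_nanos` are invented:

```rust
use std::cell::Cell;
use std::rc::Rc;

/// A limiter that obtains timestamps through an injected closure
/// instead of reading the real clock directly.
struct WorkLimiter {
    now_nanos: Box<dyn Fn() -> u64>,
    cycle_start: u64,
}

impl WorkLimiter {
    fn new(now_nanos: Box<dyn Fn() -> u64>) -> Self {
        let cycle_start = now_nanos();
        Self { now_nanos, cycle_start }
    }

    fn elapsed_nanos(&self) -> u64 {
        (self.now_nanos)() - self.cycle_start
    }
}

#[test]
fn deterministic_with_mocked_time() {
    // The test owns the fake clock, so elapsed-time measurements are
    // fully reproducible regardless of how overscheduled the runner is.
    let clock = Rc::new(Cell::new(0u64));
    let c = clock.clone();
    let limiter = WorkLimiter::new(Box::new(move || c.get()));

    clock.set(30_000); // pretend one batch took exactly 30µs
    assert_eq!(limiter.elapsed_nanos(), 30_000);
}
```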
quinn/src/work_limiter.rs
Outdated
```rust
limiter.finish_cycle();

assert!(
    approximates(initial_batches, EXPECTED_INITIAL_BATCHES),
```
Do we still need `approximates` now that we're using mocked time?
I kept it around to accommodate potential rounding errors, but changed the tolerance to just 10%. I haven't tested whether it would work without it.
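The helper's actual signature isn't shown in this thread, but a tolerance-based comparison in that spirit could look like this minimal sketch:

```rust
/// Passes if `actual` is within 10% of `expected`, leaving slack
/// for rounding errors in the computed batch counts.
fn approximates(actual: u64, expected: u64) -> bool {
    let tolerance = expected / 10; // 10%
    actual >= expected.saturating_sub(tolerance) && actual <= expected + tolerance
}

fn main() {
    assert!(approximates(98, 100));  // within 10%
    assert!(!approximates(80, 100)); // off by 20%
}
```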
It works without it, so I removed it in favor of `assert_eq`.
Force-pushed from 07d5371 to c69e73c.
…::drive_recv` dynamically

This change adds a `WorkLimiter` component, which measures the amount of time required to perform some work items and will limit work based on time instead of pure iterations.

It also changes the `Endpoint`'s `drive_recv` method to limit receive operations based on the amount of time spent (50µs), using the `WorkLimiter` instead of the hardcoded `IO_LOOP_BOUND` counter.

Performance differences are negligible on this machine (probably because `IO_LOOP_BOUND` was set to a number which works for it), but it can improve things in less well-known environments.

I instrumented the endpoint's receive method to see how much time it spends on average in `drive_recv`.

**Baseline:**

```
Recv time: AvgTime { total: 3.280880841s, calls: 34559, avg: 94.935µs, min: 3.146µs, max: 312.574µs }
path: PathStats { rtt: 511.656µs,
```

**With this change:**

```
Recv time: AvgTime { total: 3.333642823s, calls: 54627, avg: 61.024µs, min: 2.645µs, max: 319.147µs }
path: PathStats { rtt: 446.641µs,
```

Note that the 50µs are not reached because a single `recvmmsg` batch takes about 30µs, so this just rounds up to 2 batches.

**When set to 200µs (for comparison purposes):**

```
Recv time: AvgTime { total: 3.243954076s, calls: 19558, avg: 165.862µs, min: 2.525µs, max: 358.711µs }
path: PathStats { rtt: 700.34µs, }
```
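As a rough illustration of the idea, here is a minimal sketch of such a time-budgeted limiter. `finish_cycle` appears in the test excerpt earlier in this thread; the other names (`start_cycle`, `allow_work`, `record_work`, `smoothed_per_item`) and all method bodies are assumptions for this sketch, not the PR's implementation:

```rust
use std::time::{Duration, Instant};

/// Limits a work loop by a time budget instead of a fixed iteration count.
struct WorkLimiter {
    budget: Duration,            // e.g. 50µs per `drive_recv` call
    cycle_start: Option<Instant>,
    smoothed_per_item: Duration, // estimate learned from measured cycles
    allowed: usize,
}

impl WorkLimiter {
    fn start_cycle(&mut self) {
        self.cycle_start = Some(Instant::now());
        // Derive how many items fit into the budget from the current estimate.
        let per_item = self.smoothed_per_item.as_nanos().max(1);
        self.allowed = (self.budget.as_nanos() / per_item) as usize;
    }

    fn allow_work(&self) -> bool {
        self.allowed > 0
    }

    fn record_work(&mut self, items: usize) {
        self.allowed = self.allowed.saturating_sub(items);
    }

    fn finish_cycle(&mut self) {
        // A real implementation would periodically re-measure the elapsed
        // time here and update `smoothed_per_item` from it.
        self.cycle_start = None;
    }
}
```

A receive loop would then call `start_cycle()` once per `drive_recv` invocation, poll batches while `allow_work()` returns true, record completed items via `record_work()`, and close with `finish_cycle()`.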
Force-pushed from c69e73c to 35904af.
I'm not sure what happened to those macOS tests; however, I feel those failures are unrelated to these changes.
Must be a macOS kernel change...
The CI failure was most likely caused by an issue in mio 0.7.12 which was picked up by the build: tokio-rs/mio#1497. That version has now been yanked, so retrying the run might lead to success.
Yep, good find.
While the hardcoded numbers might work well on some machines, they can be problematic on others. They also make it harder to distinguish how much time is spent on send vs. receive.

This introduces a dynamic `WorkLimiter` which measures the elapsed time per operation, and allows configuring how much time we intend to spend on send vs. receive. This change integrates it so far for `recv`, which is more problematic at the moment, but it can easily be extended to the send path.

Performance differences are negligible on this machine (probably because `IO_LOOP_BOUND` was set to a number which works for it), but it can improve things in less well-known environments.

I instrumented the endpoint's receive method to see how much time it spends on average in `drive_recv`.

**Baseline:**

```
Recv time: AvgTime { total: 3.280880841s, calls: 34559, avg: 94.935µs, min: 3.146µs, max: 312.574µs }
path: PathStats { rtt: 511.656µs,
```

**With this change:**

```
Recv time: AvgTime { total: 3.333642823s, calls: 54627, avg: 61.024µs, min: 2.645µs, max: 319.147µs }
path: PathStats { rtt: 446.641µs,
```

Note that the 50µs are not reached because a single `recvmmsg` batch takes about 30µs, so this just rounds up to 2 batches.

**When set to 200µs (for comparison purposes):**

```
Recv time: AvgTime { total: 3.243954076s, calls: 19558, avg: 165.862µs, min: 2.525µs, max: 358.711µs }
path: PathStats { rtt: 700.34µs, }
```
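The `AvgTime` instrumentation quoted above isn't part of the PR itself, but a recorder producing that output could look like the speculative sketch below; the field names follow the quoted debug output where possible, while `record` and `avg` are invented:

```rust
use std::time::{Duration, Instant};

/// Accumulates timing statistics across calls to a measured section.
#[derive(Debug, Default)]
struct AvgTime {
    total: Duration,
    calls: u64,
    min: Option<Duration>,
    max: Duration,
}

impl AvgTime {
    fn record(&mut self, elapsed: Duration) {
        self.total += elapsed;
        self.calls += 1;
        self.min = Some(self.min.map_or(elapsed, |m| m.min(elapsed)));
        self.max = self.max.max(elapsed);
    }

    fn avg(&self) -> Duration {
        if self.calls == 0 {
            Duration::ZERO
        } else {
            self.total / self.calls as u32
        }
    }
}

fn main() {
    let mut stats = AvgTime::default();
    let start = Instant::now();
    // ... the measured section (e.g. one `drive_recv` call) would run here ...
    stats.record(start.elapsed());
    println!("Recv time: {:?}, avg: {:?}", stats, stats.avg());
}
```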