
Dynamically determine the amount of work that can be performed in an endpoint iteration #1133

Merged
merged 1 commit into from Jun 16, 2021

Conversation

Matthias247 (Contributor)

While the hardcoded numbers might work well on some machines, they can be problematic on others. They also make it harder to distinguish how much time is spent on send vs. receive.

This introduces a dynamic WorkLimiter which measures the elapsed time per operation and allows configuring how much time we intend to spend on send vs. receive.

This change integrates it only for the recv path, which is currently the more problematic one, but it can easily be extended to the send path.
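A minimal sketch of such a time-based limiter (all names here are illustrative, not the exact quinn implementation):

```rust
use std::time::{Duration, Instant};

/// Limits work per iteration based on measured elapsed time instead of a
/// hardcoded item count. Names and structure are a sketch, not quinn's API.
struct WorkLimiter {
    /// How much time we intend to spend per cycle
    budget: Duration,
    /// Smoothed estimate of nanoseconds per work item
    avg_item_nanos: f64,
    /// Number of items permitted in the current cycle
    allowed: usize,
    completed: usize,
    cycle_start: Option<Instant>,
}

impl WorkLimiter {
    fn new(budget: Duration) -> Self {
        Self {
            budget,
            avg_item_nanos: 0.0,
            allowed: 1,
            completed: 0,
            cycle_start: None,
        }
    }

    fn start_cycle(&mut self) {
        self.completed = 0;
        self.cycle_start = Some(Instant::now());
    }

    /// Returns whether another work item still fits into the time budget.
    fn allow_work(&self) -> bool {
        self.completed < self.allowed
    }

    fn record_work(&mut self, items: usize) {
        self.completed += items;
    }

    /// Updates the per-item time estimate and derives the next cycle's allowance.
    fn finish_cycle(&mut self) {
        let elapsed = self.cycle_start.take().expect("cycle started").elapsed();
        if self.completed == 0 {
            return;
        }
        let per_item = elapsed.as_nanos() as f64 / self.completed as f64;
        // Exponentially smooth the estimate so a single outlier doesn't dominate.
        self.avg_item_nanos = if self.avg_item_nanos == 0.0 {
            per_item
        } else {
            0.9 * self.avg_item_nanos + 0.1 * per_item
        };
        self.allowed =
            ((self.budget.as_nanos() as f64 / self.avg_item_nanos).ceil() as usize).max(1);
    }
}
```

The caller wraps each batch of work in `start_cycle`/`finish_cycle` and checks `allow_work` before each item, so the iteration count adapts to how fast the machine actually processes items.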

Performance differences are negligible on this machine (probably because
IO_LOOP_BOUND was set to a number which works for it), but it can improve
things on less known environments.

I instrumented the endpoints receive method to see how much time it spends
on average in drive_recv.

Baseline:

```
Recv time: AvgTime { total: 3.280880841s, calls: 34559, avg: 94.935µs, min: 3.146µs, max: 312.574µs }
path: PathStats {
    rtt: 511.656µs,
```

With this change:

```
Recv time: AvgTime { total: 3.333642823s, calls: 54627, avg: 61.024µs, min: 2.645µs, max: 319.147µs }
path: PathStats {
    rtt: 446.641µs,
```

Note that the 50µs target is not matched exactly because a single recvmmsg batch takes about 30µs, so the limiter just rounds up to 2 batches.
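The rounding described here is just ceiling division over batches (assuming the ~30µs per recvmmsg batch measured above; the function name is illustrative):

```rust
/// Number of whole batches needed to cover the time budget: the batch that
/// crosses the budget boundary is still completed, hence ceiling division.
fn batches_for_budget(budget_nanos: u64, batch_nanos: u64) -> u64 {
    (budget_nanos + batch_nanos - 1) / batch_nanos
}
```

With a 50µs budget and ~30µs batches this yields 2 batches, i.e. roughly the 61µs average observed above.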

When set to 200µs (for comparison purposes):

```
Recv time: AvgTime { total: 3.243954076s, calls: 19558, avg: 165.862µs, min: 2.525µs, max: 358.711µs }
path: PathStats {
    rtt: 700.34µs,
}
```

@Matthias247 Matthias247 force-pushed the work_limiter2 branch 2 times, most recently from c52eb0e to 14237c5 Compare May 27, 2021 20:47
@djc (Collaborator) left a comment

The idea here makes sense to me!

Please squash these commits together, since we usually introduce new code together with the changes that use it; and please consider shortening the first line of the commit message a bit.

```rust
#[derive(Debug)]
pub struct WorkLimiter {
    /// Whether to measure the required work time, or to use the previous estimates
    mode: LimiterMode,
```
Collaborator:

Let's make these field names a bit more concise, similar to most of the code. LimiterMode can just be called Mode (it's private anyway) and we can cut work_item from all field names. I think we can also do without the _nanos suffix, keeping it as documentation only.

Matthias247 (Contributor, Author):

I made most of them shorter. I kept _nanos since this is really something that I find immensely useful when looking at foreign code where the unit isn't clear, and that I have to point out in reviews constantly.

Also please keep in mind that there is little downside to longer names as long as they don't span the full line. Code readability wins in most cases by having them, and code is read more often than written.

(Six review threads on quinn/src/work_limiter.rs, all resolved.)
@Matthias247 (Contributor, Author)

CI failure is on the flaky h3 test. Can someone restart it?

@Ralith (Collaborator) commented May 30, 2021

Haven't carefully reviewed the implementation yet, but overall this looks reasonable to me. I'd be happier if we had empirical evidence of an environment where the hardcoded values are significantly bad, but that's somewhat mitigated by the good encapsulation here and the infrequent sampling. Gave CI a kick.

@Matthias247 (Contributor, Author)

Seems like even though I increased the timer and gave it a lot of grace time, the test is still flaky on CI. Maybe the macOS CI runner is rather overscheduled. I will look into a way to mock time for the test to de-flake it.

@Ralith (Collaborator) commented May 31, 2021

Ultimately we want to get rid of these tests since they're a maintenance burden; we just need to investigate making up for any coverage losses that would entail.

@Matthias247 (Contributor, Author)

The last failure was about the newly added test, not the H3 one. I have now made it deterministic (at the cost of some other ugliness in the implementation).
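One common way to make such a timing-dependent test deterministic is to inject the clock instead of reading `Instant::now()` directly, so the test can advance time manually. A sketch with hypothetical names:

```rust
use std::cell::Cell;
use std::time::{Duration, Instant};

/// Clock abstraction: production code uses the real clock, tests a mock.
trait Clock {
    fn now(&self) -> Instant;
}

/// A mock clock that only moves when the test tells it to.
struct MockClock {
    start: Instant,
    offset: Cell<Duration>,
}

impl MockClock {
    fn new() -> Self {
        Self {
            start: Instant::now(),
            offset: Cell::new(Duration::ZERO),
        }
    }

    /// Advances the mocked time by `d` without any real waiting.
    fn advance(&self, d: Duration) {
        self.offset.set(self.offset.get() + d);
    }
}

impl Clock for MockClock {
    fn now(&self) -> Instant {
        self.start + self.offset.get()
    }
}
```

A limiter parameterized over such a `Clock` sees exactly the elapsed times the test scripts, so assertions no longer depend on CI scheduling.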

Ralith
Ralith previously approved these changes Jun 12, 2021
```rust
limiter.finish_cycle();

assert!(
    approximates(initial_batches, EXPECTED_INITIAL_BATCHES),
```
Collaborator:

Do we still need `approximates` now that we're using mocked time?

Matthias247 (Contributor, Author):

I kept it around to accommodate potential rounding errors, but changed the tolerance to just 10%. I haven't tested whether it would work without it.
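A 10%-tolerance helper like `approximates` might look like this (an illustrative sketch; the actual test helper may differ):

```rust
/// Returns whether `actual` is within 10% of `expected`.
/// Useful when rounding may shift a measured count slightly.
fn approximates(actual: u64, expected: u64) -> bool {
    const TOLERANCE: f64 = 0.10;
    let diff = (actual as f64 - expected as f64).abs();
    diff <= expected as f64 * TOLERANCE
}
```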

Matthias247 (Contributor, Author):

Works without it, so I removed it in favor of `assert_eq`.

…::drive_recv` dynamically

This change adds a `WorkLimiter` component, which measures the amount of
time required to perform some work items and will limit work based on time
instead of pure iterations.

It also changes the `Endpoint`'s `drive_recv` method to limit receive operations
based on the amount of time spent (to 50µs) using the `WorkLimiter`, instead
of using the hardcoded `IO_LOOP_BOUND` counter.

Performance differences are negligible on this machine (probably because
`IO_LOOP_BOUND` was set to a number which works for it), but it can improve
things on less known environments.

I instrumented the endpoints receive method to see how much time it spends
on average in `drive_recv`.

**Baseline:**
```
Recv time: AvgTime { total: 3.280880841s, calls: 34559, avg: 94.935µs, min: 3.146µs, max: 312.574µs }
path: PathStats {
    rtt: 511.656µs,
```

**With this change:**
```
Recv time: AvgTime { total: 3.333642823s, calls: 54627, avg: 61.024µs, min: 2.645µs, max: 319.147µs }
path: PathStats {
    rtt: 446.641µs,
```
Note that the 50µs target is not matched exactly because a single `recvmmsg` batch takes about 30µs, so the limiter just rounds up to 2 batches.

**When set to 200µs (for comparison purposes):**
```
Recv time: AvgTime { total: 3.243954076s, calls: 19558, avg: 165.862µs, min: 2.525µs, max: 358.711µs }
path: PathStats {
    rtt: 700.34µs,
}
```
@Matthias247 (Contributor, Author)

I'm not sure what happened to those macOS tests:

```
---- tests::echo_dualstack stdout ----
thread 'tests::echo_dualstack' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 22, kind: InvalidInput, message: "Invalid argument" }', quinn/src/tests.rs:597:56
```

However, I feel those failures are unrelated to these changes.

@Ralith (Collaborator) commented Jun 12, 2021

Must be a macOS kernel change...

@Matthias247 (Contributor, Author)

CI failure was most likely caused by an issue in mio 0.7.12 which was picked up by the build: tokio-rs/mio#1497

That version is now yanked, so retrying the run might lead to success

@Ralith (Collaborator) commented Jun 13, 2021

Yep, good find.

@djc djc merged commit 62ed324 into quinn-rs:main Jun 16, 2021
@Matthias247 Matthias247 deleted the work_limiter2 branch June 21, 2021 18:24