Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ready-cache: Ensure cancelation can be observed #668

Merged
merged 10 commits into from Jun 17, 2022
Merged

ready-cache: Ensure cancelation can be observed #668

merged 10 commits into from Jun 17, 2022

Conversation

olix0r
Copy link
Collaborator

@olix0r olix0r commented Jun 17, 2022

tokio::task enforces a cooperative scheduling regime that can cause
oneshot::Receiver::poll to return pending after the sender has sent an
update. ReadyCache uses a oneshot to notify pending services that they
should not become ready. When a cancelation is not observed, the ready
cache return service instances that should have been canceled, which
breaks assumptions and causes an invalid state.

Fixes #415

Co-authored-by: Eliza Weisman eliza@buoyant.io


This is based on tower-0.4.12. We probably need to make a an 0.4 branch to get a bugfix release out.

`tokio::task` enforces a cooperative scheduling regime that can cause
`oneshot::Receiver::poll` to return pending after the sender has sent an
update. `ReadyCache` uses a oneshot to notify pending services that they
should not become ready. When a cancelation is not observed, the ready
cache return service instances that should have been canceled, which
breaks assumptions and causes an invalid state.

Fixes #415

Co-authored-by: Eliza Weisman <eliza@buoyant.io>
@olix0r olix0r requested review from jonhoo and hawkw June 17, 2022 01:40
@olix0r olix0r marked this pull request as draft June 17, 2022 01:40
olix0r added a commit to linkerd/linkerd2-proxy that referenced this pull request Jun 17, 2022
tower-rs/tower#668 fixes a bug that can cause the load balancer to fail
to properly observe cancelations when an endpoint is replaced or
removed.

This change updates the proxy to use tower from a fixed git SHA.

Signed-off-by: Oliver Gould <ver@buoyant.io>
olix0r added a commit to linkerd/linkerd2-proxy that referenced this pull request Jun 17, 2022
tower-rs/tower#668 fixes a bug that can cause the load balancer to fail
to properly observe cancelations when an endpoint is replaced or
removed.

This change updates the proxy to use tower from a fixed git SHA.

Signed-off-by: Oliver Gould <ver@buoyant.io>
@carllerche points out that, even with `unconstrained`, there are no guarantees
that a `oneshot::Receiver` will observe the sent value immediately, even
when using `tokio::task::unconstrained`.
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great, i think we should go ahead and merge it, release a 0.4.13, and then get it forward-ported to master!

i commented on some very minor style nits, but there are no blockers here IMO!

tower/src/ready_cache/cache.rs Outdated Show resolved Hide resolved
if let Poll::Ready(r) = Pin::new(&mut fut).poll(cx) {
// This MUST return ready as soon as the sender has been notified so
// that we don't return a service that has been canceled, so we disable
// cooperative scheduling on the receiver. Otherwise, the receiver can
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiny nit:

Suggested change
// cooperative scheduling on the receiver. Otherwise, the receiver can
// cooperative yielding on the receiver. Otherwise, the receiver can

it's all cooperative scheduling :)

Comment on lines 400 to 405
// This MUST return ready as soon as the sender has been notified so
// that we don't return a service that has been canceled, so we disable
// cooperative scheduling on the receiver. Otherwise, the receiver can
// sporadically return pending even though the sender has fired.
let mut cancel =
tokio::task::unconstrained(self.cancel.as_mut().expect("polled after complete"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit, take it or leave it: might consider something like this, so that it's very obvious that the comment is referring to the use of unconstrained:

Suggested change
// This MUST return ready as soon as the sender has been notified so
// that we don't return a service that has been canceled, so we disable
// cooperative scheduling on the receiver. Otherwise, the receiver can
// sporadically return pending even though the sender has fired.
let mut cancel =
tokio::task::unconstrained(self.cancel.as_mut().expect("polled after complete"));
let cancel = self.cancel.as_mut().expect("polled after complete");
// This MUST return ready as soon as the sender has been notified so
// that we don't return a service that has been canceled, so we disable
// cooperative scheduling on the receiver. Otherwise, the receiver can
// sporadically return pending even though the sender has fired.
let mut cancel = tokio::task::unconstrained(cancel);

Signed-off-by: Oliver Gould <ver@buoyant.io>
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think this use of Notify is actually correct since we are dropping the Notified future at the end of the poll. i have a thought on what I think is the best way to do this, i can open a PR against this branch?

Comment on lines +404 to +406
/// A `tokio::sync::oneshot` is NOT used, as a `Receiver` is not guaranteed to
/// observe results as soon as a `Sender` fires. Using an `AtomicBool` allows
/// the state to be observed as soon as the cancelation is triggered.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for the comment

tower/src/ready_cache/cache.rs Outdated Show resolved Hide resolved
tower/src/ready_cache/cache.rs Outdated Show resolved Hide resolved
tower/src/ready_cache/cache.rs Outdated Show resolved Hide resolved
@hawkw
Copy link
Member

hawkw commented Jun 17, 2022

@olix0r wdyt of #669 ?

hawkw and others added 3 commits June 17, 2022 10:01
This replaces the use of `tokio::sync::Notify` with
`futures::task::AtomicWaker`. `Notify` requires the `Notified` future to
be held in order to remain interested in the notification. Dropping the
future returned by `Notify::notified` cancels interest in the
notification. So, the current code using `Notify` is incorrect, as the
`Notified` future is created on each poll and then immediately dropped,
releasing interest in the wakeup.

We could solve this by storing the `Notify::notified` future in the
`Pending` future, but this would be a bit of a pain, as the `Notified`
future borrows the `Notify`. I thought that it was a nicer alternative
to just rewrite this code to use `AtomicWaker` instead.

Also, `AtomicWaker` is a much simpler, lower-level primitive. It simply
stores a single waker that can wake up a single task. `Notify` is
capable of waking multiple tasks, either in order or all at once,
which makes it more complex.
@hawkw hawkw marked this pull request as ready for review June 17, 2022 18:27
@hawkw hawkw changed the title ready-cache: Ensure cancelation updates can be observed ready-cache: Ensure cancelation can be observed Jun 17, 2022
Signed-off-by: Oliver Gould <ver@buoyant.io>
@hawkw hawkw merged commit edd922d into master Jun 17, 2022
@hawkw hawkw deleted the ver/rc-notify branch June 17, 2022 19:06
hawkw added a commit that referenced this pull request Jun 17, 2022
`tokio::task` enforces a cooperative scheduling regime that can cause
`oneshot::Receiver::poll` to return pending after the sender has sent an
update. `ReadyCache` uses a oneshot to notify pending services that they
should not become ready. When a cancelation is not observed, the ready
cache return service instances that should have been canceled, which
breaks assumptions and causes an invalid state.

This branch replaces the use of `tokio::sync::oneshot` for canceling 
pending futures with a custom cancelation handle using an `AtomicBool`
and `futures::task::AtomicWaker`. This ensures that canceled `Pending`
services are always woken even when the task's budget is exceeded.
Additionally, cancelation status is now always known to the `Pending`
future, by checking the `AtomicBool` immediately on polls, even in cases
where the canceled `Pending` future was woken by the inner `Service`
becoming ready, rather than by the cancelation.

Fixes #415

Signed-off-by: Oliver Gould <ver@buoyant.io>
Co-authored-by: Eliza Weisman <eliza@buoyant.io>
hawkw added a commit that referenced this pull request Jun 17, 2022
`tokio::task` enforces a cooperative scheduling regime that can cause
`oneshot::Receiver::poll` to return pending after the sender has sent an
update. `ReadyCache` uses a oneshot to notify pending services that they
should not become ready. When a cancelation is not observed, the ready
cache return service instances that should have been canceled, which
breaks assumptions and causes an invalid state.

This branch replaces the use of `tokio::sync::oneshot` for canceling 
pending futures with a custom cancelation handle using an `AtomicBool`
and `futures::task::AtomicWaker`. This ensures that canceled `Pending`
services are always woken even when the task's budget is exceeded.
Additionally, cancelation status is now always known to the `Pending`
future, by checking the `AtomicBool` immediately on polls, even in cases
where the canceled `Pending` future was woken by the inner `Service`
becoming ready, rather than by the cancelation.

Fixes #415

Signed-off-by: Oliver Gould <ver@buoyant.io>
Co-authored-by: Eliza Weisman <eliza@buoyant.io>
hawkw added a commit that referenced this pull request Jun 17, 2022
# 0.4.13 (June 17, 2022)

### Added

- **load_shed**: Public constructor for `Overloaded` error ([#661])

### Fixed

- **util**: Fix hang with `call_all` when the `Stream` of requests is
  pending ([#656])
- **ready_cache**: Ensure cancelation is observed by pending services
  ([#668], fixes [#415])
- **docs**: Fix a missing section header due to a typo ([#646])
- **docs**: Fix broken links to `Service` trait ([#659])

[#661]: #661
[#656]: #656
[#668]: #668
[#415]: #415
[#646]: #646
[#659]: #659
hawkw added a commit that referenced this pull request Jun 17, 2022
# 0.4.13 (June 17, 2022)

### Added

- **load_shed**: Public constructor for `Overloaded` error ([#661])

### Fixed

- **util**: Fix hang with `call_all` when the `Stream` of requests is
  pending ([#656])
- **ready_cache**: Ensure cancelation is observed by pending services
  ([#668], fixes [#415])
- **docs**: Fix a missing section header due to a typo ([#646])
- **docs**: Fix broken links to `Service` trait ([#659])

[#661]: #661
[#656]: #656
[#668]: #668
[#415]: #415
[#646]: #646
[#659]: #659
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panicked at 'missing cancelation' in tower-ready-cache
2 participants