
outbound: implement OutboundPolicy retries #2446

Draft · wants to merge 31 commits into main
Conversation

@hawkw (Member) commented Jul 26, 2023

Depends on linkerd/linkerd2-proxy-api#256

This branch changes the outbound proxy to honor the retry configurations
added to the proxy API in linkerd/linkerd2-proxy-api#256. In particular,
this involves the following:

  • Handling the new retry configuration messages in the
    OutboundPolicies client and converting them to internal
    representations,
  • Changing the way the outbound proxy configures its retry middleware to
    support both ServiceProfiles and OutboundPolicies-based retry
    configurations,
  • Tracking the number of times a given request has been retried in the
    retry middleware, in order to support the per-request retry limit
    provided by the proxy API (see the sketch after this list),
  • Actually adding a retry layer in the outbound HTTP policy stack's
    MatchedRoute stack,
  • Testing
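
As a rough illustration of how the per-request retry limit could be tracked (the sketch referenced in the third bullet above), here's one possible approach using an `http` request extension. `RetryCount`, `check_and_increment`, and `max_retries_per_request` are hypothetical names for illustration only, not the proxy's actual types:

```rust
use http::Request;

/// Hypothetical extension recording how many times this request has already
/// been retried.
#[derive(Clone, Copy, Debug, Default)]
struct RetryCount(u32);

/// Returns `true` if the request is still allowed to be retried, bumping the
/// recorded count so that the next attempt sees the updated value.
fn check_and_increment<B>(req: &mut Request<B>, max_retries_per_request: u32) -> bool {
    // Read the current count (zero on the first attempt).
    let count = req
        .extensions()
        .get::<RetryCount>()
        .copied()
        .unwrap_or_default()
        .0;
    if count >= max_retries_per_request {
        return false;
    }
    // Record the new attempt on the request before it is cloned and replayed.
    req.extensions_mut().insert(RetryCount(count + 1));
    true
}
```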

In addition, supporting the intended behavior of the timeouts.backendRequest
timeout configured on HTTPRoutes, where requests that hit this timeout can be
retried if 504 status codes are retryable, also required adding a small
errors::Respond middleware to the MatchedBackend stack, above the
backend-request timeout. This is because our retry policies currently only
retry failure classifications generated from real responses (i.e., failure
status codes), and do not retry any Rust `Err`s:

```rust
let retryable = match result {
    Err(_) => false,
    Ok(rsp) => {
        // is the request a failure?
        let is_failure = classify::Request::from(self.response_classes.clone())
            .classify(req)
            .start(rsp)
            .eos(rsp.body().trailers())
            .is_failure();
        // did the body exceed the maximum length limit?
        let exceeded_max_len = req.body().is_capped();
        let retryable = is_failure && !exceeded_max_len;
        tracing::trace!(is_failure, exceeded_max_len, retryable);
        retryable
    }
};
```

Because Errors generated by the backend-request timeout are currently only
converted to synthesized HTTP error responses by the ServerRescue layer, near
the top of the outbound proxy stack, these timeouts are still Errs by the time
the retry layer in the MatchedRoute stack sees them. To make backend-request
timeouts retryable, we add a very small error-response layer that just
synthesizes HTTP 504s from the timeout error returned by the backend-request
timeout middleware.
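
For illustration, a minimal sketch of the kind of response synthesis such a layer performs is shown below. The `BackendRequestTimeoutError` marker type and the helper functions are hypothetical stand-ins for the proxy's actual error types and errors::Respond machinery, not the code in this branch:

```rust
use http::{Response, StatusCode};
use std::error::Error;

/// Hypothetical marker error emitted by the backend-request timeout middleware.
#[derive(Debug)]
struct BackendRequestTimeoutError;

impl std::fmt::Display for BackendRequestTimeoutError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "backend request timed out")
    }
}

impl Error for BackendRequestTimeoutError {}

/// Walks the error's source chain looking for the timeout marker.
fn is_backend_request_timeout(error: &(dyn Error + 'static)) -> bool {
    let mut current = Some(error);
    while let Some(e) = current {
        if e.is::<BackendRequestTimeoutError>() {
            return true;
        }
        current = e.source();
    }
    false
}

/// If the error is a backend-request timeout, synthesize an HTTP 504 so the
/// retry layer above sees an `Ok(Response)` with a retryable status instead of
/// an `Err`; otherwise, let the error propagate to the upper rescue layers.
fn rescue_timeout<B: Default>(error: &(dyn Error + 'static)) -> Option<Response<B>> {
    if !is_backend_request_timeout(error) {
        return None;
    }
    let rsp = Response::builder()
        .status(StatusCode::GATEWAY_TIMEOUT)
        .body(B::default())
        .expect("a status code and an empty body are always a valid response");
    Some(rsp)
}
```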

There are a couple of alternatives to this approach:

  • Moving the actual timeout layer for the backend-request timeout to be
    below the ClientRescue layer in the endpoint stack, like I initially
    proposed in #2419 (outbound: implement OutboundPolicies backend request
    timeouts). This would allow these errors to be converted to
    synthesized HTTP responses long before they're seen by the retry
    layer. However, the code for applying this timeout in the endpoint
    stack, which involved storing the timeout in a request extension in
    the MatchedBackend stack (where we actually know the configured
    timeout value), was considered too complex previously, so I didn't
    want to go with this approach.
  • Changing the retry policy so that some Err(Error) values can be
    retried, in addition to Ok(Response)s which are considered failures.
    In order to allow the user to configure which failures are retryable
    using the current configuration interface, which is status-code-based,
    this would require making our StatusRanges response classifier aware
    of how Errors map to HTTP status codes...which is what the
    errors::Respond middleware already does when synthesizing error
    responses. So, it seemed nicer to not duplicate that information (a
    sketch of the status-range classification follows below).

However, I'm open to alternative suggestions for this.
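
For context on the status-code-based interface mentioned in the second alternative, here is a hypothetical, heavily simplified sketch of what a status-range classifier boils down to; this is an illustration, not the proxy's actual StatusRanges implementation:

```rust
use http::StatusCode;
use std::ops::RangeInclusive;

/// Hypothetical simplification of a status-range classifier: a response is
/// classified as a failure (and is therefore potentially retryable) if its
/// status code falls within any of the configured ranges.
struct StatusRanges(Vec<RangeInclusive<u16>>);

impl StatusRanges {
    fn is_failure(&self, status: StatusCode) -> bool {
        self.0.iter().any(|range| range.contains(&status.as_u16()))
    }
}

// For example, retrying on all 5xx responses:
//
//     let retryable = StatusRanges(vec![500..=599]);
//     assert!(retryable.is_failure(StatusCode::GATEWAY_TIMEOUT));
//
// Note that this interface only knows about status codes; it has no way to
// classify a Rust `Err`, which is why retrying `Err`s would require teaching
// it how errors map to statuses.
```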

Comment on lines 577 to 581
```rust
// these tests don't currently work as we would expect, because backend
// request timeouts are not turned into synthesized 504s until a layer above the
// retry layer. instead of being `Ok(Response)`s with a failure status, they're
// `Err(Error)`s when the retry layer sees them, so they don't get retried.
// TODO(eliza): figure that out
```
@hawkw (Member Author):

this will require either moving the actual backend_request timeout below the ClientRescue layer, so that the timeout can be turned into a 504 (like I initially wanted to do in #2419 before I was talked out of it), or adding a special errors::respond layer just for turning that timeout into a 504, which feels gross...

@hawkw marked this pull request as ready for review July 27, 2023 19:42
@hawkw requested a review from a team as a code owner July 27, 2023 19:42
@hawkw changed the title from "[WIP] implement OutboundPolicy retries" to "outbound: implement OutboundPolicy retries" Jul 27, 2023
@hawkw requested review from olix0r and adleong July 27, 2023 19:42
```rust
///
/// This is necessary because we want these timeouts to be retried, and the
/// retry layer only retries requests that fail with an HTTP status code, rather
/// than for requests that fail with Rust `Err`s. Errors returned by the
```
Member:
If we're doing this to satisfy a deficiency of the retry layer, is there any reason that we can't move this mapping closer to that decision? I.e. why does this belong in the backend stack? Why does the backend stack care about the retry layer?

Furthermore, how does this interact with error metrics? How/where are request timeout errors recorded? Is it consistent with other error metrics?

@hawkw (Member Author):

> If we're doing this to satisfy a deficiency of the retry layer, is there any reason that we can't move this mapping closer to that decision? I.e. why does this belong in the backend stack? Why does the backend stack care about the retry layer?

This is a good point. IIRC, I don't think there's any particular reason for this to be in the backend stack rather than in the route stack right below the retry layer. I'll see about moving it.

> Furthermore, how does this interact with error metrics? How/where are request timeout errors recorded? Is it consistent with other error metrics?

Regarding error metrics, I believe we don't currently track error metrics per-backend, only request count metrics. Therefore, the timeout will only be recorded by error metrics if it is not retried and the request fails. Because the design for HTTPRoute outbound policy metrics is still ongoing, I didn't implement any new metrics as part of this PR. When we do add HTTPRoute-specific error metrics, I would expect these timeouts to be recorded by any per-backend error counters but not any per-route error counters if the timeout is retried.

Comment on lines 149 to 153
```rust
// Eagerly synthesize 504 responses for backend_request timeout
// errors.
// See the doc comment on `TimeoutRescue` for details.
.push(TimeoutRescue::layer())
.push_on_service(http::BoxResponse::<_>::layer())
```
Member:

This is unrelated to this change but highlights something that may appear spooky to inner layers: the timeout layer will simply drop inner request futures, so inner layers (like the balancer) need to handle being dropped as a timeout of some sort. It would be better for us to move the timeout stack down into the endpoint (probably by fetching timeouts out of a request extension).

@hawkw (Member Author) commented Aug 14, 2023:

Moving timeouts into the endpoint stack using a request extension was the initial design I used for #2419 (see commit 64aca6a).

At the time, the conclusion in code review was that this approach was too complex. I'd be happy to bring that back, but I think it would make sense to do so in a separate PR?
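
For reference, a minimal sketch of the request-extension approach under discussion, with hypothetical names rather than the actual code from #2419 (commit 64aca6a): the timeout is recorded on the request where the configured value is known and read back by a timeout middleware lower in the stack.

```rust
use http::Request;
use std::time::Duration;

/// Hypothetical extension carrying the backend-request timeout chosen in the
/// `MatchedBackend` stack (where the configured value is known).
#[derive(Clone, Copy, Debug)]
struct BackendRequestTimeout(Duration);

/// Called in the backend stack, where the policy's timeout is available.
fn set_backend_timeout<B>(req: &mut Request<B>, timeout: Duration) {
    req.extensions_mut().insert(BackendRequestTimeout(timeout));
}

/// Called by a timeout middleware lower in the endpoint stack, below the
/// rescue layer that synthesizes error responses.
fn backend_timeout<B>(req: &Request<B>) -> Option<Duration> {
    req.extensions()
        .get::<BackendRequestTimeout>()
        .map(|BackendRequestTimeout(timeout)| *timeout)
}
```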


```diff
 pub fn layer<N>(
-    metrics: metrics::HttpProfileRouteRetry,
+    metrics: Option<metrics::HttpProfileRouteRetry>,
```
Member:
I know that the existing retry metrics are a bit odd, but do we understand what visibility we are sacrificing by not porting them forward? What questions can we no longer answer for policy retries?

@olix0r marked this pull request as draft November 17, 2023 01:05