
outbound: implement OutboundPolicy retries #2446

Draft · wants to merge 31 commits into main
Conversation

@hawkw (Member) commented Jul 26, 2023

Depends on linkerd/linkerd2-proxy-api#256

This branch changes the outbound proxy to honor the retry configurations
added to the proxy API in linkerd/linkerd2-proxy-api#256. In particular,
this involves the following:

  • Handling the new retry configuration messages in the
    OutboundPolicies client and converting them to internal
    representations,
  • Changing the way the outbound proxy configures its retry middleware to
    support both ServiceProfiles and OutboundPolicies-based retry
    configurations,
  • Tracking the number of times a given request has been retried in the
    retry middleware, in order to support the per-request retry limit
    provided by the proxy API (see the sketch after this list),
  • Actually adding a retry layer in the outbound HTTP policy stack's
    MatchedRoute stack,
  • Testing
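
As a rough illustration of how the per-request retry limit could be tracked (the sketch referenced in the third bullet above), here's one possible approach using an `http` request extension. `RetryCount`, `check_and_increment`, and `max_retries_per_request` are hypothetical names for illustration only, not the proxy's actual types:

```rust
use http::Request;

/// Hypothetical extension recording how many times this request has already
/// been retried.
#[derive(Clone, Copy, Debug, Default)]
struct RetryCount(u32);

/// Returns `true` if the request is still allowed to be retried, bumping the
/// recorded count so that the next attempt sees the updated value.
fn check_and_increment<B>(req: &mut Request<B>, max_retries_per_request: u32) -> bool {
    // Read the current count (zero on the first attempt).
    let count = req
        .extensions()
        .get::<RetryCount>()
        .copied()
        .unwrap_or_default()
        .0;
    if count >= max_retries_per_request {
        return false;
    }
    // Record the new attempt on the request before it is cloned and replayed.
    req.extensions_mut().insert(RetryCount(count + 1));
    true
}
```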

In addition, supporting the intended behavior of the timeouts.backendRequest
timeout configured on HTTPRoutes, where requests that hit this timeout can be
retried if 504 status codes are retryable, also required adding a small
errors::Respond middleware to the MatchedBackend stack, above the
backend-request timeout. This is because our retry policies currently only
retry failure classifications generated from real responses (i.e., failure
status codes), and do not retry any Rust `Err`s:

```rust
let retryable = match result {
    Err(_) => false,
    Ok(rsp) => {
        // is the request a failure?
        let is_failure = classify::Request::from(self.response_classes.clone())
            .classify(req)
            .start(rsp)
            .eos(rsp.body().trailers())
            .is_failure();
        // did the body exceed the maximum length limit?
        let exceeded_max_len = req.body().is_capped();
        let retryable = is_failure && !exceeded_max_len;
        tracing::trace!(is_failure, exceeded_max_len, retryable);
        retryable
    }
};
```

Because Errors generated by the backend-request timeout are currently only
converted to synthesized HTTP error responses by the ServerRescue layer, near
the top of the outbound proxy stack, these timeouts are still Errs by the time
the retry layer in the MatchedRoute stack sees them. To make backend-request
timeouts retryable, we add a very small error-response layer that just
synthesizes HTTP 504s from the timeout error returned by the backend-request
timeout middleware.
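
For illustration, a minimal sketch of the kind of response synthesis such a layer performs is shown below. The `BackendRequestTimeoutError` marker type and the helper functions are hypothetical stand-ins for the proxy's actual error types and errors::Respond machinery, not the code in this branch:

```rust
use http::{Response, StatusCode};
use std::error::Error;

/// Hypothetical marker error emitted by the backend-request timeout middleware.
#[derive(Debug)]
struct BackendRequestTimeoutError;

impl std::fmt::Display for BackendRequestTimeoutError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "backend request timed out")
    }
}

impl Error for BackendRequestTimeoutError {}

/// Walks the error's source chain looking for the timeout marker.
fn is_backend_request_timeout(error: &(dyn Error + 'static)) -> bool {
    let mut current = Some(error);
    while let Some(e) = current {
        if e.is::<BackendRequestTimeoutError>() {
            return true;
        }
        current = e.source();
    }
    false
}

/// If the error is a backend-request timeout, synthesize an HTTP 504 so the
/// retry layer above sees an `Ok(Response)` with a retryable status instead of
/// an `Err`; otherwise, let the error propagate to the upper rescue layers.
fn rescue_timeout<B: Default>(error: &(dyn Error + 'static)) -> Option<Response<B>> {
    if !is_backend_request_timeout(error) {
        return None;
    }
    let rsp = Response::builder()
        .status(StatusCode::GATEWAY_TIMEOUT)
        .body(B::default())
        .expect("a status code and an empty body are always a valid response");
    Some(rsp)
}
```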

There are a couple of alternatives to this approach:

  • Moving the actual timeout layer for the backend-request timeout to be
    below the ClientRescue layer in the endpoint stack, like I initially
    proposed in #2419 (outbound: implement OutboundPolicies backend request
    timeouts). This would allow these errors to be converted to
    synthesized HTTP responses long before they're seen by the retry
    layer. However, the code for applying this timeout in the endpoint
    stack, which involved storing the timeout in a request extension in
    the MatchedBackend stack (where we actually know the configured
    timeout value), was considered too complex previously, so I didn't
    want to go with this approach.
  • Changing the retry policy so that some Err(Error) values can be
    retried, in addition to Ok(Response)s which are considered failures.
    In order to allow the user to configure which failures are retryable
    using the current configuration interface, which is status-code-based,
    this would require making our StatusRanges response classifier aware
    of how Errors map to HTTP status codes...which is what the
    errors::Respond middleware already does when synthesizing error
    responses. So, it seemed nicer to not duplicate that information (a
    sketch of the status-range classification follows below).

However, I'm open to alternative suggestions for this.
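
For context on the status-code-based interface mentioned in the second alternative, here is a hypothetical, heavily simplified sketch of what a status-range classifier boils down to; this is an illustration, not the proxy's actual StatusRanges implementation:

```rust
use http::StatusCode;
use std::ops::RangeInclusive;

/// Hypothetical simplification of a status-range classifier: a response is
/// classified as a failure (and is therefore potentially retryable) if its
/// status code falls within any of the configured ranges.
struct StatusRanges(Vec<RangeInclusive<u16>>);

impl StatusRanges {
    fn is_failure(&self, status: StatusCode) -> bool {
        self.0.iter().any(|range| range.contains(&status.as_u16()))
    }
}

// For example, retrying on all 5xx responses:
//
//     let retryable = StatusRanges(vec![500..=599]);
//     assert!(retryable.is_failure(StatusCode::GATEWAY_TIMEOUT));
//
// Note that this interface only knows about status codes; it has no way to
// classify a Rust `Err`, which is why retrying `Err`s would require teaching
// it how errors map to statuses.
```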

Comment on lines 577 to 581
```rust
// these tests don't currently work as we would expect, because backend
// request timeouts are not turned into synthesized 504s until a layer above the
// retry layer. instead of being `Ok(Response)`s with a failure status, they're
// `Err(Error)`s when the retry layer sees them, so they don't get retried.
// TODO(eliza): figure that out
```
@hawkw (Member Author):

this will require either moving the actual backend_request timeout below the ClientRescue layer, so that the timeout can be turned into a 504 (like I initially wanted to do in #2419 before I was talked out of it), or adding a special errors::respond layer just for turning that timeout into a 504, which feels gross...

@hawkw marked this pull request as ready for review July 27, 2023 19:42
@hawkw requested a review from a team as a code owner July 27, 2023 19:42
@hawkw changed the title from "[WIP] implement OutboundPolicy retries" to "outbound: implement OutboundPolicy retries" Jul 27, 2023
@hawkw requested review from olix0r and adleong July 27, 2023 19:42
```rust
///
/// This is necessary because we want these timeouts to be retried, and the
/// retry layer only retries requests that fail with an HTTP status code, rather
/// than for requests that fail with Rust `Err`s. Errors returned by the
```
Member:
If we're doing this to satisfy a deficiency of the retry layer, is there any reason that we can't move this mapping closer to that decision? I.e. why does this belong in the backend stack? Why does the backend stack care about the retry layer?

Furthermore, how does this interact with error metrics? How/where are request timeout errors recorded? Is it consistent with other error metrics?

@hawkw (Member Author):

> If we're doing this to satisfy a deficiency of the retry layer, is there any reason that we can't move this mapping closer to that decision? I.e. why does this belong in the backend stack? Why does the backend stack care about the retry layer?

This is a good point. IIRC, I don't think there's any particular reason for this to be in the backend stack rather than in the route stack right below the retry layer. I'll see about moving it.

> Furthermore, how does this interact with error metrics? How/where are request timeout errors recorded? Is it consistent with other error metrics?

Regarding error metrics, I believe we don't currently track error metrics per-backend, only request count metrics. Therefore, the timeout will only be recorded by error metrics if it is not retried and the request fails. Because the design for HTTPRoute outbound policy metrics is still ongoing, I didn't implement any new metrics as part of this PR. When we do add HTTPRoute-specific error metrics, I would expect these timeouts to be recorded by any per-backend error counters but not any per-route error counters if the timeout is retried.

Comment on lines 149 to 153
```rust
// Eagerly synthesize 504 responses for backend_request timeout
// errors.
// See the doc comment on `TimeoutRescue` for details.
.push(TimeoutRescue::layer())
.push_on_service(http::BoxResponse::<_>::layer())
```
Member:

This is unrelated to this change but highlights something that may appear spooky to inner layers: the timeout layer will simply drop inner request futures, so inner layers (like the balancer) need to handle being dropped as a timeout of some sort. It would be better for us to move the timeout stack down into the endpoint (probably by fetching timeouts out of a request extension).

@hawkw (Member Author) commented Aug 14, 2023:

Moving timeouts into the endpoint stack using a request extension was the initial design I used for #2419 (see commit 64aca6a).

At the time, the conclusion in code review was that this approach was too complex. I'd be happy to bring that back, but I think it would make sense to do so in a separate PR?
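
For reference, a minimal sketch of the request-extension approach under discussion, with hypothetical names rather than the actual code from #2419 (commit 64aca6a): the timeout is recorded on the request where the configured value is known and read back by a timeout middleware lower in the stack.

```rust
use http::Request;
use std::time::Duration;

/// Hypothetical extension carrying the backend-request timeout chosen in the
/// `MatchedBackend` stack (where the configured value is known).
#[derive(Clone, Copy, Debug)]
struct BackendRequestTimeout(Duration);

/// Called in the backend stack, where the policy's timeout is available.
fn set_backend_timeout<B>(req: &mut Request<B>, timeout: Duration) {
    req.extensions_mut().insert(BackendRequestTimeout(timeout));
}

/// Called by a timeout middleware lower in the endpoint stack, below the
/// rescue layer that synthesizes error responses.
fn backend_timeout<B>(req: &Request<B>) -> Option<Duration> {
    req.extensions()
        .get::<BackendRequestTimeout>()
        .map(|BackendRequestTimeout(timeout)| *timeout)
}
```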


```diff
 pub fn layer<N>(
-    metrics: metrics::HttpProfileRouteRetry,
+    metrics: Option<metrics::HttpProfileRouteRetry>,
```
Member:
I know that the existing retry metrics are a bit odd, but do we understand what visibility we are sacrificing by not porting them forward? What questions can we no longer answer for policy retries?

@olix0r marked this pull request as draft November 17, 2023 01:05