Default retry behaviour includes HTTP 4xx client errors #833

rahim · 2023-09-25T23:14:07Z

See fog/fog-aws#690 for background on how I stumbled on this. I opened a PR in fog-aws to configure round this, but it feels like something that might be better addressed here.

#690 added the ability to configure which errors were retried.

Excon sets the following for retry_errors

excon/lib/excon/constants.rb

Lines 19 to 23 in 85556ae

    
           DEFAULT_RETRY_ERRORS = [ 
        
             Excon::Error::Timeout, 
        
             Excon::Error::Socket, 
        
             Excon::Error::HTTPStatus 
        
           ]

Client, Success, Redirection, Informational and Server are all descendants of Excon::Error::HTTPStatus.

excon/lib/excon/error.rb

Lines 68 to 125 in 85556ae

    
           class HTTPStatus < Error 
        
             attr_reader :request, :response 
        
             def initialize(msg, request = nil, response = nil) 
        
               super(msg) 
        
               @request = request 
        
               @response = response 
        
             end 
        
           end 
        
           # HTTP Error classes 
        
           class Informational < HTTPStatus; end 
        
           class Success < HTTPStatus; end 
        
           class Redirection < HTTPStatus; end 
        
           class Client < HTTPStatus; end 
        
           class Server < HTTPStatus; end 
        
           class Continue < Informational; end                  # 100 
        
           class SwitchingProtocols < Informational; end        # 101 
        
           class OK < Success; end                              # 200 
        
           class Created < Success; end                         # 201 
        
           class Accepted < Success; end                        # 202 
        
           class NonAuthoritativeInformation < Success; end     # 203 
        
           class NoContent < Success; end                       # 204 
        
           class ResetContent < Success; end                    # 205 
        
           class PartialContent < Success; end                  # 206 
        
           class MultipleChoices < Redirection; end             # 300 
        
           class MovedPermanently < Redirection; end            # 301 
        
           class Found < Redirection; end                       # 302 
        
           class SeeOther < Redirection; end                    # 303 
        
           class NotModified < Redirection; end                 # 304 
        
           class UseProxy < Redirection; end                    # 305 
        
           class TemporaryRedirect < Redirection; end           # 307 
        
           class BadRequest < Client; end                       # 400 
        
           class Unauthorized < Client; end                     # 401 
        
           class PaymentRequired < Client; end                  # 402 
        
           class Forbidden < Client; end                        # 403 
        
           class NotFound < Client; end                         # 404 
        
           class MethodNotAllowed < Client; end                 # 405 
        
           class NotAcceptable < Client; end                    # 406 
        
           class ProxyAuthenticationRequired < Client; end      # 407 
        
           class RequestTimeout < Client; end                   # 408 
        
           class Conflict < Client; end                         # 409 
        
           class Gone < Client; end                             # 410 
        
           class LengthRequired < Client; end                   # 411 
        
           class PreconditionFailed < Client; end               # 412 
        
           class RequestEntityTooLarge < Client; end            # 413 
        
           class RequestURITooLong < Client; end                # 414 
        
           class UnsupportedMediaType < Client; end             # 415 
        
           class RequestedRangeNotSatisfiable < Client; end     # 416 
        
           class ExpectationFailed < Client; end                # 417 
        
           class UnprocessableEntity < Client; end              # 422 
        
           class TooManyRequests < Client; end                  # 429 
        
           class InternalServerError < Server; end              # 500 
        
           class NotImplemented < Server; end                   # 501 
        
           class BadGateway < Server; end                       # 502 
        
           class ServiceUnavailable < Server; end               # 503 
        
           class GatewayTimeout < Server; end                   # 504

Incomplete tree for some additional context:

StandardError
└──Excon::Error
   ├──Warning
   ├──Socket
   │  └──Certificate
   ├──InvalidHeaderKey
   ├──InvalidHeaderValue
   ├──Timeout
   ├──ResponseParse
   ├──ProxyConnectionError
   ├──ProxyParse
   ├──TooManyRedirects
   └──HttpStatus
      ├──Informational
      ├──Success
      │  ├──OK
      │  ├──Created
      │  ├──Accepted
      │  └──[...]
      ├──Redirection
      ├──Client
      │  ├──BadRequest
      │  ├──Unauthorized
      │  ├──PaymentRequired
      │  ├──Forbidden
      │  ├──NotFound
      │  ├──MethodNotAllowed
      │  ├──NotAcceptable
      │  ├──ProxyAuthenticationRequired
      │  ├──RequestTimeout
      │  ├──Conflict
      │  ├──Gone
      │  └──[...]
      └──Server
         ├──InternalServerError
         ├──BadGateway
         ├──ServiceUnavailable
         └──GatewayTimeout

In terms of descendants of HttpStatus I struggle to think of situations where retrying anything other than Excon::Error::Server and perhaps Excon::RequestTimeout and Excon::TooManyRequests could be useful. Blanketing the failures is certainly the conservative choice here though, there's surely scenarios I've just not had the imagination to think up.

When retry_interval is 0 (as is the case in Excon's default configuration), these choices tend to have minimal impact because repeat queries are typically so fast, but when the interval is increased waiting on retries for scenarios we have no reason to believe would change becomes a cost.

My suggestion is that we set

  DEFAULT_RETRY_ERRORS = [
    Excon::Error::Timeout,
    Excon::Error::Socket,
    Excon::Error::Server
  ]

Possibly also including Excon::RequestTimeout and Excon::TooManyRequests.

I'm happy to open a PR if there's agreement this is a reasonable approach.

The text was updated successfully, but these errors were encountered:

geemus · 2023-09-26T15:13:00Z

Hey, I mentioned this on one of the fog-aws issues, but I think the main potential use case that I can recall is to smooth eventual consistency. ie a NotFound is retried, with some delay, making it so the end user doesn't have to necessarily deal with that themselves. (or at least I believe that was my intention back then).

To your points though, there are a few problems with this.

Although this behavior might be helpful, the implementation is much broader than this and likely retries many things that would always fail, wasting time and effort. So even if we decided to keep this behavior, we would might want to narrow it to only apply to the not found case.
The not found/eventual consistency case is not actually that common in the wild I don't think. Some of how this is probably reflects that this was originally just a portion of fog, which initially only supported AWS (where eventual consistency was definitely something you have to regularly contend with). Most things are not aws though, and so it might warrant different defaults in the general case vs the aws case.
The appropriate intervals for the eventual consistency case and other cases may be quite different. ie it may make sense for no delay in some cases, but for eventual consistency and TooManyRequests something like exponential backoff might be more appropriate. Maybe that means these should be broken out or treated differently.

It's possible that the right approach might be to separate out the eventual consistency stuff and handle that specifically in fog-aws (and not here where it's rarely applicable). That could also give us more flexibility/leverage to make it configurable.

What do you think?

rahim · 2023-09-28T09:14:51Z

It's possible that the right approach might be to separate out the eventual consistency stuff and handle that specifically in fog-aws (and not here where it's rarely applicable). That could also give us more flexibility/leverage to make it configurable.

What do you think?

That sounds a promising path - I'd been starting to think along similar lines – that trying to cover off the eventual consistency case as the default behaviour in a HTTP library feels like overstretching.

Default retry of anything is somewhat surprising when compared with other HTTP libraries, to my knowledge, all of net-http/httparty/faraday/curl/python-requests/js-fetch/axios require some form of explicit opt-in to get any retry.

At the fog service level, where waiting on eventual consistency is something that might be desirable, it feels like something that would benefit from specific affordances that call out exactly what's going on.

A good comparison is the AWS Ruby SDK's waiters pattern:

ec2 = Aws::EC2::Client.new
ec2.wait_until(:instance_running, instance_ids:['i-12345678'])

geemus · 2023-09-29T13:11:51Z

@rahim yeah, in fog there is the somewhat similar Fog.wait_for though it isn't always used. I also agree that removing the client errors from retries sounds like a good fix in general (and even more so given our discussion in fog-aws about the state of everything).

I don't think there should be any retries by default though. The middleware is loaded by default, but I believe the actual retry is based on the truthiness of :idempotent, which defaults to false (see: https://github.com/excon/excon/blob/master/lib/excon/constants.rb#L151). So in that regard, I guess it's inline with other libraries, though I'm not sure they have a similar notion of idempotence and retries at all. Does that make sense?

rahim · 2023-09-29T16:13:25Z

I don't think there should be any retries by default though. The middleware is loaded by default, but I believe the actual retry is based on the truthiness of :idempotent, which defaults to false

Ah, thank you - that was a detail I'd missed.

When I was reviewing the various fog retry behaviours it sounds like I ought to have been grepping for where that idempotent flag got set too.

geemus · 2023-09-29T16:55:10Z

No worries, it's pretty easy to overlook I think.

github-actions · 2023-11-29T01:47:31Z

This issue has been marked inactive and will be closed if no further activity occurs.

rahim mentioned this issue Sep 26, 2023

Fog::AWS::Storage don't retry client errors fog/fog-aws#691

Merged

rahim mentioned this issue Sep 28, 2023

Fog::AWS::Storage default retry behaviour guarantees 6s delay for 4xx class responses fog/fog-aws#690

Closed

github-actions bot added the no-issue-activity label Nov 29, 2023

geemus added pinned and removed no-issue-activity labels Nov 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default retry behaviour includes HTTP 4xx client errors #833

Default retry behaviour includes HTTP 4xx client errors #833

rahim commented Sep 25, 2023 •

edited

geemus commented Sep 26, 2023

rahim commented Sep 28, 2023

geemus commented Sep 29, 2023

rahim commented Sep 29, 2023

geemus commented Sep 29, 2023

github-actions bot commented Nov 29, 2023

Default retry behaviour includes HTTP 4xx client errors #833

Default retry behaviour includes HTTP 4xx client errors #833

Comments

rahim commented Sep 25, 2023 • edited

geemus commented Sep 26, 2023

rahim commented Sep 28, 2023

geemus commented Sep 29, 2023

rahim commented Sep 29, 2023

geemus commented Sep 29, 2023

github-actions bot commented Nov 29, 2023

rahim commented Sep 25, 2023 •

edited