Fog::AWS::Storage default retry behaviour guarantees 6s delay for 4xx class responses #690

Closed
rahim opened this issue Sep 25, 2023 · 11 comments · Fixed by #691

@rahim
Contributor

rahim commented Sep 25, 2023

The default retry configuration for Fog::AWS::Storage

@connection_options = options[:connection_options] || { retry_limit: 5, retry_interval: 1 }

combined with Excon's defaults, leads to a situation where a GET of an S3 object that may or may not exist takes roughly 6 seconds to answer "no, the object was not found".

That retry configuration was added in #674 to solve for a batch upload scenario.

This can be reproduced with something like:

require "fog/aws"
require "benchmark"

s3 = Fog::Storage.new(provider: "AWS")
bucket = s3.directories.new(key: 'some-bucket')
puts Benchmark.measure { bucket.files.get('some-non-existent-object') }

I have a script that runs this for various retry configurations; example results:

Fog/Excon defaults (retry_limit: 5, retry_interval: 1)

  0.227765   0.018269   0.246034 (  6.307515)


Using retry_limit: 5, retry_interval: 0.1

  0.242438   0.018477   0.260915 (  2.534994)


Using retry_limit: 1, retry_interval: 1

  0.046131   0.003609   0.049740 (  0.457492)


Using retry_errors: [Excon::Error::Timeout,Excon::Error::Socket,Excon::Error::Server]

  0.049035   0.003061   0.052096 (  0.422489)
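
For reference, a sketch of how these alternative configurations were passed (bucket and key names are placeholders; connection_options is forwarded by Fog::AWS::Storage straight to Excon):

require "fog/aws"
require "benchmark"

# Override the fog-aws retry defaults by supplying connection_options.
s3 = Fog::Storage.new(
  provider: "AWS",
  connection_options: { retry_limit: 1, retry_interval: 1 }
)
bucket = s3.directories.new(key: "some-bucket")
puts Benchmark.measure { bucket.files.get("some-non-existent-object") }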

I noticed this because we were using an old version of fog that predated the changes to retry configuration. We backported just that configuration and observed a significant performance regression on a couple of endpoints where the probability of a GET for an S3 object that didn't exist was high.

Excon sets the following for retry_errors https://github.com/excon/excon/blob/85556aeb4af10e94f876a8cbdb764f0377fa0d3a/lib/excon/constants.rb#L19-L23

  DEFAULT_RETRY_ERRORS = [
    Excon::Error::Timeout,
    Excon::Error::Socket,
    Excon::Error::HTTPStatus
  ]

Somewhat weirdly, the inheritance tree for successful responses has them as descendants of Excon::Error. https://github.com/excon/excon/blob/85556aeb4af10e94f876a8cbdb764f0377fa0d3a/lib/excon/error.rb#L68-L125

Incomplete tree for some additional context:

StandardError
└──Excon::Error
   ├──Warning
   ├──Socket
   │  └──Certificate
   ├──InvalidHeaderKey
   ├──InvalidHeaderValue
   ├──Timeout
   ├──ResponseParse
   ├──ProxyConnectionError
   ├──ProxyParse
   ├──TooManyRedirects
   └──HTTPStatus
      ├──Informational
      ├──Success
      │  ├──OK
      │  ├──Created
      │  ├──Accepted
      │  └──[...]
      ├──Redirection
      ├──Client
      │  ├──BadRequest
      │  ├──Unauthorized
      │  ├──PaymentRequired
      │  ├──Forbidden
      │  ├──NotFound
      │  ├──MethodNotAllowed
      │  ├──NotAcceptable
      │  ├──ProxyAuthenticationRequired
      │  ├──RequestTimeout
      │  ├──Conflict
      │  ├──Gone
      │  └──[...]
      └──Server
         ├──InternalServerError
         ├──BadGateway
         ├──ServiceUnavailable
         └──GatewayTimeout

It's the inclusion of Excon::Error::Client here that I question: almost all of the errors in this class look like problems that won't be helped by a retry (perhaps with the exceptions of RequestTimeout and TooManyRequests), and all of them will behave like NotFound, causing performance to regress from <20ms to >6000ms.

I'll open a PR here to patch the default configuration of retry_errors in this project, but it's certainly debatable whether the inherited defaults should instead be changed upstream in Excon itself.
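
In the meantime, callers can opt out of the client-error retries themselves by overriding retry_errors in connection_options. A sketch, mirroring the last configuration benchmarked above:

require "fog/aws"

# Restrict retries to transient failures only, so Excon::Error::Client
# subclasses such as NotFound are not retried.
s3 = Fog::Storage.new(
  provider: "AWS",
  connection_options: {
    retry_limit: 5,
    retry_interval: 1,
    retry_errors: [Excon::Error::Timeout, Excon::Error::Socket, Excon::Error::Server]
  }
)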

I do also wonder on reflection whether retry_limit: 5, retry_interval: 1 solved too narrowly for one use case - 1 second is an awfully long time in some contexts, particularly when response times for some S3 calls could be only a few milliseconds.

@geemus
Member

geemus commented Sep 26, 2023

Thanks for the detailed description. It's been some time, but if I recall correctly, this was intended to provide ease of use around S3's eventually consistent nature, i.e. for a use case like:

require "fog/aws"

s3 = Fog::Storage.new(provider: "AWS")
bucket = s3.directories.new(key: 'some-bucket')
file = bucket.files.create(key: 'some-new-object', body: 'some-new-value')
puts bucket.files.get('some-new-object')

The idea being that this helps avoid raising an error when you have reason to believe the object is already there. It doesn't really consider the case where the call is being used to check for existence, though, which as you suggest is made much worse by this behaviour.

In any event, I think 404 is really the only client error that we intended to catch/retry (for the eventual consistency case), and this may also be why the retries are somewhat long (as this tends to take somewhat longer to resolve).

It seems like we may want this to be more tunable and/or different for different calls, but I'm not entirely sure what the best approach is to maintain the existing behaviour while fixing this case. I wonder if maybe we should use HEAD for existence checks instead, which would give us a place to apply different settings, for instance. It also has the benefit of allowing existence checks to be faster/cheaper, even for rather large objects. What do you think? I'm happy to discuss, and will look at the PRs presently.
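
For illustration, a HEAD-based existence check might look like the sketch below (it assumes the model-level head helper returns nil for a missing object, as get does; bucket and key names are placeholders):

require "fog/aws"

s3 = Fog::Storage.new(provider: "AWS")
bucket = s3.directories.new(key: "some-bucket")

# HEAD only fetches the object's metadata, so the check doesn't download
# the body even for large objects.
exists = !bucket.files.head("maybe-existing-object").nil?
puts exists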

@rahim
Contributor Author

rahim commented Sep 26, 2023

I think this was intended to provide ease of use around the eventually consistent nature of S3

💡

That's not something that had occurred to me.

I'll have a think; like you say, it's not immediately clear what the best approach is.

@rajyan
Contributor

rajyan commented Sep 26, 2023

First of all, thank you for opening this issue.
I wasn't fully aware of this consequence of the change in #675, because our service mainly uploads batches of files and we never request keys that don't exist.

Regarding the problem @geemus mentions, AWS S3 now supports strong read-after-write consistency, and I believe the 404 scenario won't happen any more (I will test it later).

Related articles:
https://aws.amazon.com/jp/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
https://markdboyd.medium.com/how-to-handle-eventual-consistency-with-s3-5cfbe97d1f18
https://saturncloud.io/blog/understanding-the-amazon-s3-data-consistency-model/

In any event, I think 404 is really the only client error that we intended to catch/retry (for the eventual consistency case)

So I now vote for configuring the defaults to retry only on non-client errors.

@rajyan
Contributor

rajyan commented Sep 27, 2023

Regarding the HTTPStatus > Success hierarchy, I found an explanation here:
excon/excon#694

@geemus
Member

geemus commented Sep 27, 2023

Oh, interesting. I definitely missed the memo on strong consistency. That does seem to remove the need for client error retries (or at least I don't recall any other real reason for it).

@geemus
Member

geemus commented Sep 27, 2023

Although, that being said, I think other AWS services have historically had similar issues at times, i.e. I think EC2 wouldn't always return objects immediately after create. Does anyone know whether that is true or not? I wasn't able to find an easy reference with a web search.

@rajyan
Contributor

rajyan commented Sep 27, 2023

@geemus

Found this: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency
and it seems that the EC2 API follows an eventually consistent model.

I think EC2 wouldn't always return objects after create.

So this is still true for EC2.

Although, isn't the config in https://github.com/fog/fog-aws/pull/691/files only related to S3?

@geemus
Member

geemus commented Sep 27, 2023

@rajyan Great find on those docs, good point on the change only impacting S3, and thanks for the boto examples. Your references have been super helpful in thinking through this and figuring out a way forward, I really appreciate the help.

It seems like it's becoming clearer and clearer that this would be a good safe change, at least for S3 (and quite possibly for excon, though we might then want to add client errors for the eventual consistency stuff into ec2). I think in the EC2 case, even though there could be similar performance issues to what we've been discussing, the use cases are SO much different that it probably still makes sense.

I will try to circle back to this, hopefully tomorrow if not in the next couple days, and make a call about getting some of this in. It's just getting late in the day and I have a one month old in the house, so I want to make sure I can come to this a bit fresher with more focus/energy.

@rahim
Contributor Author

rahim commented Sep 28, 2023

@rajyan thank you for the links (both on S3 consistency and boto's behaviour) - both really useful.

boto's declarative approach is very nice; it's notable that:

  • for S3 it does not retry on 404
  • it has affordance for configuring delay, including support for exponential backoff
  • where retry applies, the base default includes exponential backoff and some jitter

It feels like exponential backoff support would be a nice addition to Excon, possibly as a good default when Excon users opt into retry behaviour.
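
For illustration, a boto-style exponential backoff with full jitter could look like the following (plain Ruby, not an existing Excon option; the base and cap values are arbitrary):

# Sleep before retry number `attempt` is a random value between
# 0 and min(cap, base * 2**attempt), so repeated failures back off
# while the jitter spreads out concurrent retries.
def backoff_interval(attempt, base: 0.1, cap: 5.0)
  rand * [cap, base * (2**attempt)].min
end

3.times { |attempt| puts backoff_interval(attempt).round(3) }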

I think EC2 wouldn't always return objects after create. Anyone know if that is true or not? I wasn't able to find an easy references with web search.

@geemus https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency was the most relevant thing I found; it certainly reads that way.

It's worth noting the default retry behaviour as it currently stands:

  • Excon:
    retry_limit: 4, retry_interval: 0, retry_errors: [Timeout, Socket, HTTPStatus]
    (retry depends on the Idempotent middleware being active; see the sketch after this list)
    => a very broad range of errors will be retried as fast as possible, up to four times. For many services this all happens so quickly as to have minimal impact on overall response time, but in any situation where the remedy is "wait a bit longer" it is likely to be minimally helpful.
  • fog-aws:
    mostly inherits the Excon defaults, with the notable exception of Fog::AWS::Storage, which sets retry_limit: 5, retry_interval: 1.
  • other fog cloud libraries:
    from some very basic grepping, it looks like these mostly reimplement retry manually in the few cases where someone has decided it matters for the service at hand. Otherwise they just inherit the Excon default behaviour.
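
To illustrate the Excon point above: a request is only retried when the Idempotent middleware is engaged for it, i.e. when :idempotent is set, and retry_limit / retry_interval / retry_errors then shape those retries. A minimal sketch (the URL is a placeholder):

require "excon"

connection = Excon.new("https://example.com")
# Without idempotent: true this request would not be retried at all.
connection.get(
  idempotent: true,
  retry_limit: 4,
  retry_interval: 0,
  retry_errors: [Excon::Error::Timeout, Excon::Error::Socket]
)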

It's just getting late in the day and I have a one month old in the house

Congratulations! 😄

That's far more important than this, I hope you get some sleep!

geemus pushed a commit that referenced this issue Sep 29, 2023
* Fog::AWS::Storage don't retry client errors

Resolves #690

* Fog::AWS::Storage merge connection options
@geemus
Member

geemus commented Sep 29, 2023

@rahim thanks for pulling together that overview of the current defaults.

I've merged the suggested changes, for S3 in particular at least. Thanks to everyone for contributing to the discussion and talking it through.
