Fog::AWS::Storage default retry behaviour guarantees 6s delay for 4xx class responses #690

Closed
rahim opened this issue Sep 25, 2023 · 11 comments · Fixed by #691

@rahim
Contributor

rahim commented Sep 25, 2023

The default retry configuration for Fog::AWS::Storage

@connection_options = options[:connection_options] || { retry_limit: 5, retry_interval: 1 }

combined with Excon's defaults, leads to a situation where a GET of an S3 object that may or may not exist takes roughly 6 seconds to answer "no, the object was not found".

That retry configuration was added in #674 to solve for a batch upload scenario.

This can be reproduced with something like:

require "fog/aws"
require "benchmark"

s3 = Fog::Storage.new(provider: "AWS")
bucket = s3.directories.new(key: 'some-bucket')
puts Benchmark.measure { bucket.files.get('some-non-existent-object') }

I have a script that runs this for various retry configurations; example results:

Fog/Excon defaults (retry_limit: 5, retry_interval: 1)

  0.227765   0.018269   0.246034 (  6.307515)


Using retry_limit: 5, retry_interval: 0.1

  0.242438   0.018477   0.260915 (  2.534994)


Using retry_limit: 1, retry_interval: 1

  0.046131   0.003609   0.049740 (  0.457492)


Using retry_errors: [Excon::Error::Timeout,Excon::Error::Socket,Excon::Error::Server]

  0.049035   0.003061   0.052096 (  0.422489)
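
For reference, a sketch of how these alternative configurations were passed (bucket and key names are placeholders; connection_options is forwarded by Fog::AWS::Storage straight to Excon):

require "fog/aws"
require "benchmark"

# Override the fog-aws retry defaults by supplying connection_options.
s3 = Fog::Storage.new(
  provider: "AWS",
  connection_options: { retry_limit: 1, retry_interval: 1 }
)
bucket = s3.directories.new(key: "some-bucket")
puts Benchmark.measure { bucket.files.get("some-non-existent-object") }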

I noticed this because we were using an old version of fog that predated the changes to retry configuration. We backported just that configuration and observed a significant performance regression on a couple of endpoints where the probability of a GET for an S3 object that didn't exist was high.

Excon sets the following for retry_errors https://github.com/excon/excon/blob/85556aeb4af10e94f876a8cbdb764f0377fa0d3a/lib/excon/constants.rb#L19-L23

  DEFAULT_RETRY_ERRORS = [
    Excon::Error::Timeout,
    Excon::Error::Socket,
    Excon::Error::HTTPStatus
  ]

Somewhat weirdly, the inheritance tree for successful responses has them as descendants of Excon::Error. https://github.com/excon/excon/blob/85556aeb4af10e94f876a8cbdb764f0377fa0d3a/lib/excon/error.rb#L68-L125

Incomplete tree for some additional context:

StandardError
└──Excon::Error
   ├──Warning
   ├──Socket
   │  └──Certificate
   ├──InvalidHeaderKey
   ├──InvalidHeaderValue
   ├──Timeout
   ├──ResponseParse
   ├──ProxyConnectionError
   ├──ProxyParse
   ├──TooManyRedirects
   └──HTTPStatus
      ├──Informational
      ├──Success
      │  ├──OK
      │  ├──Created
      │  ├──Accepted
      │  └──[...]
      ├──Redirection
      ├──Client
      │  ├──BadRequest
      │  ├──Unauthorized
      │  ├──PaymentRequired
      │  ├──Forbidden
      │  ├──NotFound
      │  ├──MethodNotAllowed
      │  ├──NotAcceptable
      │  ├──ProxyAuthenticationRequired
      │  ├──RequestTimeout
      │  ├──Conflict
      │  ├──Gone
      │  └──[...]
      └──Server
         ├──InternalServerError
         ├──BadGateway
         ├──ServiceUnavailable
         └──GatewayTimeout

It's the inclusion of Excon::Error::Client here that I question: almost all of the errors in this class look like problems that won't be helped by a retry (perhaps with the exceptions of RequestTimeout and TooManyRequests), and all of them will behave like NotFound, causing performance to regress from <20ms to >6000ms.

I'll open a PR here to patch the default configuration of retry_errors in this project, but it's certainly debatable whether the inherited defaults should instead be changed upstream in Excon itself.
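
In the meantime, callers can opt out of the client-error retries themselves by overriding retry_errors in connection_options. A sketch, mirroring the last configuration benchmarked above:

require "fog/aws"

# Restrict retries to transient failures only, so Excon::Error::Client
# subclasses such as NotFound are not retried.
s3 = Fog::Storage.new(
  provider: "AWS",
  connection_options: {
    retry_limit: 5,
    retry_interval: 1,
    retry_errors: [Excon::Error::Timeout, Excon::Error::Socket, Excon::Error::Server]
  }
)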

I do also wonder on reflection whether retry_limit: 5, retry_interval: 1 solved too narrowly for one use case - 1 second is an awfully long time in some contexts, particularly when response times for some S3 calls could be only a few milliseconds.

@geemus
Member

geemus commented Sep 26, 2023

Thanks for the detailed description. It's been some time, but if I recall correctly, this was intended to provide ease of use around S3's eventually consistent nature, i.e. for a use case like:

require "fog/aws"

s3 = Fog::Storage.new(provider: "AWS")
bucket = s3.directories.new(key: 'some-bucket')
file = bucket.files.create(key: 'some-new-object', body: 'some-new-value')
puts bucket.files.get('some-new-object')

The idea being that this helps avoid raising an error when you have reason to believe the object is already there. It doesn't really consider the case where the call is being used to check for existence, though, which as you suggest is made much worse by this behaviour.

In any event, I think 404 is really the only client error that we intended to catch/retry (for the eventual consistency case), and this may also be why the retries are somewhat long (as this tends to take somewhat longer to resolve).

It seems like we may want this to be more tunable and/or different for different calls, but I'm not entirely sure what the best approach is to maintain the existing behaviour while fixing this case. I wonder if maybe we should use HEAD for existence checks instead, which would give us a place to apply different settings, for instance. It also has the benefit of allowing existence checks to be faster/cheaper, even for rather large objects. What do you think? I'm happy to discuss, and will look at the PRs presently.
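
For illustration, a HEAD-based existence check might look like the sketch below (it assumes the model-level head helper returns nil for a missing object, as get does; bucket and key names are placeholders):

require "fog/aws"

s3 = Fog::Storage.new(provider: "AWS")
bucket = s3.directories.new(key: "some-bucket")

# HEAD only fetches the object's metadata, so the check doesn't download
# the body even for large objects.
exists = !bucket.files.head("maybe-existing-object").nil?
puts exists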

@rahim
Contributor Author

rahim commented Sep 26, 2023

I think this was intended to provide ease of use around the eventually consistent nature of S3

💡

That's not something that had occurred to me.

I'll have a think; like you say, it's not immediately clear what the best approach is.

@rajyan
Contributor

rajyan commented Sep 26, 2023

First of all, thank you for opening this issue.
I wasn't fully aware of this consequence of the change in #675, because our service mainly uploads batches of files and we never request keys that don't exist.

Regarding the problem @geemus mentions, AWS S3 now supports strong read-after-write consistency, and I believe the 404 scenario won't happen any more (I will test it later).

Related articles:
https://aws.amazon.com/jp/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
https://markdboyd.medium.com/how-to-handle-eventual-consistency-with-s3-5cfbe97d1f18
https://saturncloud.io/blog/understanding-the-amazon-s3-data-consistency-model/

In any event, I think 404 is really the only client error that we intended to catch/retry (for the eventual consistency case)

So I now vote for configuring the defaults to retry only on non-client errors.

@rajyan
Contributor

rajyan commented Sep 27, 2023

Regarding the HTTPStatus > Success hierarchy, I found an explanation here:
excon/excon#694

@geemus
Member

geemus commented Sep 27, 2023

Oh, interesting. I definitely missed the memo on strong consistency. That does seem to remove the need for client error retries (or at least I don't recall any other real reason for it).

@geemus
Member

geemus commented Sep 27, 2023

Although, that being said, I think other AWS services have historically had similar issues at times, i.e. I think EC2 wouldn't always return objects immediately after create. Does anyone know whether that is true or not? I wasn't able to find an easy reference with a web search.

@rajyan
Contributor

rajyan commented Sep 27, 2023

@geemus

Found this: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency
and it seems that the EC2 API follows an eventually consistent model.

I think EC2 wouldn't always return objects after create.

So this is still true for EC2.

Although, isn't the config in https://github.com/fog/fog-aws/pull/691/files only related to S3?

@geemus
Member

geemus commented Sep 27, 2023

@rajyan Great find on those docs, good point on the change only impacting S3, and thanks for the boto examples. Your references have been super helpful in thinking through this and figuring out a way forward, I really appreciate the help.

It seems like it's becoming clearer and clearer that this would be a good safe change, at least for S3 (and quite possibly for excon, though we might then want to add client errors for the eventual consistency stuff into ec2). I think in the EC2 case, even though there could be similar performance issues to what we've been discussing, the use cases are SO much different that it probably still makes sense.

I will try to circle back to this, hopefully tomorrow if not in the next couple days, and make a call about getting some of this in. It's just getting late in the day and I have a one month old in the house, so I want to make sure I can come to this a bit fresher with more focus/energy.

@rahim
Contributor Author

rahim commented Sep 28, 2023

@rajyan thank you for the links (both on S3 consistency and boto's behaviour) - both really useful.

boto's declarative approach is very nice; it's notable that:

  • for S3 it does not retry on 404
  • it has affordance for configuring delay, including support for exponential backoff
  • where retry applies, the base default includes exponential backoff and some jitter

It feels like exponential backoff support would be a nice addition to Excon, possibly as a good default when Excon users opt into retry behaviour.
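
For illustration, a boto-style exponential backoff with full jitter could look like the following (plain Ruby, not an existing Excon option; the base and cap values are arbitrary):

# Sleep before retry number `attempt` is a random value between
# 0 and min(cap, base * 2**attempt), so repeated failures back off
# while the jitter spreads out concurrent retries.
def backoff_interval(attempt, base: 0.1, cap: 5.0)
  rand * [cap, base * (2**attempt)].min
end

3.times { |attempt| puts backoff_interval(attempt).round(3) }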

I think EC2 wouldn't always return objects after create. Anyone know if that is true or not? I wasn't able to find an easy references with web search.

@geemus https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency was the most relevant thing I found; it certainly reads that way.

It's worth noting the default retry behaviour as it currently stands:

  • Excon:
    retry_limit: 4, retry_interval: 0, retry_errors: [Timeout, Socket, HTTPStatus]
    (retry depends on the Idempotent middleware being active; see the sketch after this list)
    => a very broad range of errors will be retried as fast as possible, up to four times. For many services this all happens so quickly as to have minimal impact on overall response time, but in any situation where the remedy is "wait a bit longer" it is likely to be minimally helpful.
  • fog-aws:
    mostly inherits the Excon defaults, with the notable exception of Fog::AWS::Storage, which sets retry_limit: 5, retry_interval: 1.
  • other fog cloud libraries:
    from some very basic grepping, it looks like these mostly reimplement retry manually in the few cases where someone has decided it matters for the service at hand. Otherwise they just inherit the Excon default behaviour.
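
To illustrate the Excon point above: a request is only retried when the Idempotent middleware is engaged for it, i.e. when :idempotent is set, and retry_limit / retry_interval / retry_errors then shape those retries. A minimal sketch (the URL is a placeholder):

require "excon"

connection = Excon.new("https://example.com")
# Without idempotent: true this request would not be retried at all.
connection.get(
  idempotent: true,
  retry_limit: 4,
  retry_interval: 0,
  retry_errors: [Excon::Error::Timeout, Excon::Error::Socket]
)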

It's just getting late in the day and I have a one month old in the house

Congratulations! 😄

That's far more important than this, I hope you get some sleep!

geemus pushed a commit that referenced this issue Sep 29, 2023
* Fog::AWS::Storage don't retry client errors

Resolves #690

* Fog::AWS::Storage merge connection options
@geemus
Member

geemus commented Sep 29, 2023

@rahim thanks for pulling together that overview of the current defaults.

I've merged the suggested changes, for S3 in particular at least. Thanks to everyone for contributing to the discussion and talking it through.
