Fog::AWS::Storage default retry behaviour guarantees 6s delay for 4xx class responses #690
Comments
Thanks for the detailed description. It's been some time, but if I recall correctly, I think this was intended to provide ease of use around the eventually consistent nature of S3, i.e. for a use case like:
The idea being that this helps avoid raising an error if you have reason to believe the object is already there. This doesn't really consider the case where the call is being used to check for existence, though, which as you suggest is made much worse by the retries. In any event, I think 404 is really the only client error that we intended to catch/retry (for the eventual consistency case), and this may also be why the retries are somewhat long (as this tends to take somewhat longer to resolve). It seems like we may want this to be more tunable and/or different for different calls, but I'm not entirely sure what the best approach is to maintain existing behavior while also fixing this case. I wonder if maybe we should use HEAD for existence checks instead, which would give us a place to apply different settings, for instance. It also has the benefit of allowing existence checks to be faster/cheaper, even for rather large objects. What do you think? I'm happy to discuss, and will look at the PRs presently.
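For illustration, an existence check built on HEAD might look something like this (a sketch only: the bucket, key, and credentials are placeholders, and the rescued class assumes Excon's 404 error surfaces from fog-aws's `head_object` request):

```ruby
storage = Fog::AWS::Storage.new(
  aws_access_key_id:     "...",
  aws_secret_access_key: "..."
)

# HEAD transfers headers only, so the check stays cheap even for large
# objects, and gives a single call site where retry settings can be tuned.
def object_exists?(storage, bucket, key)
  storage.head_object(bucket, key)
  true
rescue Excon::Error::NotFound
  false
end
```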
💡 That's not something that had occurred to me. I'll have a think; like you say, it's not immediately clear what the best approach is.
First of all, thank you for opening this issue. Regarding the problems that @geemus mentions: AWS S3 now supports strong consistency, and I believe the 404 scenario won't happen any more (I will test it later). Related articles:
So, I now vote for configuring the defaults to retry only on non-client errors.
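Concretely, that default could be expressed as a status-code predicate (a hypothetical helper for illustration, not fog-aws or Excon API):

```ruby
# Retry only server-side (5xx) failures; fail fast on client (4xx)
# errors such as 404, which a retry cannot fix once S3 reads are
# strongly consistent.
def retryable_status?(status)
  (500..599).cover?(status)
end

retryable_status?(503) # => true
retryable_status?(404) # => false
```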
For the hierarchy of HTTPStatus > Success, I found an explanation here.
Oh, interesting. I definitely missed the memo on strong consistency. That does seem to remove the need for client error retries (or at least I don't recall any other real reason for it). |
Although, that being said, I think other AWS services historically had similar issues at times - i.e. I think EC2 wouldn't always return objects immediately after create. Anyone know if that is true or not? I wasn't able to find an easy reference with web search.
Found this https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency
So this is still true for EC2. Although, isn't this https://github.com/fog/fog-aws/pull/691/files config only related to S3?
For additional information, boto retries on general errors and some specific S3 client errors:
@rajyan Great find on those docs, good point about the change only impacting S3, and thanks for the boto examples. Your references have been super helpful in thinking through this and figuring out a way forward - I really appreciate the help. It seems like it's becoming clearer and clearer that this would be a good, safe change, at least for S3 (and quite possibly for excon, though we might then want to add client errors for the eventual-consistency stuff into ec2). I think in the EC2 case, even though there could be similar performance issues to what we've been discussing, the use cases are SO much different that it probably still makes sense. I will try to circle back to this, hopefully tomorrow if not in the next couple of days, and make a call about getting some of this in. It's just getting late in the day and I have a one-month-old in the house, so I want to make sure I can come to this a bit fresher with more focus/energy.
@rajyan thank you for the links (both on S3 consistency and boto's behaviour) - both really useful. boto's declarative approach is very nice; it's notable that:
It feels like exponential backoff support would be a nice addition to Excon,
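For illustration, the core of such a policy is tiny (`backoff_delay` is a hypothetical helper sketching the idea, not an existing Excon API):

```ruby
# Exponential backoff: the delay doubles with each attempt, capped at
# max_delay, instead of sleeping a fixed retry_interval every time.
def backoff_delay(attempt, base: 0.1, max_delay: 5.0)
  [base * (2**(attempt - 1)), max_delay].min
end

# Attempt 1 waits 0.1s, attempt 2 waits 0.2s, and so on up to the cap,
# so early retries stay cheap while persistent failures back off.
(1..6).map { |n| backoff_delay(n) }
```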
@geemus https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency was the most relevant thing I found, it certainly reads that way. It's worth noting the default retry behaviour as it currently stands:
Congratulations! 😄 That's far more important than this - I hope you get some sleep!
* Fog::AWS::Storage don't retry client errors - Resolves #690
* Fog::AWS::Storage merge connection options
@rahim thanks for pulling together that overview of the current defaults. I've merged the suggested changes, for S3 in particular at least. Thanks to everyone for contributing to the discussion and talking through it.
The default retry configuration for `Fog::AWS::Storage` (fog-aws/lib/fog/aws/storage.rb, line 549 in b073ba8), combined with Excon's defaults, leads to a situation where a GET of an S3 object that may or may not exist will take 6 seconds to answer "no - the object was not found".
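As a rough model of where those 6 seconds come from - assuming `retry_limit` counts total attempts and the client sleeps `retry_interval` seconds between attempts (a simplification for arithmetic, not Excon's actual code):

```ruby
# Worst-case latency for a request that 404s on every attempt:
# each attempt pays a round trip, plus a sleep between attempts.
def worst_case_latency(retry_limit:, retry_interval:, round_trip:)
  retry_limit * round_trip + (retry_limit - 1) * retry_interval
end

# With fog-aws's retry_limit: 5, retry_interval: 1 and a ~400ms S3
# round trip, a GET for a missing key costs about 6 seconds.
worst_case_latency(retry_limit: 5, retry_interval: 1, round_trip: 0.4)
# => 6.0
```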
That retry configuration was added in #674 solving for a batch upload scenario.
This can be reproduced with something like:
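A hypothetical version of such a reproduction (the bucket name and credentials are placeholders, and the rescued error class is an assumption about which Excon error surfaces from `get_object`):

```ruby
require "fog/aws"

storage = Fog::AWS::Storage.new(
  aws_access_key_id:     "...",
  aws_secret_access_key: "..."
)

# Time a GET for a key that does not exist; with the default retry
# configuration this takes on the order of 6 seconds to raise.
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
begin
  storage.get_object("some-bucket", "no-such-key")
rescue Excon::Error::NotFound
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  puts "404 after #{elapsed.round(2)}s"
end
```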
I have a script that does that for various retry configurations, example results:
I noticed this because we were using an old version of fog that predated the changes to retry configuration. We backported just that configuration and observed a significant performance regression on a couple of endpoints where the probability of a GET to an S3 object that didn't exist was high.
Excon sets the following for `retry_errors`:

https://github.com/excon/excon/blob/85556aeb4af10e94f876a8cbdb764f0377fa0d3a/lib/excon/constants.rb#L19-L23

Somewhat weirdly, the inheritance tree for successful responses has them as descendants of `Excon::Error`:

https://github.com/excon/excon/blob/85556aeb4af10e94f876a8cbdb764f0377fa0d3a/lib/excon/error.rb#L68-L125

Incomplete tree for some additional context:
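A partial reconstruction of that tree, abridged from the linked error.rb (the linked source is authoritative and may differ in detail):

```
Excon::Error
├── Excon::Error::Timeout
├── Excon::Error::Socket
└── Excon::Error::HTTPStatus
    ├── Excon::Error::Informational   # 1xx
    ├── Excon::Error::Success         # 2xx (OK, Created, ...)
    ├── Excon::Error::Redirection     # 3xx
    ├── Excon::Error::Client          # 4xx (BadRequest, NotFound,
    │                                 #      RequestTimeout, TooManyRequests, ...)
    └── Excon::Error::Server          # 5xx (InternalServerError, ...)
```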
It's the inclusion of `Excon::Error::Client` here that I question: almost all of this class of error seem like problems that won't be helped by a retry, perhaps with the exception of `RequestTimeout` and `TooManyRequests`, and all of these will behave similarly to `NotFound`, causing performance to regress from <20ms to >6000ms.

I'll open up a PR here to patch the default configuration of `retry_errors` in this project, but it's certainly debatable whether the defaults being inherited are something that should be changed upstream in Excon itself.

I do also wonder on reflection whether `retry_limit: 5, retry_interval: 1` solved too narrowly for one use case - 1 second is an awfully long time in some contexts, particularly when response times for some S3 calls could be only a few milliseconds.