Add resilience to the network call to fetch the SNS signing certificate #4463

Open
dstufft opened this issue Aug 4, 2018 · 1 comment · May be fixed by #15337
@dstufft
Member

dstufft commented Aug 4, 2018

Whenever we're verifying an SNS message, we have to fetch the public certificate from an HTTP URL provided to us by Amazon. If fetching this fails for any reason, we error out and rely on SNS retrying the request to get the message recorded correctly.

We can do better!

There are two possible strategies I can think of here, and the right answer might be to use one or the other, or both.

  • Cache the public key.
    • The HTTP response at the URL does not indicate that it can be cached; however, on the AWS forums AWS has indicated that if/when they change the certificate they will use a different URL. That means one option here is to just cache the signing certificate for a long time. This could either be a simple in-memory cache (in which case we will refetch it any time we restart the process) or use Redis to store the cached signing certificate, so that the cache survives restarts, is shared amongst processes, etc.
    • This cache should expire somehow, probably via some sort of LRU that keeps some number of keys but will evict older ones when needed.
  • Add retries.
    • Whenever we get an error, simply try fetching it again! This will make the HTTP request take longer, and it's possible that whatever network error is affecting us will last longer than we're willing to have a single request take, so it doesn't eliminate the problem, but it makes us survive momentary blips better.
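
As a rough illustration of the retry strategy, here is a minimal sketch, not Warehouse's actual code. The names `fetch_with_retries`, `fetch`, `attempts`, `base_delay` and `sleep` are all hypothetical. It assumes network failures surface as `OSError` subclasses, which is the case for `ConnectionError`, `TimeoutError`, `urllib.error.URLError` and `requests.RequestException`. The backoff schedule is also an assumption.

```python
import time
from typing import Callable


def fetch_with_retries(
    fetch: Callable[[str], bytes],
    url: str,
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), trying at most `attempts` times in total.

    `fetch` is a hypothetical callable that downloads the certificate
    and raises an OSError subclass on network failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            # Out of attempts: re-raise so the request still fails and
            # SNS redelivery remains the final fallback.
            if attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")
```

Capping the total number of attempts keeps the worst-case request time bounded, which is the trade-off noted in the bullet above.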

My opinion is I'd start with caching, ideally with a Redis-based cache, and see where that leaves us. It will likely make the failures infrequent enough to not be worth worrying about, and will make verifying the signature faster as well.
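
The caching idea could look roughly like the following in-memory variant. This is a sketch under stated assumptions, not the project's implementation: `CertificateCache`, `fetch` and `max_size` are hypothetical names. Entries are keyed by the signing certificate URL, relying on the statement above that AWS uses a new URL when the certificate changes. A Redis-backed version would replace the `OrderedDict` with a shared store.

```python
from collections import OrderedDict
from typing import Callable


class CertificateCache:
    """Small in-memory LRU cache of certificates, keyed by signing URL."""

    def __init__(self, fetch: Callable[[str], bytes], max_size: int = 8) -> None:
        # `fetch` is a hypothetical callable that downloads a certificate.
        self._fetch = fetch
        self._max_size = max_size
        self._certs = OrderedDict()  # maps URL -> certificate bytes

    def get(self, url: str) -> bytes:
        if url in self._certs:
            # Cache hit: mark this URL as the most recently used.
            self._certs.move_to_end(url)
            return self._certs[url]
        # Cache miss: fetch errors propagate, so failures are never cached.
        cert = self._fetch(url)
        self._certs[url] = cert
        if len(self._certs) > self._max_size:
            # Evict the least recently used entry.
            self._certs.popitem(last=False)
        return cert
```

Because this cache lives in process memory, each worker process fetches the certificate once and refetches it after a restart.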

@dstufft
Member Author

dstufft commented Aug 10, 2018

With retries and #4526 this is largely done. I'm going to leave this open because I believe that adding caching here would still be a good step.

miketheman added a commit to miketheman/warehouse that referenced this issue Feb 6, 2024
With this simple caching mechanism, each running instance should only
have to make a single call at their first instantiation, and cache the
result for the lifetime of the process.

This call rarely fails, and adds ~200ms to each inbound hook, so
caching across requests should cut down the time it takes to complete
the processing.

Instead of using a Redis cache and worrying about cache expiration
strategies, if this ever fails a restart should evict the in-memory
cache and trigger a new HTTP call for the key.

Resolves pypi#4463

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
@miketheman miketheman linked a pull request Feb 6, 2024 that will close this issue
@miketheman miketheman self-assigned this Feb 6, 2024