chore: cache calls to signing key #15337

Draft
wants to merge 1 commit into main

Conversation

miketheman (Member)

With this simple caching mechanism, each running instance should only have to make a single call at its first instantiation, and cache the result for the lifetime of the process.

This call rarely fails, and adds ~200ms to each inbound hook, so caching across requests should cut down the time it takes to complete the processing.

Instead of using a Redis cache and worrying about cache expiration strategies, if the cached key ever becomes invalid, a process restart will evict the in-memory cache and trigger a new HTTP call for the key.

Resolves #4463
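
For illustration, a minimal sketch of this kind of process-lifetime memoization, assuming a requests-based fetch; the function name and URL handling are hypothetical stand-ins, not the actual warehouse code:

# A minimal sketch of the caching idea, not the actual warehouse code.
from functools import lru_cache

import requests


@lru_cache(maxsize=None)
def fetch_signing_certificate(cert_url: str) -> bytes:
    # The first call in each worker process performs the ~200ms HTTP request;
    # later calls with the same URL return the cached bytes for the lifetime
    # of the process. A restart naturally evicts the cache.
    response = requests.get(cert_url, timeout=10)
    response.raise_for_status()
    return response.content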

Signed-off-by: Mike Fiedler <miketheman@gmail.com>
@miketheman miketheman requested a review from a team as a code owner February 6, 2024 21:00
ewdurbin (Member) left a comment

from a pure "ops" perspective i'm -1 on the assumed recovery scenario of restarting the process. while it is straightforward, how are we to recall that this could be the necessary action if it happened?

ewdurbin (Member) commented Feb 6, 2024

our gunicorn configuration simultaneously makes recovery a non-issue and reduces the impact of this cache:

max_requests = 2048
max_requests_jitter = 128

i'm now -0
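
For context, those settings behave roughly as sketched below in a gunicorn.conf.py; the comments are explanatory additions, not part of the warehouse configuration:

# gunicorn.conf.py (sketch using the values quoted above)
max_requests = 2048        # gracefully recycle a worker after it has served this many requests
max_requests_jitter = 128  # randomize each worker's limit so they don't all restart at the same time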

miketheman (Member, Author)

That is interesting, didn't know that. Does this mean we are recycling worker processes every few minutes now?
Since we don't set the workers value, we default to 1, so we run a single web worker per instance.

We have 40 deployed instances of web workers right now. With a request rate of ~500 requests per second (imprecise, but demonstrative) and each request taking ~150ms to complete (again, an estimation), they'd all exhaust their request budget and restart after ~2.5 minutes or so.
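
As a back-of-envelope check of that estimate (all inputs are the rough figures above, not measurements):

instances = 40          # deployed web worker instances, one gunicorn worker each
total_rps = 500         # approximate fleet-wide requests per second
max_requests = 2048     # gunicorn recycle threshold

per_instance_rps = total_rps / instances          # ~12.5 requests/second per worker
minutes_to_recycle = max_requests / per_instance_rps / 60
print(f"{minutes_to_recycle:.1f} minutes")        # ~2.7 minutes, roughly in line with the ~2.5 minute estimate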

Does that align with your understanding of our current universe? If it does, should we revisit the max_requests value from 2018?

I couldn't find a Redis-backed cache pattern in the codebase yet; I guess we don't do that very much, unless I'm looking in the wrong place?

ewdurbin (Member) commented Feb 7, 2024

Yes, this is effectively a diaper for memory leaks. Each worker gracefully reloads after 1920-2176 requests. This was put in place as our web processes slowly leaked memory over time, leading to non-graceful restarts.

Ideally we just don't leak memory! I'd reconsider the value if we had demonstrably stable memory usage without.

ewdurbin (Member) commented Feb 7, 2024

Note: We also tune worker count via the WEB_CONCURRENCY env var rather than hardcoding it, so we don't have to push code to tune it.

miketheman (Member, Author) commented Feb 7, 2024

> Ideally we just don't leak memory! I'd reconsider the value if we had demonstrably stable memory usage without.

Makes sense! Did you have a way to test/verify memory leakage/profiling prior to production changes in mind, or is this a "let's double max_requests and observe" kind of thing?

> Note: We also tune worker count via the WEB_CONCURRENCY env var rather than hardcoding it, so we don't have to push code to tune it.

WEB_CONCURRENCY was removed circa 2015 in https://github.com/pypi/warehouse/pull/741/files#diff-0a99231995da379e7aebabe76c9d849a23737a42c3b3a8994043e2aa80958424 and I can't find another tunable for gunicorn

ewdurbin (Member) commented Feb 7, 2024

WEB_CONCURRENCY is native to gunicorn: https://docs.gunicorn.org/en/stable/settings.html#workers
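
For reference, gunicorn derives the default for its workers setting from that environment variable, roughly as in this simplified sketch (not gunicorn's exact source):

import os

# gunicorn's `workers` setting defaults to WEB_CONCURRENCY when it is set,
# so worker count can be tuned per environment without a code push.
workers = int(os.environ.get("WEB_CONCURRENCY", 1))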

miketheman (Member, Author)

> WEB_CONCURRENCY is native to gunicorn: docs.gunicorn.org/en/stable/settings.html#workers

🤯 That'll learn me to read the docs more

ewdurbin (Member) commented Feb 7, 2024

> Makes sense! Did you have a way to test/verify memory leakage/profiling prior to production changes in mind, or is this a "let's double max_requests and observe" kind of thing?

Measured via Datadog, couldn't track it down. Gave up!

@miketheman miketheman added the "blocked" label (Issues we can't or shouldn't get to yet) Feb 7, 2024
@miketheman miketheman marked this pull request as draft February 7, 2024 17:10
ewdurbin (Member) commented Feb 8, 2024

FWIW the leak appears to still exist; here's a sample trace showing what I recall: a sudden spike in memory usage that remains until the worker is reaped.
[Screenshot: memory usage trace, 2024-02-08 9:58 AM]

Labels: blocked (Issues we can't or shouldn't get to yet)
Projects: None yet
Development: Successfully merging this pull request may close these issues:

Add resilience to the network call to fetch the SNS signing certificate

2 participants