Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid reloading root certificates to improve concurrent performance #6667

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

agubelu
Copy link

@agubelu agubelu commented Mar 20, 2024

Reproducing the problem

Let's consider the following script. It runs a bunch of concurrent requests against a URL, both with certificate verification enabled and disabled, and outputs the time it takes to do it in both cases.

from time import time
from threading import Thread
import requests
import urllib3

urllib3.disable_warnings()

def do_request(verify):
    requests.get('https://example.com', verify=verify)

def measure(verify):
    threads = [Thread(target=do_request, args=(verify,)) for _ in range(30)]

    start = time()
    for t in threads: t.start()
    for t in threads: t.join()
    end = time()

    print(end - start)

measure(verify=True)
measure(verify=False)

What's the time difference between the two? It turns out it is highly dependent on your local configuration. On my local machine, with a relatively modern config (Python 3.12 + OpenSSL 3.0.2), the times are ~1.2s for verify=True and ~0.5s for verify=False.

It's a >100% difference, but we initially blamed it on cert verification taking some time. However, we observed even larger differences (>500%) in some environments, and decided to find out what was going on.

Problem description

Our main use case for requests is running lots of requests concurrently, and we spent some time bisecting this oddity to see if there was room for a performance optimization.

The issue is a bit more clear if you profile the concurrent executions. When verifying certs, these are the top 3 function calls by time spent in them:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
30/1    0.681    0.023    0.002    0.002 {method 'load_verify_locations' of '_ssl._SSLContext' objects}
30/1    0.181    0.006    0.002    0.002 {method 'connect' of '_socket.socket' objects}
60/2    0.180    0.003    1.323    0.662 {method 'read' of '_ssl._SSLSocket' objects}

Conversely, this is how the top 3 looks like without cert verification:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
30/1    0.233    0.008    0.001    0.001 {method 'do_handshake' of '_ssl._SSLSocket' objects}
30/1    0.106    0.004    0.002    0.002 {method 'connect' of '_socket.socket' objects}
60/2    0.063    0.001    0.505    0.253 {method 'read' of '_ssl._SSLSocket' objects}

In the first case, a full 0.68 seconds are spent in the load_verify_locations() function of the ssl module, which configures a SSLContext object to use a set of CA certificates for validation. Inside it, there is a C FFI call to OpenSSL's SSL_CTX_load_verify_locations() which is known to be quite slow. This happens once per request (hence the 30 on the left).

We believe that, in some cases, there is even some blocking going on, either because each FFI call locks up the GIL or because of some thread safety mechanisms in OpenSSL itself. We also think that this is more or less pronounced depending on internal changes between OpenSSL's versions, hence the variability between environments.

When cert validation isn't needed, these calls are skipped which speeds up concurrent performance dramatically.

Submitted solution

It isn't possible to skip loading root CA certificates entirely, but it isn't necessary to do it on every request. More specifically, a call to load_verify_locations() happens when:

  • A new urllib3.connectionpool.HTTPSConnectionPool is created.

  • On connection, by urllib3's ssl_wrap_socket(), when the connection's ca_certs or ca_cert_dir attributes are set (see the relevant code).

The first case doesn't need to be addressed anymore after the latest addition of _get_connection(). Since it now passes down pool_kwargs, this allows urllib3 to use a cached pool with the same settings every time, instead of creating one per request.

The second one is addressed in this PR. If a verified connection is requested, _urllib3_request_context() already makes it so that a connection pool using a SSLContext with all relevant certificates loaded is always used. Hence, there is no need to trigger a call to load_verify_locations() again.

You can test against https://invalid.badssl.com to check that verify=True and verify=False still behave as expected and are now equally fast.

I'd like to mention that there have been a few changes in Requests since I started drafting this, and I'm not sure that setting conn.ca_certs or conn.ca_certs = cert_loc in cert_verify() is even still needed, since I think that the logic could be moved to _urllib3_request_context() and benefit from using a cached context in those cases too.

@mm-matthias
Copy link

We've been hit by the slowness of load_verify_locations as well. After turning request.get into session.get performances improves a lot. But we still face the performance hit, because:

  • it happens on the first connection (vs. module init time). This creates noise in our live profiles.
  • after connections in the PoolManager/HTTPConnectionPool time out, the SSLContext is recreated over and over
  • HTTPConnectionPools are keyed by host/scheme/... which means a new SSLContext is created for each pool
  • we use session.get(proxies={...}) which leads to even more pools and SSLContext initializations (this part might not yet be covered by this PR)

Is there anything I can do to advance this PR?

@sigmavirus24
Copy link
Contributor

Given this breaks the behavior of the module for a whole class of users as it's written today, there's not much to do to advance it.

Even still, I'm pretty sure SSLComtext is not itself thread safe but I need to find a reference for that so loading it at the module will likely cause issues

@mm-matthias
Copy link

Even still, I'm pretty sure SSLComtext is not itself thread safe but I need to find a reference for that so loading it at the module will likely cause issues

Here's your quote: python/cpython#95031 (comment)

SSLContext is thread and async-safe.

Can't speak for the reliability of that quote though. Given that a lot of time is spent in pthread_rwlock it doesn't sound completely unbelievable though.

@mm-matthias
Copy link

mm-matthias commented May 5, 2024

Here is a more reliable reference from the openssl guys themselves about SSL_CTX. The info is from 2017 though.

EDIT: Here are the official docs for openssl.

Unfortunately the python SSLContext docs don't mention thread safety at all.

Found this in the Python 3.13.0a5 release docs:

ssl.SSLContext.cert_store_stats and ssl.SSLContext.get_ca_certs now correctly lock access to the certificate store, when the ssl.SSLContext is shared across multiple threads.

Asked for official clarification on the thread-safety of SSLContext here.

@tiran
Copy link
Contributor

tiran commented May 5, 2024

SSLContext is designed to be shared and used for multiple connections. It is thread safe as long as you don't reconfigure it once it is used by a connection. Adding new certs to the internal trust store is fine, but changing ciphers, verification settings, or mTLS certs can lead to surprising behavior. The problem is unrelated to threads and can even occur in a single-threaded program.

If you don't trust me, then please trust David Benjamin's statement on threading. He is the main lead behind BoringSSL and did a lot of TLS stuff in Chrome browser.

@mm-matthias
Copy link

@tiran Thank you very much for your detailed commentary!

Given this breaks the behavior of the module for a whole class of users as it's written today, there's not much to do to advance it.

@sigmavirus24 What exactly are you unhappy about with the current solution and what would an acceptable solution for you look like? Would you prefer something like introducing a new "ssl_context" parameter somehow that could be set by the library user and be set to something like None for backwards compatibility? Or would you not want to change the request api in any way?

@tiran
Copy link
Contributor

tiran commented May 5, 2024

See #2118

I recommend that you either use urllib3 directly or switch to httpx. Most of the secret sauce of requests is in urllib3. httpx has HTTP/2 support.

@sigmavirus24
Copy link
Contributor

@tiran Thank you very much for your detailed commentary!

Given this breaks the behavior of the module for a whole class of users as it's written today, there's not much to do to advance it.

@sigmavirus24 What exactly are you unhappy about with the current solution and what would an acceptable solution for you look like? Would you prefer something like introducing a new "ssl_context" parameter somehow that could be set by the library user and be set to something like None for backwards compatibility? Or would you not want to change the request api in any way?

Requests supports being run (with certifi) inside a zip file created by a tool like pants, pyinstaller, etc. The code here removes that support in loading the trust stores.

@sigmavirus24
Copy link
Contributor

Ah, I see that the PR was updated and moved the extraction. I think the last blocker is the context being "public" in how it is named. That will encourage people to modify it in a way we don't want to be supporting. People will attempt to modify it anyway but when things go wrong, it will be clear that they weren't intended to modify it.

To be clear, I expect people to use this to work around libraries that aren't written correctly that leverage requests but do not allow people to specify a trust store or customize a session. Either way, if we rename the default I'm in favor of merging this. Unless @nateprewitt has objections.

(And yes, the caveats around changing ciphers or loading other trust stores is what I recalled being a problem that has odd behavior but even so, making this private will alleviate that likely spike in issues.)

@agubelu
Copy link
Author

agubelu commented May 7, 2024

Thanks for the follow-up @sigmavirus24. I renamed the default context as requested, please let me know if you'd like any further changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants