Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue due to SSLContext being recreated over and over #3171

Open
mm-matthias opened this issue May 5, 2024 · 1 comment
Open
Labels
bug This issue is a confirmed bug. p2 This is a standard priority issue

Comments

@mm-matthias
Copy link

mm-matthias commented May 5, 2024

Describe the bug

botocore request/response times are much slower than expected in certain scenarios due to unexpected SSL overhead.

Expected Behavior

botocore /boto3 performs fast for all requests.

Ideally the first request does not take longer than any following requests (e.g. the SSLContext could be initialized during module init or be triggered by the user of the library in some way before any requests happen).

Current Behavior

Request/responses are too slow, see below.

Reproduction Steps

botocore performance is much slower than expected in certain scenarios. This can easily be reproduced by

import boto3
import time

session = boto3.session.Session()
for i in range(10):
    start = time.time()
    session.resource("s3").Object("mybucket", "mykey").get()["Body"].read()
    print(time.time() - start)

which yields

0.4858591556549072
0.1862480640411377
0.20270681381225586
0.22403812408447266
0.19844722747802734
0.18699288368225098
0.17718195915222168
0.18413686752319336
0.1918637752532959
0.18160700798034668

This can be optimized by sharing the resource between iterations:

import boto3
import time

session = boto3.session.Session()
resource = session.resource("s3")
for i in range(10):
    start = time.time()
    resource.Object("mybucket", "mykey").get()["Body"].read()
    print(time.time() - start)

which yields

0.25886106491088867
0.05055093765258789
0.050694942474365234
0.05214500427246094
0.05020618438720703
0.05117917060852051
0.04602789878845215
0.048911094665527344
0.048056840896606445
0.04799509048461914

One can see that the initial request is always quite slow.

But the issue goes further. If there is some delay between requests all the requests get slow. This can be reproduced with:

import boto3
import time

session = boto3.session.Session()
resource = session.resource("s3")

for i in range(10):
    start = time.time()
    resource.Object("mybucket", "mykey").get()["Body"].read()
    print(time.time() - start)
    time.sleep(10)

This yields

0.22714900970458984
0.17821383476257324
0.21504521369934082
0.17998909950256348
0.2338550090789795
0.19853711128234863
0.23908185958862305

Switching to sleep(5) yields fast requests again (except for the first one).

The reason for the slowness is that

  • the http keep-alive timeout that botocore uses implicitly via urllib3 seems to be rather low (probably between 5 and 10 seconds) and there doesn't seem to be a straightforward way to increase it. This causes an increased number of connection creations and ssl handshakes the as well as the issue mentioned in the next point.
    • enabling debug logging for urllib3 suggests that is_connection_dropped returns True for the connections that are issue after a sleep(10). So presumably something (the s3 server?) closed the connection. In this case there might not be a lot of things that botocore could do, except for the next point about making SSL connection creation faster.
  • initializing SSL connections has become a lot slower in recent python versions, see OpenSSL 3.0 performance issue: SSLContext.set_default_verify_paths / load_verify_locations about 5x slower python/cpython#95031 . Other major libraries like requests are also affected, see Avoid reloading root certificates to improve concurrent performance psf/requests#6667 .
    The slow requests are slow, because load_verify_locations is called once for each of them. This happens through the urllib3 PoolManager/HTTPConnectionPool and the recreation of SSLContext.
    You can verify this by running any of the above reproducers in a profiler and look at the number of calls to/time spent in load_verify_locations.
  • The problem is even exacerbated when using a proxy as this leads to even more SSL calls. Adding different hosts to this makes the problem even worse as this leads to more HTTPConnectionPools being created.

Possible Solution

  • The solution for the http keep-alive timeout is to make this configurable somehow so that more requests can benefit from connection pooling.
  • The solution to avoid the overhead from load_verify_locations is probably to share the SSLContext across multiple invocations/threads. The requests issue I've mentioned above has a proposed workaround.

Additional Information/Context

No response

SDK version used

1.34.98

Environment details (OS name and version, etc.)

MacOS & Linux
Python 3.12.3

@mm-matthias mm-matthias added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels May 5, 2024
@tim-finnigan tim-finnigan self-assigned this May 14, 2024
@tim-finnigan tim-finnigan added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label May 14, 2024
@tim-finnigan
Copy link
Contributor

Thanks for reporting this issue — after discussing with the team we decided that we should continue tracking this for further review and investigation.

@tim-finnigan tim-finnigan added p2 This is a standard priority issue and removed investigating This issue is being investigated and/or work is in progress to resolve the issue. needs-triage This issue or PR still needs to be triaged. labels May 14, 2024
@tim-finnigan tim-finnigan removed their assignment May 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a confirmed bug. p2 This is a standard priority issue
Projects
None yet
Development

No branches or pull requests

2 participants