
Excessive memory usage on multithreading #1670

Open
jbvsmo opened this issue Aug 21, 2018 · 34 comments
Labels
needs-review p2 This is a standard priority issue

Comments

@jbvsmo

jbvsmo commented Aug 21, 2018

I have been trying to debug a "memory leak" in my newly upgraded boto3 application. I am moving from the original boto 2.49.

My application starts a pool of 100 threads, and every request is queued and redirected to one of these threads. Typical memory usage for the lifetime of the application was about 1GB, with peaks of 1.5GB depending on the operation.

After the upgrade I added one boto3.Session per thread, and I access multiple resources and clients from this session, which are reused throughout the code. In the previous code I had one boto connection of each kind per thread (I use several services like S3, DynamoDB, SES, SQS, MTurk, SimpleDB), so it is pretty much the same thing.

Except that each boto3.Session alone increases memory usage immensely, and now my application is running at 3GB of memory instead.

How do I know it is the boto3 Session, you ask? I created 2 demo experiments with the same 100 threads, and the only difference between them is that one uses boto3 and the other doesn't.

Program 1: https://pastebin.com/Urkh3TDU
Program 2: https://pastebin.com/eDWPcS8C (Same thing with 5 lines regarding boto commented out)

Output program 1 (each print happens 5 seconds after the last one):

Process Memory: 39.4 MB
Process Memory: 261.7 MB
Process Memory: 518.7 MB
Process Memory: 788.2 MB
Process Memory: 944.5 MB
Process Memory: 940.1 MB
Process Memory: 944.4 MB
Process Memory: 948.7 MB
Process Memory: 959.1 MB
Process Memory: 957.4 MB
Process Memory: 958.0 MB
Process Memory: 959.5 MB

Now with plain multiple threads and no AWS access.
Output program 2 (each print happens 5 seconds after the last one):

Process Memory: 23.5 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB

The boto3 Session object alone is retaining about 10MB per thread, for a total of about 1GB. This is not acceptable for an object that should not be doing much more than sending requests to the AWS servers. It means that the Session is keeping lots of unwanted information.

You could be wondering whether it is actually the resource that is keeping memory alive. If you move the resource creation to inside the for loop, the program will still hit 1GB in exactly the same 15 to 20 seconds of existence.

In the beginning I tried garbage collecting for cyclic references, but it was futile. The decrease in memory was only a couple of megabytes.

I've seen people on the botocore project complaining about something similar (maybe not!), so it might be a shared issue.
boto/botocore#805

@jbvsmo
Author

jbvsmo commented Aug 21, 2018

I forgot to mention that I added cyclic garbage collection to the 5-second loop that displays the memory. If this is removed, the memory increases even more (and it doesn't seem to stop), which means someone is also leaking circular references.

Now I noticed something even worse. If I create a new session inside the loop, the memory usage is even higher, even with the garbage collection in place.

The program I linked is simple enough, and yet the memory issues are so visible that I'm wondering whether no one saw it before or whether this is related to some recent boto version.

boto3: 1.7.71
botocore: 1.10.71

Program output https://pastebin.com/Nm4dWPKJ :

Process Memory: 23.6 MB
Process Memory: 234.5 MB
Process Memory: 470.6 MB
Process Memory: 719.7 MB
Process Memory: 994.3 MB
Process Memory: 1144.7 MB
Process Memory: 1129.9 MB
Process Memory: 1160.5 MB
Process Memory: 1222.5 MB
Process Memory: 1200.5 MB
Process Memory: 1176.4 MB
Process Memory: 1173.8 MB
Process Memory: 1200.2 MB
Process Memory: 1342.9 MB
Process Memory: 1341.3 MB

@jbvsmo
Author

jbvsmo commented Aug 21, 2018

Some more investigation (sorry for so much noise):

  • There are probably two issues here:
    1. Memory leak on any Python version and any boto version when Sessions are created inside a loop
    2. Very high memory usage on Python 2.7 and high on 3.7 (but acceptable)

I was initially only testing Python 2.7.15, but now that I also ran the program on Python 3.7.0 the memory usage is about half (500MB) with or without cyclic garbage collection, which is great.

On Python 3, the leak still happens if I create the session within the for loop on every thread! It's just that the increase in memory is slower this time.

I decided to test older boto3 versions (from boto3 1.0 to 1.7) with Python 2.7, and they all show the leaking pattern when the session is created inside a loop, BUT on boto3 1.5 and lower memory usage is 100 MB lower, and on boto3 1.2 and lower memory takes 2 minutes to reach that value instead of 20 seconds.

I noticed that if I explicitly do del s3, the memory goes down to 200 to 300MB total, which is super crazy. No Python code should need to call del, since reference counting ought to be taking care of this, but apparently it isn't!!

I cannot do this in my code since I need to reuse resources, and I'm starting to run out of options...

@joguSD joguSD added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Aug 23, 2018
@joguSD
Contributor

joguSD commented Aug 23, 2018

At first I thought this might be related to boto/botocore#1248, which is the only confirmed leak I know of.

However, looking into this, it seems to me that this is related to the client/resource object. That being said, this isn't a memory leak. The reason you're seeing the ramp-up in memory is that each time you create a session/client we have to go to disk to load the JSON models representing the service, etc. There's so much contention on a single file that it takes ~20-30 seconds to even instantiate all 100 sessions/clients, and considering each session has its own cache, I'm actually not all that surprised by these memory usage numbers.

I would suggest doing something like this:

def run_pool(size):
    ts = []
    session = boto3.Session()
    for x in range(size):
        s3 = session.resource('s3')
        t = Boto3Thread(s3)
        t.start()
        ts.append(t)
    return ts

This way you only instantiate one session, and can actually leverage the caching that the session provides to instantiate all 100 resource objects to give to each thread.
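
A minimal sketch of what the thread class might then look like (the original class lives in the linked pastebin, so the body below is illustrative only; it just shows the resource being injected instead of created per thread):

import threading
import time

class Boto3Thread(threading.Thread):
    daemon = True
    is_running = True

    def __init__(self, s3):
        super(Boto3Thread, self).__init__()
        self.s3 = s3  # resource built once from the shared session in run_pool()

    def run(self):
        while self.is_running:
            # placeholder work: all per-thread S3 calls go through the injected resource
            self.s3.meta.client.list_buckets()
            time.sleep(5)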

@joguSD joguSD added closing-soon This issue will automatically close in 4 days unless further comments are made. and removed investigating This issue is being investigated and/or work is in progress to resolve the issue. labels Aug 23, 2018
@jbvsmo
Author

jbvsmo commented Aug 23, 2018

@joguSD Sorry, but that doesn't explain why memory is not being released even with cyclic garbage collection in place.

And the strangest part is that if I do del s3 at the end of the for loop, a really good chunk of memory goes away. No Python code is really expected to have to "free" resources.

The code you provided is very similar to what I ended up using in my application, but besides the 100 threads with the same lifespan as the program, I occasionally run other things in parallel on other threads, and those add to the total memory that is never freed!

After a few days, my 8GB server is out of memory again. This is, at the very least, poor memory management on boto3's part. The only solution I am seeing is to revert 2 weeks of porting work and go back to boto 2.49.

@no-response no-response bot removed the closing-soon This issue will automatically close in 4 days unless further comments are made. label Aug 23, 2018
@kapilt

kapilt commented Sep 7, 2018

We also started experiencing this. Here's a quick output of py-spy. Note the excessive thread count (and corresponding memory usage). This code was stable for years and across dozens of boto3/botocore versions. There's clearly something buggy in the transition to urllib3. The app code here is using the s3 transfer service to upload a few files.

It's also worth noting that this app isn't using threads where this triggers; the thread usage is entirely from the s3transfer library. Python 2.7.15, version freeze below.

GIL: 0.00%, Active: 100.00%, Threads: 711

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)                                                                                                         
100.00% 100.00%   10.44s    10.44s   ssl_wrap_socket (urllib3/util/ssl_.py:336)
  0.00% 100.00%   0.000s    10.44s   _make_request (urllib3/connectionpool.py:343)
  0.00% 100.00%   0.000s    10.44s   _execute_main (s3transfer/tasks.py:150)
  0.00% 100.00%   0.000s    10.44s   _send (botocore/endpoint.py:215)
  0.00% 100.00%   0.000s    10.44s   __bootstrap (threading.py:774)
  0.00% 100.00%   0.000s    10.44s   send (botocore/httpsession.py:242)
  0.00% 100.00%   0.000s    10.44s   _make_api_call (botocore/client.py:599)
  0.00% 100.00%   0.000s    10.44s   __bootstrap_inner (threading.py:801)
  0.00% 100.00%   0.000s    10.44s   run (concurrent/futures/thread.py:63)
  0.00% 100.00%   0.000s    10.44s   urlopen (urllib3/connectionpool.py:600)
  0.00% 100.00%   0.000s    10.44s   __call__ (s3transfer/tasks.py:126)
  0.00% 100.00%   0.000s    10.44s   _main (s3transfer/upload.py:692)
  0.00% 100.00%   0.000s    10.44s   make_request (botocore/endpoint.py:102)
  0.00% 100.00%   0.000s    10.44s   _send_request (botocore/endpoint.py:146)
  0.00% 100.00%   0.000s    10.44s   _get_response (botocore/endpoint.py:173)
  0.00% 100.00%   0.000s    10.44s   _worker (concurrent/futures/thread.py:75)
  0.00% 100.00%   0.000s    10.44s   run (threading.py:754)
  0.00% 100.00%   0.000s    10.44s   _validate_conn (urllib3/connectionpool.py:849)
  0.00% 100.00%   0.000s    10.44s   connect (urllib3/connection.py:356)
  0.00% 100.00%   0.000s    10.44s   _api_call (botocore/client.py:314)
# pip freeze
argcomplete==1.9.4
boto3==1.8.7
botocore==1.11.7
certifi==2018.8.24
chardet==3.0.4
click==6.7
decorator==4.3.0
docutils==0.14
functools32==3.2.3.post2
futures==3.2.0
idna==2.7
jmespath==0.9.3
jsonpatch==1.23
jsonpointer==2.0
jsonschema==2.6.0
python-dateutil==2.7.3
PyYAML==3.13
requests==2.19.1
s3transfer==0.1.13
simplejson==3.16.0
six==1.11.0
tabulate==0.8.2
urllib3==1.23
virtualenv==16.0.0
websocket-client==0.52.0

@joguSD
Contributor

joguSD commented Sep 7, 2018

@kapilt Considering the original issue was raised before the urllib3 changes were released, I'm not sure whether what you're experiencing is related. In my original analysis of this issue, whether an API call was actually carried out made no difference; it had everything to do with instantiating 100 different sessions.

@kapilt

kapilt commented Sep 8, 2018

@joguSD That's fair; re-reading, it's not entirely clear it's of the same ilk. I'll file a separate issue after some more analysis, a differential against the urllib3 change, and checking the s3transfer parameters to not use threads. FWIW, we do create a bunch of sessions as well, but all are out of scope here and free to be GC'd.

@yangkang55

@joguSD Same problem here! I was using boto3 to upload about 30,000 small files,
then I used multiprocessing to fork about 30 pools, and the memory increased from 1GB to 6GB immediately.

@maybeshewill

maybeshewill commented Nov 1, 2018

Confirmed: just the simple creation of a boto3.Session in threads/async handlers leads to extensive memory usage that is not freed at all (gc.collect() doesn't help either).

@kapilt

kapilt commented Nov 1, 2018

FWIW, at least for my app, switching s3transfer to not use threads resolved a lot of issues with respect to memory.
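
In case it helps others, here is a minimal sketch of that setting using the standard managed-transfer API (bucket, key, and file names are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3', region_name='us-east-1')
config = TransferConfig(use_threads=False)  # keep the transfer in the calling thread

# Upload without s3transfer spawning its own worker threads.
s3.upload_file('local_file.bin', 'my-bucket', 'remote/key.bin', Config=config)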

@wdiu

wdiu commented Jan 22, 2019

Hi, we also hit the same problem. The memory keeps increasing and doesn't get released. I tried patching some of the AWS code (including the caching decorators, so that they don't cache), manually clearing the loader cache, and adjusting the model loaders not to load the documentation. I also noticed that the session has a register function but unregister isn't called, so I kept track of the registered objects and called unregister too (not sure if that makes a difference). That seemed to bring down the memory, at the expense of caching, but I didn't notice any speed difference. Any feedback or ideas from the AWS team about this?

@Gloix

Gloix commented Jan 23, 2019

I'm also experiencing this issue with the S3 boto client. Reading bucket objects keeps memory usage fairly stable, but writing them with put_object() results in growing RAM usage.

@antonbarua

We have noticed this problem too. We are using this in the backend of a Flask web application. By nature, the web application is multithreaded, so we cannot instantiate just one session globally in the app.
@joguSD I have noticed a few things, correct me if I am wrong:

  1. boto3/botocore loads entire JSON files (service-2.json, resources-1.json, paginators-1.json, endpoints.json, _retry.json, etc.) into memory. Although these files are lazily loaded, is loading the entire file necessary? For example, when a ServiceModel is created for EC2, that file is 23,000+ lines long. If I only want to call one or two APIs on EC2, I don't need a ServiceModel that contains all of EC2's APIs and shapes. Is it possible to create a service model for just the APIs/objects I am interested in? For example, when I create the EC2 client, I could pass in the API names I am interested in as a list.
  2. The JSON files contain documentation. Whenever these files are cached in memory, the documentation strings are also cached, which is unnecessary if I don't use them. The same applies when client classes are dynamically created and function docstrings are assigned to them. It is effectively creating the source code dynamically in memory, but the docstrings are not needed just to use the methods. I think these should be optional too.

@joguSD
Contributor

joguSD commented Feb 13, 2019

@antonbarua I suppose something like that might be possible, but it might not be all that practical. Stripping the model down isn't as simple as just keeping the operations you want to use; you'd have to figure out which shapes are needed and which are orphaned, and then remove them.

The documentation is there for tools built on top of botocore, like the AWS CLI, but from the pure SDK perspective I can see why you wouldn't want it. If you were really inclined, you could do a tree-shake of sorts on the model, stripping it down to what you need and placing it in ~/.aws/models to be used instead.
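
If anyone wants to try that, a rough sketch of where such a trimmed model could live, assuming botocore's loader also searches ~/.aws/models using the same <service>/<api-version>/ layout as its bundled data (paths and file names are illustrative):

import os
import shutil

# Assumption: a service-2.json under ~/.aws/models/<service>/<api-version>/ is
# picked up by botocore's loader in place of the bundled model.
trimmed_model = 'trimmed-s3-service-2.json'  # produced by your own tree-shaking step
target_dir = os.path.expanduser('~/.aws/models/s3/2006-03-01')

os.makedirs(target_dir, exist_ok=True)
shutil.copy(trimmed_model, os.path.join(target_dir, 'service-2.json'))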

@wdiu

wdiu commented Feb 13, 2019

Hi @joguSD, if trimming the model to keep only the desired operations isn't practical, what do you think about having an option that disables the cache and calls unregister, as described in #1670 (comment)?
The initial memory consumption may stay the same, but at least it won't keep growing (i.e. it stops the memory leak).

@johnyoonh

johnyoonh commented Apr 13, 2019

Is there any workaround while the fix is on the way :( ?
Edit: gc.collect() alleviates the problem a bit. Thanks!

@kaochiuan

Hi @Gloix ,

Initializing the S3 client as a static (class-level) variable can fix the memory leak situation.

import threading
import boto3
import os
import base64
import time
import random
import psutil

BUCKET = '' # <--- YOUR BUCKET NAME HERE

MIN_WAIT = 1
MAX_WAIT = 20


class Boto3Thread(threading.Thread):
    daemon = True
    is_running = True
    __s3_client = boto3.client('s3', region_name='us-east-1')  # class-level: one client created once and shared by all threads

    def run(self):
        path = 'test_boto/'
        while self.is_running:
            file_name = path + 'file_' + str(random.randrange(100000))
            content = base64.b64encode(os.urandom(100000)).decode()

            self.__s3_client.put_object(
                Bucket=BUCKET,
                Key=file_name,
                Body=content,
                ContentType='text/plain'
            )
            if not self.is_running:
                # Avoid a useless sleep cycle
                break

            sleep_duration = random.randrange(MIN_WAIT, MAX_WAIT)
            #print('{} will sleep for {} seconds'.format(self.name, sleep_duration))
            time.sleep(sleep_duration)

def check_memory():
    import gc
    gc.collect()
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024. / 1024.

def run_pool(size):
    ts = []
    for x in range(size):
        t = Boto3Thread()
        t.start()
        ts.append(t)
    return ts

def stop_pool(ts):
    for t in ts:
        t.is_running = False
    for t in ts:
        t.join()

def main():
    ts = run_pool(100)
    try:
        while True:
            print('Process Memory: {:.1f} MB'.format(check_memory()))
            time.sleep(5)
    except KeyboardInterrupt:
        pass
    finally:
        print('Wait for all threads to finish. Should take about {} seconds!'.format(MAX_WAIT))
        stop_pool(ts)

main()

@yjhouzz

yjhouzz commented Aug 22, 2019

Sorry I'm late to the party, but @joguSD may I ask about the suggestion you made (quoted below)?

I would suggest doing something like this:

def run_pool(size):
    ts = []
    session = boto3.Session()
    for x in range(size):
        s3 = session.resource('s3')
        t = Boto3Thread(s3)
        t.start()
        ts.append(t)
    return ts

This way you only instantiate one session, and can actually leverage the caching that the session provides to instantiate all 100 resource objects to give to each thread.

I'm asking because according to https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html#multithreading-multiprocessing it is not recommended for multiple threads to share a session. So, if I have ten threads making separate S3 requests, should they share a session or not?

@irgeek

irgeek commented Aug 26, 2019

@yjhouzz The documentation you linked to states (emphasis mine):

It is recommended to create a resource instance for each thread / process in a multithreaded or multiprocess application

The resource in the code snippet is not shared, just the session is.

@longbowrocks

longbowrocks commented Oct 16, 2019

@irgeek read further:

In the example above, each thread would have its own Boto 3 session and its own instance of the S3 resource.

Read issue boto/botocore#1246 for more info.
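
For anyone following along, the pattern that guide describes is roughly this sketch (bucket name and payload are placeholders):

import threading
import boto3

def worker(bucket, key, body):
    # Per the boto3 resources guide: give each thread its own Session
    # and its own resource instance; neither is shared across threads.
    session = boto3.session.Session()
    s3 = session.resource('s3')
    s3.Object(bucket, key).put(Body=body)

threads = [
    threading.Thread(target=worker, args=('my-bucket', 'key-%d' % i, b'payload'))
    for i in range(10)
]
for t in threads:
    t.start()
for t in threads:
    t.join()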

@lucj

lucj commented May 30, 2020

I've started to use boto3 in a Flask application and got the 'cannot allocate memory' error. Is there any update on this issue and some best practices for using boto3 with Flask?

@bsikander

bsikander commented Jun 3, 2020

Any solution to this? My code is very simple, but I have memory leaks.
Even with max_concurrency set to 1, I have memory leaks. The default is 10, btw. Any help?

I am trying to download a 50GB file.

session = boto3.session.Session(
    aws_access_key_id='abc',
    aws_secret_access_key='def',
)
conn = session.resource("s3")
conn.Bucket('mybucket').download_file(
        Filename=download_path + key.split("/")[-1],
        Key=key,
        Callback=print_status,
        Config=boto3.s3.transfer.TransferConfig(
            max_concurrency=1,
            multipart_chunksize=CHUNK_SIZE,
            io_chunksize=CHUNK_SIZE
        )
    )

@bktan81

bktan81 commented Jun 29, 2020

I was having the same issue (Flask + boto3 + AWS Elastic Beanstalk), and it crashed the server multiple times due to out-of-memory errors. I tried gc.collect() and other methods, and none of them seemed to work.

Eventually I figured out that I have to run the function (that uses boto3) separately in a different process (a separate Python script), so that when the sub-process terminates it also frees the memory.

import subprocess

cmd_params = ['python3',F'{os.getcwd()}/run_task.py', 'config.json', 'param1', 'param2']
p = subprocess.Popen(cmd_params, stdout=subprocess.PIPE)
out = p.stdout.read()
output = out.decode("utf-8")

The method is not elegant and it's just a workaround, but it works.

@pler

pler commented Jul 10, 2020

I do observe the same issue in a slightly different context when downloading larger files (10GB+) in Docker containers with a hard limit on memory, with a single boto3 session and no multithreaded invocation of Object.download_file (the code is very similar to #1670 (comment)).

In some cases I can also observe the same error as mentioned in #1670 (comment):

1594222584953   File "/opt/amazon/lib/python3.6/site-packages/s3transfer/utils.py", line 364, in write
1594222584953     self._fileobj.write(data)
1594222584953 OSError: [Errno 12] Cannot allocate memory

It seems that disabling threading in boto3.s3.transfer.TransferConfig (use_threads=False) helps to some extent, but the occasional OSError still pops up.

From what I have observed so far, the most reliable mitigation for me was to reduce the multipart chunk size (multipart_chunksize, e.g. to 1MB).
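
A minimal sketch of that mitigation, combining use_threads=False with a smaller multipart_chunksize (bucket, key, and paths are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    use_threads=False,                # keep s3transfer out of its own thread pool
    multipart_chunksize=1024 * 1024,  # 1MB parts, as described above
)

s3 = boto3.session.Session().resource('s3')
s3.Object('my-bucket', 'big/file.bin').download_file('/tmp/file.bin', Config=config)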

@cschloer

Has anyone found a workaround for an application like Flask where one session cannot be instantiated globally?

@longbowrocks

@cschloer Cache and reuse sessions. A thread-local cache is fine.
That way you won't create way too many sessions.

@jbvsmo
Author

jbvsmo commented Jul 17, 2020

@cschloer @longbowrocks
I created this issue 2 years ago and the situation is unchanged since. My solution at the time, which is running today on hundreds of servers I have deployed, is exactly that: a local cache that I add to the current thread object.

Below is the code I use (slightly edited) to replace the boto3 resource and client functions. It is thread safe, does not need to explicitly create sessions, and your code doesn't need to be aware it is inside a separate thread. You might need to do some cleanup to avoid open-file warnings when terminating threads.

There are limitations to this and I offer no guarantees. Use with caution.

import json
import hashlib
import time
import threading
import boto3.session

DEFAULT_REGION = 'us-east-1'
KEY = None
SECRET = None


class AWSConnection(object):
    def __init__(self, function, name, **kw):
        assert function in ('resource', 'client')
        self._function = function
        self._name = name
        self._params = kw

        if not self._params:
            self._identifier = self._name
        else:
            self._identifier = self._name + hash_dict(self._params)

    def get_connection(self):
        thread = threading.currentThread()

        if not hasattr(thread, '_aws_metadata_'):
            thread._aws_metadata_ = {
                'age': time.time(),
                'session': boto3.session.Session(),
                'resource': {},
                'client': {}
            }

        try:
            connection = thread._aws_metadata_[self._function][self._identifier]
        except KeyError:
            connection = create_connection_object(
                self._function, self._name, session=thread._aws_metadata_['session'], **self._params
            )
            thread._aws_metadata_[self._function][self._identifier] = connection

        return connection

    def __repr__(self):
        return 'AWS {0._function} <{0._name}> {0._params}'.format(self)

    def __getattr__(self, item):
        connection = self.get_connection()
        return getattr(connection, item)


def create_connection_object(function, name, session=None, region=None, **kw):
    assert function in ('resource', 'client')
    if session is None:
        session = boto3.session.Session()

    if region is None:
        region = DEFAULT_REGION

    key, secret = KEY, SECRET

    # Do not set these variables unless they were configured on parameters file
    # If they are not present, boto3 will try to load them from other means
    if key and secret:
        kw['aws_access_key_id'] = key
        kw['aws_secret_access_key'] = secret

    return getattr(session, function)(name, region_name=region, **kw)


def hash_dict(dictionary):
    """ This function will hash a dictionary based on JSON encoding, so changes in
        list order do matter and will affect result.
        Also this is a hex output, so it is not size optimized
    """
    json_string = json.dumps(dictionary, sort_keys=True, indent=None)
    return hashlib.sha1(json_string.encode('utf-8')).hexdigest()


def resource(name, **kw):
    return AWSConnection('resource', name, **kw)


def client(name, **kw):
    return AWSConnection('client', name, **kw)
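
A hypothetical usage sketch of the helpers above (bucket name and queue URL are placeholders):

s3 = resource('s3')   # created once at import time, shared across the codebase
sqs = client('sqs')

def handler():
    # Each thread transparently gets its own session/resource via __getattr__.
    s3.Bucket('my-bucket').upload_file('local.bin', 'remote.bin')
    sqs.send_message(QueueUrl='https://queue.url', MessageBody='hello')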

@cschloer

Really appreciate the (very) quick and thorough response @jbvsmo

Your solution mostly worked for me. I combined it with simply reducing the number of processes in my uWSGI config; I think I was expecting too much from my tiny (1GB memory) server, so I reduced the number of processes from 10 to 5.

@kdaily kdaily added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Jul 19, 2021
@kdaily kdaily removed the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Aug 27, 2021
almost added a commit to almost/django-s3-storage that referenced this issue Nov 26, 2021
Creating a new session each time S3Storage is instantiated creates a memory leak. It seems that S3Storage can get created many times (I'm seeing it created again and again as my app runs), and a boto3 Session takes loads of memory (see boto/boto3#1670), so my app eventually runs out of memory. This should fix the issue while still avoiding using the same session across different threads.
@orShap

orShap commented Dec 29, 2021

This is totally crazy. The S3 client session drains all our memory resources.

@foolishhugo

We ran into this problem today. The memory leak was crashing our servers.

@JanKuipersIRL

Same here. Using S3 in a FastAPI service drains all memory, eventually crashing the service.

@JanRobroeks

Is there any news on this issue? We ran into the same problem, which results in crashing servers. Without the boto3 Session our servers consume < 200 MB, while with the boto3 Session memory accumulates to over 16GB until the servers crash.

I'll try this workaround for now:
(#1670 (comment))

@shughes-uk

shughes-uk commented Oct 7, 2023

Is there any pressing reason each client needs its own copy of these massive JSON blobs? Our situation is that we are handling many, many different sets of credentials. Caching the session or client object gives a tiny benefit at best, and reusing each client/session with different credentials seems like it would be a nightmare of race conditions and potential security issues.

Creating clients/sessions is brutal on memory, and also incredibly slow (a full second).

If the JSON blob were simply loaded once and all clients/sessions in every thread referred to it, that seems like it would solve our problem. Are these blobs being mutated by sessions/clients in some way? They seem to be AWS service maps, so I'm assuming not.

--- edit ---

I performed a hacky test by wrapping botocore.loaders.create_loader in functools.lru_cache. I believe this means each client/session will share a global Loader, and thus a global botocore.loaders.instance_cache, preventing excessive loading.

This cut create_client from upwards of 500ms to consistently sub-100ms, with memory remaining stable. At our scale this is a big deal!

No doubt my hack introduces race conditions between threads (please nobody replicate and deploy this!), but it serves as a proof of concept that this would be a valuable improvement for some.
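
For reference, a rough sketch of that hack, assuming create_loader is the only place loaders are built and that the name may also need patching where it has been imported (per the caveat above, do not deploy this):

import functools

import botocore.loaders
import botocore.session

# Memoize create_loader so sessions created with the same data path share one
# Loader (and its instance_cache). Proof of concept only: the shared Loader is
# not known to be thread-safe, and these internals can change between versions.
_cached_create_loader = functools.lru_cache(maxsize=None)(botocore.loaders.create_loader)
botocore.loaders.create_loader = _cached_create_loader
# botocore.session imports create_loader by name, so patch that reference too
# (assumption about the installed botocore version).
botocore.session.create_loader = _cached_create_loader

import boto3

client = boto3.client('s3', region_name='us-east-1')  # should now reuse cached model data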

@sparrowt

sparrowt commented Nov 28, 2023

Yes, it would be much better if the data loaded from these JSON files were either shared across all sessions and clients in a thread-safe way, or possibly even just replaced by Python modules containing the same data (importlib gives thread safety for free).

Stripping documentation to save memory

In the meantime, I have obtained a modest saving (up to ~2 MB of memory per client, depending on the AWS service) by recursively blanking the values of the keys "documentation" and "documentationUrl" throughout all service-2.json files under site-packages/botocore/data after pip install during the build step. I tried removing the keys entirely, but that caused issues, so instead I just replace the value with "".

For example, this reduces the S3 service definition (site-packages/botocore/data/s3/2006-03-01/service-2.json) from 833 KB down to 298 KB (or 193 KB if you json dump it with indent=None, though this won't affect the memory usage of the loaded dict), so that is at least a 64% reduction 🥳

Here's the basic idea: use glob to find all the service-2.json files and then for each one:

  1. load the json in the same way that botocore/loaders.py does in case some of the specifics are important
  2. then pass the obj to this recursive blanking function with keys_to_blank=["documentation", "documentationUrl"]
def blank_values_by_key_recurse(obj, keys_to_blank: list[str]):
    if isinstance(obj, dict):
        for key in list(obj.keys()):
            if key in keys_to_blank:
                obj[key] = ''
            else:
                blank_values_by_key_recurse(obj[key], keys_to_blank)
    elif isinstance(obj, list):
        for item in obj:
            blank_values_by_key_recurse(item, keys_to_blank)
    # else: do nothing, it's a leaf value
  3. then write out the modified obj, replacing the original JSON in site-packages, e.g. like this:
with open(path, 'w', encoding='utf8') as fp:
    # ensure_ascii=False gives a closer match to what botocore ships: unicode characters present rather than \uNNNN
    json.dump(dict_obj, fp, indent=indent, separators=(',', ':'), ensure_ascii=False)

Stripping out unused endpoints/partitions

I also tried the more drastic step of stripping down endpoints.json by entirely deleting services that I didn't use from within "services", which in my case massively reduced the memory usage of every session (from ~6 MB down to almost nothing, so small I couldn't even see it in an Austin profiler memory allocation trace). However, it also seemed to cause unexplained memory spikes which I was not able to understand or resolve, so I've sadly had to park that for now.

It did seem to be fine, however, to remove the partitions that I wasn't interested in (everything apart from the standard "aws"), which reduces endpoints.json by ~15%. That is still worthwhile given that it is loaded in every session (i.e. once per thread, even in the best case where you cache sessions per thread and reuse them).

Of course the best route would be for the boto team to engage on this issue and consider proper fixes like those proposed at the start of this comment. EDIT: I've filed boto/botocore#3078 with specifics and proposed improvements.
