Identical data is loaded into every session wasting memory #3078

Open
sparrowt opened this issue Nov 28, 2023 · 3 comments
Labels
bug This issue is a confirmed bug. p3 This is a minor priority issue

Comments


sparrowt commented Nov 28, 2023

Describe the bug

Each botocore Session creates its own Loader instance, in which JSON content loaded from botocore/data/ is cached via @instance_cache on methods such as load_service_model and load_data_with_path.

This caching covers many things, including loading endpoints.json into an EndpointResolver, which happens in every session and results in roughly 6 MB of memory allocation (to hold the HTTP endpoint details for every region/partition of all 300+ AWS services).

The JSON files shipped with botocore presumably do not change on disk at runtime. Nevertheless, if you create several sessions within a process - e.g. in a multi-threaded app, because sessions are not thread safe - this exact same data is loaded into memory multiple times and cached separately in each Session's Loader and its EndpointResolver.

It therefore seems like a bug (of the wasteful-memory-usage variety) that the immutable JSON cache is per-session rather than per-process. In a multi-threaded app in a resource-constrained environment, every 6 MB really adds up.

Expected Behavior

When creating a 2nd (and any subsequent) Session, the data which has already been loaded from endpoints.json should be re-used, quickly and without unnecessary extra memory allocation.

Current Behavior

Instead, each new session loads the whole thing again, resulting in another ~6 MB of memory usage each time, stored in a new EndpointResolver (and Loader) belonging to the new Session. (The same issue exists for other JSON data such as service definitions, but I'm focussing on the most common and most impactful example that I observed.)

Reproduction Steps

import boto3.session

# Note: in real usage the below would be in separate threads (otherwise we could just re-use the Session)
# but the threading code is omitted from this example for brevity and because it does not affect the repro

# In one thread
session = boto3.session.Session(region_name='us-east-1')  # +6mb
client = session.client('s3')  # (+5mb as it happens)
# do stuff with client

# In another thread
session = boto3.session.Session(region_name='us-east-1')  # +6mb again = REPRO: this shouldn't need to load endpoints.json again
client = session.client('someotherservice')
# do stuff with client

Possible Solution

One solution would be to make the Loader process-wide with suitable locking on state as necessary. I imagine the small extra overhead is more than paid for by the memory savings if many sessions/clients are created.
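
Purely to illustrate what I mean (make_session is a made-up helper, and it assumes a warm Loader cache can safely be shared across threads, which is exactly the open question):

import boto3.session
import botocore.session
from botocore.loaders import Loader

# One Loader for the whole process: @instance_cache on its load_* methods means each
# JSON file is parsed once, and every session registered with it shares the cached dicts.
_shared_loader = Loader()
# Warm the cache before any threads exist so later access is effectively read-only.
_shared_loader.load_data('endpoints')

def make_session(region_name='us-east-1'):
    botocore_sess = botocore.session.get_session()
    botocore_sess.register_component('data_loader', _shared_loader)
    return boto3.session.Session(region_name=region_name, botocore_session=botocore_sess)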

A more radical alternative would be for the pre-processing step that generates botocore/data/ to emit, instead of each JSON file, a python module (.py file) containing a dict with the same data. Loader would then not have to parse JSON at all; it would just lazily import the python modules it needs, and python's importlib gives you process-wide sharing and thread safety for free. Having seen the existence of things like CUSTOMER_DATA_PATH (~/.aws/models/), I imagine this would be a much more difficult change and may not be feasible - but I've included it, if nothing else, for hypothetical comparison and to illustrate the principle of the problem.
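
Again, just to illustrate the principle rather than propose an implementation (json_to_module and load_generated are made-up helpers):

import importlib
import json
import pathlib

def json_to_module(json_path, module_path):
    # Build-time step: turn a JSON data file into an importable python module
    # whose top-level DATA dict holds the same content.
    data = json.loads(pathlib.Path(json_path).read_text())
    pathlib.Path(module_path).write_text('DATA = ' + repr(data) + '\n')

def load_generated(module_name):
    # Run-time step: importlib caches the module in sys.modules, so the dict is
    # built once per process and shared (thread safely) by every session.
    return importlib.import_module(module_name).DATA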

Additional Information/Context

boto/boto3#1670 is very related - this ticket is an attempt at a detailed description of why each session increases memory usage so much and how this might be avoided.


SDK version used

botocore==1.33.1 boto3==1.33.1

Environment details (OS name and version, etc.)

Ubuntu on WSL (Windows 10); the same happens on Amazon Linux 2

@sparrowt sparrowt added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels Nov 28, 2023
tim-finnigan (Contributor) commented

Hi @sparrowt, thanks for reaching out. Have you tried sharing a single loader instance across several sessions? For example:

from botocore.loaders import Loader

loader = Loader()
sessions = some_func_that_makes_multiple_sessions()
for session in sessions:
    session.register_component('data_loader', loader)

Another option is using a single session to create multiple clients which get passed to the other threads:

session = boto3.session.Session(region_name='us-east-1')
client1 = session.client('s3')
client2 = session.client('someotherservice')

# In one thread
client1.do_something()

# In another thread
client2.do_something()

The endpoints.json file itself is relatively small and only a small fraction of what’s causing the memory usage. The suggestions you described could involve extensive refactoring and I can't guarantee that those changes would be considered. I think it would help to have memory profile reports here that highlight the current memory usage you're seeing and how it compares with the approaches provided above.
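
For example, a rough sketch with tracemalloc (the helper names here are just for illustration, not an endorsement of either approach):

import tracemalloc

import boto3.session
import botocore.session
from botocore.loaders import Loader

def measure(make_session, n=5):
    # Memory retained after creating n sessions and an s3 client for each.
    tracemalloc.start()
    sessions = [make_session() for _ in range(n)]
    clients = [s.client('s3') for s in sessions]
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return f'current={current / 2**20:.1f} MiB peak={peak / 2**20:.1f} MiB'

def per_session_loader():
    return boto3.session.Session(region_name='us-east-1')

_shared_loader = Loader()

def shared_loader():
    bc = botocore.session.get_session()
    bc.register_component('data_loader', _shared_loader)
    return boto3.session.Session(region_name='us-east-1', botocore_session=bc)

print('per-session loaders:', measure(per_session_loader))
print('shared loader:      ', measure(shared_loader))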

@tim-finnigan tim-finnigan added response-requested Waiting on additional info and feedback. p3 This is a minor priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Dec 8, 2023

sparrowt commented Dec 11, 2023

Thanks so much for getting back to me @tim-finnigan. I have not tried that; I assumed Loader was not thread safe (otherwise why would each session need its own?). Before I do, could you clarify a couple of things:

  1. is Loader thread safe?
  2. if so, is there any reason not to make this the default behaviour? (i.e. all sessions using the same 'data_loader' component, I guess unless they specify a non-default 'data_path')

To respond to some of your other points:

Another option is using a single session to create multiple clients which get passed to the other threads

Sadly this is not really an option in my case: the app in question is a multi-threaded web server, and it is not possible to predict in advance which boto3 clients a given thread might need. Because Session is not thread safe, each thread has to create its own session in order to create the client(s) it needs. I am already caching that session using threading.local() so that subsequent client creations in the same thread don't need to create another session.
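
Roughly like this (a simplified sketch of the pattern, not my exact code):

import threading

import boto3.session

_thread_local = threading.local()

def get_thread_session(region_name='us-east-1'):
    # Lazily create one Session per thread and reuse it for every client that
    # thread needs, since Sessions are not thread safe but can be reused within a thread.
    if not hasattr(_thread_local, 'session'):
        _thread_local.session = boto3.session.Session(region_name=region_name)
    return _thread_local.session

# e.g. inside a request handler thread:
# s3 = get_thread_session().client('s3')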

The endpoints.json file itself is relatively small and only a small fraction of what’s causing the memory usage.

It is 781 KB on disk (only surpassed by a handful of the service definitions), but loading it into memory in python results in nearly 6 MB of memory allocation according to analysis with the Austin profiler. For example, in the memory allocation profile trace below, where I did session = boto3.session.Session(region_name='us-east-1') and then client = session.client('s3'), you can see 5.86 MB allocated within create_default_resolver where it loads endpoints.json, which is quite a large fraction of the total memory allocated. The other major parts (to the right in the trace) are smaller and s3-specific: _load_service_model (4.71 MB) and _load_service_endpoints_ruleset (2.05 MB):
[Austin profiler memory allocation trace]

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. label Dec 19, 2023

jimdigriz commented Jan 27, 2024

Have you tried sharing a single loader instance across several sessions? For example:
[snipped]

I found that each call to boto3.session.Session() (not using the default session) eats ~200 ms of wall clock time, which led me to this issue after I noticed all the stat/JSON parsing in my profiling; so for me it's not a RAM problem per se but a really expensive setup time.

The example was not exactly clear to me but did point me in the right direction on what I should be trying. As a note for others, this got my ~200 ms down to ~20 ms per call to create an S3 Resource:

import threading

import boto3.session
import botocore.loaders
import botocore.session

region_name = 'us-east-1'  # example value; defined elsewhere in the real code

# preload and reuse the model to shave ~200ms each time we create a session
# https://github.com/boto/boto3/issues/1670
# https://github.com/boto/botocore/issues/3078
_loader = botocore.loaders.Loader()
# these type names mirror the contents of botocore/data/s3
for type_name in frozenset(['endpoint-rule-set-1', 'paginators-1', 'service-2', 'waiters-2']):
    _loader.load_service_model(service_name='s3', type_name=type_name)

# session *instantiation* is not thread safe either
_boto_session_lock = threading.Lock()

def _session():
    session = botocore.session.get_session()
    session.register_component('data_loader', _loader)
    with _boto_session_lock:
        return boto3.session.Session(region_name=region_name, botocore_session=session)

Then in your threads later you can use the following for a significantly faster setup time:

session = _session()
#session.events.register(...)
resource = session.resource('s3', config=config, endpoint_url=AWS_ENDPOINT_URL)
