DeflatePerMessage VS memory consumption #1900
Replies: 18 comments
-
@M1ha-Shvn I still don't have a clear view on how to solve the issue with the …
-
Also, do you have an MRE for me to confirm the issue?
-
What I suggest is one of the following:
…
-
I haven't tested this PR; I made it quickly in order to show the general idea.
-
@aaugustin I'm sorry to bother you, but when you have time, would you mind giving me your input here, so we can tell whether we are going in the right direction? 🙏
-
If you're using context takeover (i.e. you don't set …), then each connection has to keep its own compression context, so per-connection memory usage is expected. If you aren't using context takeover, then you don't have a memory usage problem.
-
Apart from that, is there a knob to configure max_window_bits? With a very high number of connections, you would probably benefit from lowering it. See https://websockets.readthedocs.io/en/stable/topics/compression.html#compression-settings for defaults.
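For context, the compression-settings page linked above documents a memory-optimized server configuration along these lines (a sketch; the exact parameter values are illustrative, not a recommendation):

```python
from websockets.extensions import permessage_deflate

# Lower max_window_bits and memLevel to shrink the zlib state kept per
# connection. Values below trade compression ratio for memory.
extensions = [
    permessage_deflate.ServerPerMessageDeflateFactory(
        server_max_window_bits=11,  # 2 KB compression window
        client_max_window_bits=11,
        compress_settings={"memLevel": 4},
    )
]

# The factory list would then be passed to the server, e.g.:
# websockets.serve(handler, host, port, extensions=extensions)
```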
-
Hi, thanks for your answer.
First of all, I understand that I've changed this behaviour. But what I don't understand is: why is a deflate object created per connection? Why can't the deflate context be shared between all connections? From my point of view, in a typical websocket server you have lots of connections with clients and send very similar (particularly JSON) messages to all clients, i.e. to all connections. So it would be worth using the same deflate object with a single context and increasing its capacity, so that it could compress messages better. Of course, this depends on the connection settings set by request headers, but there is only a small number of combinations of these parameters, so only a limited number of objects would need to be created and used.
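To illustrate what a retained (context-takeover) deflate context buys over per-message contexts, here is a minimal zlib sketch; the message content is made up:

```python
import zlib

# A repetitive JSON-like payload; typical websocket traffic looks similar.
MSG = b'{"type": "update", "payload": {"count": 42, "status": "ok"}}' * 4

def compress_message(compressor, data):
    # permessage-deflate emits each message up to a Z_SYNC_FLUSH boundary.
    return compressor.compress(data) + compressor.flush(zlib.Z_SYNC_FLUSH)

# Context takeover: one compressor per connection, reused across messages.
shared = zlib.compressobj(wbits=-12)  # raw deflate, 4 KB window
first = compress_message(shared, MSG)
second = compress_message(shared, MSG)

# No context takeover: a fresh compressor (and window) for every message.
fresh = compress_message(zlib.compressobj(wbits=-12), MSG)

# The second message on the shared context back-references the first,
# so it compresses far better than a cold start.
assert len(second) < len(fresh)
```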
I'm not so sure about that. Creating an object in Python has its memory cost. Even if I don't use context takeover, PerMessageDeflate objects would still be created and use memory for each connection.
Yes, adding deflate tuning settings to uvicorn's settings was one of my proposals. Though it would lessen the problem for me, it would not solve it: I'd just have a higher connection limit while still consuming lots of memory per connection.
-
Kludex asked for my input; I gave it. If you don't trust me, run your own experiments and reach your own conclusions. In your experiments, don't stop at opening connections; exchange a significant number of different messages on each connection, in both directions, and make sure they make it through the compress/decompress cycle correctly.
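A minimal sketch of such a round-trip check at the zlib level (real permessage-deflate framing details are omitted):

```python
import zlib

def roundtrip(messages, wbits=-12):
    """Push each message through a shared compress/decompress pair,
    as a single connection with context takeover would."""
    compressor = zlib.compressobj(wbits=wbits)
    decompressor = zlib.decompressobj(wbits=wbits)
    out = []
    for message in messages:
        frame = compressor.compress(message) + compressor.flush(zlib.Z_SYNC_FLUSH)
        out.append(decompressor.decompress(frame))
    return out

# Many different messages, all of which must survive the cycle intact.
messages = [b'{"n": %d}' % n for n in range(100)]
assert roundtrip(messages) == messages
```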
-
In case it helps:
…
-
Thanks @aaugustin! I really appreciate it. 🙏
-
Did you try this, @M1ha-Shvn?
-
But it's not really possible to directly disable context takeover from uvicorn, correct? I'm also hitting this issue.
-
@Kludex @aaugustin In my use case, most of my request and response data is small JSON, but sporadically there are large chunks, which makes compression beneficial. However, while those connections are long running, the activity pattern per connection is bursty, so the current settings are not memory efficient.

**1. Context takeover**

The first step would be to expose …. It would be best to keep the default at …. Then we can expose those parameters by:
…
There would only be one toggle to set both the client and server context, because if your server is memory starved from connection quantity, you wouldn't care to enable context takeover for just one direction. This would allow both current implementations to discard their zlib ….

**2. Max window bits**

The next step would be to expose the …. Websockets says they set theirs by default at 12 bits (https://websockets.readthedocs.io/en/stable/topics/compression.html#for-servers), so a 4 kB max window, while wsproto sets theirs to 15 bits (https://github.com/python-hyper/wsproto/blob/main/src/wsproto/extensions.py#L65), so a 32 kB max window, which is quite large when you have thousands of connections. We should be able to set both the client and server max window bits, because each server connection requires them for decompressing input and compressing output. I would suggest the …. While both websockets and wsproto use zlib's ….

**3. CPU performance**

Currently, having deflate on forces context takeover. There is a permanent …. This is probably not good: 93 kB allocated and deallocated per message received, before window and strategy lookup dictionaries are filled. It would be good if websockets and wsproto used the …. The same cannot be done with decompression, however: wsproto simply calls …. Unfortunately, the objectless ….

**4. zlib note**

Unrelated to websockets and uvicorn, but it would be nice to have the …
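To make the max-window-bits trade-off discussed above concrete, here is a small zlib sketch (raw deflate, made-up data) showing that a small window cannot back-reference data that a large window can:

```python
import random
import zlib

# An incompressible 4 KB chunk repeated 12 KB apart: only a window large
# enough to reach back to the first copy can encode the second as a match.
chunk = random.Random(42).randbytes(4096)
data = chunk + b"x" * 12288 + chunk

def compressed_size(wbits):
    compressor = zlib.compressobj(wbits=wbits)  # negative wbits = raw deflate
    return len(compressor.compress(data) + compressor.flush())

small_window = compressed_size(-9)   # 512 B window
large_window = compressed_size(-15)  # 32 KB window

# The 32 KB window turns the second chunk into back-references,
# so its output is roughly one chunk smaller.
assert large_window < small_window
```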
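On the decompression point in section 3: with context takeover, each received frame ends at a Z_SYNC_FLUSH boundary rather than a finished deflate stream, so the one-shot zlib.decompress cannot replace a persistent decompressobj. A sketch:

```python
import zlib

message = b'{"status": "ok"}' * 8
compressor = zlib.compressobj(wbits=-15)
frame = compressor.compress(message) + compressor.flush(zlib.Z_SYNC_FLUSH)

# A streaming decompressor handles the sync-flushed frame fine...
decompressor = zlib.decompressobj(wbits=-15)
assert decompressor.decompress(frame) == message

# ...but the one-shot function requires a finished stream and
# rejects the frame as incomplete.
try:
    zlib.decompress(frame, -15)
except zlib.error:
    pass
else:
    raise AssertionError("expected zlib.error for an unterminated stream")
```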
-
Hi @reportingissue @Kludex @aaugustin, I ran into a similar memory leak issue that may be resolved by disabling per-message deflate or enabling no-context-takeover. However, do you have a recommended near-term fix to cleanly enable no-context-takeover?
-
Discussed in #1850
Originally posted by M1ha-Shvn January 27, 2023
Hi.
I'm developing a server serving websocket connections using FastAPI.
I've noticed that creating several thousand simultaneous websocket connections leads to high memory usage (4 GB per 5-8k websocket connections in my case). I started debugging it with tracemalloc and found out that the largest amount of this memory is consumed by the websockets deflate extension in this line.
After that, I dug into the websockets deflate mechanics and found out that it can be tuned in order to achieve lower memory consumption using a custom `ServerPerMessageDeflateFactory`. I tried searching for it in FastAPI => Starlette => Uvicorn code and it led me here.

What is the source of the memory leak:

- A `PerMessageDeflate` instance is created for each websocket connection (using `ServerPerMessageDeflateFactory`). From my point of view, this is a disadvantageous behaviour: it would be much better if it were created not for each websocket connection, but for each combination of connection parameters (like the Singleton pattern, but one instance per combination of parameters; something like `lru_cache`).
- The only exposed setting is `--ws-per-message-deflate`. It is not flexible for different cases.
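A minimal sketch of the kind of tracemalloc measurement described above (the allocation measured here is a generic stand-in, not the actual websockets line):

```python
import tracemalloc

tracemalloc.start()

# Stand-in for opening many connections, each holding buffered state.
connections = [bytearray(64 * 1024) for _ in range(100)]

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")[0]
print(top)  # the bytearray line above dominates the snapshot (~6.4 MB)
```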