
Continuous memory leak #4843

Open
marvelph opened this issue Jun 23, 2018 · 173 comments · Fixed by #5870

Comments

@marvelph

marvelph commented Jun 23, 2018

There is a memory leak in the parent process of Celery's worker, not in the child processes that execute tasks.
It starts suddenly every few days.
Unless Celery is stopped, it consumes the server's memory within tens of hours.

This problem happens at least in Celery 4.1, and it also occurs in Celery 4.2.
Celery is running on Ubuntu 16 with RabbitMQ as the broker.

[attached: memory usage graph]

@georgepsarakis
Contributor

georgepsarakis commented Jun 23, 2018

Are you using Canvas workflows? Maybe #4839 is related.

Also I assume you are using prefork pool for worker concurrency?

@marvelph
Author

Thanks @georgepsarakis.

I am not using Canvas workflows.
I use the prefork pool with concurrency 1 on a single server.

@georgepsarakis
Contributor

The increase rate seems quite linear, quite weird. Is the worker processing tasks during this time period? Also, can you add a note with the complete command you are using to start the worker?

@marvelph
Author

Yes. The worker continues to process tasks normally.

The worker is started with the following command:

/xxxxxxxx/bin/celery worker --app=xxxxxxxx --loglevel=INFO --pidfile=/var/run/xxxxxxxx.pid

@marvelph
Author

This problem occurs in both the production environment and the test environment.
I can add memory profiling and test output in the test environment.
If there is anything I can do, please let me know.

@georgepsarakis
Contributor

We need to understand what the worker is running during the time that the memory increase is observed. Any information and details you can possibly provide would definitely help. It is also good that you can reproduce this.

@marvelph
Author

Although this case occurred at a different time than the one in the graph, the following log was output at the moment the memory leak started.

[2018-02-24 07:50:52,953: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/xxxxxxxx/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 320, in start
blueprint.start(self)
File "/xxxxxxxx/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/xxxxxxxx/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 596, in start
c.loop(*c.loop_args())
File "/xxxxxxxx/lib/python3.5/site-packages/celery/worker/loops.py", line 88, in asynloop
next(loop)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/hub.py", line 293, in create_loop
poll_timeout = fire_timers(propagate=propagate) if scheduled else 1
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/hub.py", line 136, in fire_timers
entry()
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/timer.py", line 68, in __call__
return self.fun(*self.args, **self.kwargs)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/timer.py", line 127, in _reschedules
return fun(*args, **kwargs)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/connection.py", line 290, in heartbeat_check
return self.transport.heartbeat_check(self.connection, rate=rate)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/transport/pyamqp.py", line 149, in heartbeat_check
return connection.heartbeat_tick(rate=rate)
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/connection.py", line 696, in heartbeat_tick
self.send_heartbeat()
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/connection.py", line 647, in send_heartbeat
self.frame_writer(8, 0, None, None, None)
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/method_framing.py", line 166, in write_frame
write(view[:offset])
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/transport.py", line 258, in write
self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
[2018-02-24 08:49:12,016: INFO/MainProcess] Connected to amqp://xxxxxxxx:**@xxx.xxx.xxx.xxx:5672/xxxxxxxx

It seems to have occurred when the connection to RabbitMQ was temporarily cut off.

@georgepsarakis
Contributor

georgepsarakis commented Jun 24, 2018

@marvelph so it occurs during RabbitMQ reconnections? Perhaps these issues are related:

@marvelph
Author

Yes.
It seems that reconnection triggers it.

@jxltom

jxltom commented Jun 25, 2018

It looks like I'm having the same issue... It has been very hard for me to find out what triggers it and why there is a memory leak. It has annoyed me for at least a month. I fell back to Celery 3 and everything is fine.

For the memory leak issue, I'm using Ubuntu 16 and Celery 4.1.0 with RabbitMQ. I deployed it via Docker.

The memory leak is in the MainProcess, not the ForkPoolWorker. The memory usage of the ForkPoolWorker is normal, but the memory usage of the MainProcess is always increasing: around 0.1 MB is leaked every five seconds. The leak doesn't start immediately after the worker starts, but maybe after one or two days.

I used gdb and pyrasite to inject into the running process and call gc.collect(), but nothing was collected.

I checked the log; the "consumer: Connection to broker lost. Trying to re-establish the connection..." message did happen, but for now I'm not sure this is when the memory leak starts.

Any hints for debugging this issue and finding out what really happens? Thanks.
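For reference, a minimal sketch of the kind of inspection that can be run once attached to the worker's MainProcess with pyrasite-shell; whether anything shows up in gc.garbage depends entirely on the process, so treat the numbers as a starting point rather than a diagnosis:

# Run inside the attached interpreter, e.g. `pyrasite-shell <pid>`.
import gc
from collections import Counter

gc.set_debug(gc.DEBUG_SAVEALL)   # keep unreachable objects in gc.garbage for inspection
unreachable = gc.collect()       # number of unreachable objects found by this collection
print('unreachable objects found:', unreachable)
print('objects kept in gc.garbage:', len(gc.garbage))

# Count live objects by type to spot anything accumulating over time.
print(Counter(type(o).__name__ for o in gc.get_objects()).most_common(10))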

@jxltom

jxltom commented Jun 25, 2018

Since @marvelph mentioned it may be related to RabbitMQ reconnection, I tried stopping my RabbitMQ server. The memory usage did increase after each reconnection; the log is below. So I can confirm the celery/kombu#843 issue.

But after the connection is re-established, the memory usage stops gradually increasing. So I'm not sure this is the reason for the memory leak.

I will try using Redis to figure out whether this memory leak issue is related to RabbitMQ or not.

[2018-06-25 02:43:33,456: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 316, in start
    blueprint.start(self)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 592, in start
    c.loop(*c.loop_args())
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/transport/base.py", line 236, in on_readable
    reader(loop)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/transport/base.py", line 218, in _read
    drain_events(timeout=0)
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/connection.py", line 491, in drain_events
    while not self.blocking_read(timeout):
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/connection.py", line 496, in blocking_read
    frame = self.transport.read_frame()
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/transport.py", line 243, in read_frame
    frame_header = read(7, True)
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/transport.py", line 418, in _read
    s = recv(n - len(rbuf))
ConnectionResetError: [Errno 104] Connection reset by peer
[2018-06-25 02:43:33,497: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 2.00 seconds...

[2018-06-25 02:43:35,526: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 4.00 seconds...

[2018-06-25 02:43:39,560: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 6.00 seconds...

[2018-06-25 02:43:45,599: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 8.00 seconds...

[2018-06-25 02:43:53,639: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 10.00 seconds...

[2018-06-25 02:44:03,680: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 12.00 seconds...

[2018-06-25 02:44:15,743: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 14.00 seconds...

[2018-06-25 02:44:29,790: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 16.00 seconds...

[2018-06-25 02:44:45,839: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 18.00 seconds...

[2018-06-25 02:45:03,890: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 20.00 seconds...

[2018-06-25 02:45:23,943: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 22.00 seconds...

[2018-06-25 02:45:46,002: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 24.00 seconds...

[2018-06-25 02:46:10,109: INFO/MainProcess] Connected to amqp://***:**@***:***/***
[2018-06-25 02:46:10,212: INFO/MainProcess] mingle: searching for neighbors
[2018-06-25 02:46:10,291: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 316, in start
    blueprint.start(self)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/mingle.py", line 40, in start
    self.sync(c)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/mingle.py", line 44, in sync
    replies = self.send_hello(c)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/mingle.py", line 57, in send_hello
    replies = inspect.hello(c.hostname, our_revoked._data) or {}
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/control.py", line 132, in hello
    return self._request('hello', from_node=from_node, revoked=revoked)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/control.py", line 84, in _request
    timeout=self.timeout, reply=True,
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/control.py", line 439, in broadcast
    limit, callback, channel=channel,
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/pidbox.py", line 315, in _broadcast
    serializer=serializer)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/pidbox.py", line 290, in _publish
    serializer=serializer,
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py", line 181, in publish
    exchange_name, declare,
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py", line 203, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py", line 1732, in _basic_publish
    (0, exchange, routing_key, mandatory, immediate), msg
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py", line 50, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py", line 166, in write_frame
    write(view[:offset])
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/transport.py", line 275, in write
    self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
[2018-06-25 02:46:10,375: INFO/MainProcess] Connected to amqp://***:**@***:***/***
[2018-06-25 02:46:10,526: INFO/MainProcess] mingle: searching for neighbors
[2018-06-25 02:46:11,764: INFO/MainProcess] mingle: all alone

@marvelph
Author

Although I checked the logs and found a reconnection log at the time of the memory leak, there was also a case where a memory leak started when no reconnection had occurred.
I agree with @jxltom's idea.

Also, when I was using Celery 3.x, I did not encounter such a problem.

@dmitry-kostin

dmitry-kostin commented Jun 25, 2018

Same problem here.
[screenshot 2018-06-25 11 09 22]
Every few days I have to restart the workers due to this problem.
There are no significant clues in the logs, but I suspect that reconnects play a role, since I have reconnect log entries around the time when memory starts constantly growing.
My configuration is Ubuntu 17, 1 server with 1 worker at concurrency 3; RabbitMQ as broker and Redis as backend; all packages are at the latest versions.

@georgepsarakis
Contributor

@marvelph @dmitry-kostin could you please provide your exact configuration (omitting sensitive information of course) and possibly a task, or sample, that reproduces the issue? Also, do you have any estimate of the average uptime interval that the worker memory increase starts appearing?

@dmitry-kostin

dmitry-kostin commented Jun 25, 2018

The config is close to the default:

imports = ('app.tasks',)
result_persistent = True
task_ignore_result = False
task_acks_late = True
worker_concurrency = 3
worker_prefetch_multiplier = 4
enable_utc = True
timezone = 'Europe/Moscow'
broker_transport_options = {'visibility_timeout': 3600, 'confirm_publish': True, 'fanout_prefix': True, 'fanout_patterns': True}

[screenshot 2018-06-25 11 35 17]

Basically this is a newly deployed node; it was deployed on 06/21 at 18:50, memory started to grow on 06/23 around 05:00, and it finally crashed on 06/23 around 23:00.

The task is pretty simple and there is no complicated logic in it. I think I can reproduce the whole situation in a clean temporary project, but I have no free time for now; if I'm lucky I will try to put together a full example over the weekend.

UPD:
As you can see, the task itself consumes some memory (visible as spikes on the graph), but at the time the memory started to leak there were no tasks being produced or any other activity.

@georgepsarakis
Contributor

@marvelph @dmitry-kostin @jxltom I noticed you use Python3. Would you mind enabling tracemalloc for the process? You may need to patch the worker process though to log memory allocation traces, let me know if you need help with that.

@jxltom

jxltom commented Jun 25, 2018

@georgepsarakis You mean enable tracemalloc in worker and log stats, such as the top 10 memory usage files, at a specific interval such as 5 minutes?

@georgepsarakis
Contributor

@jxltom I think something like that would help locate the part of code that is responsible. What do you think?
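A minimal sketch of what such an instrumented worker could look like (an assumption for illustration, not the actual patch georgepsarakis had in mind): start tracemalloc from a Celery signal in the MainProcess and log the top allocation differences from a background thread. The broker URL, the five-minute interval, and the top-10 limit are placeholders.

import logging
import threading
import time
import tracemalloc

from celery import Celery
from celery.signals import worker_ready

app = Celery('tasks', broker='amqp://guest@localhost//')  # placeholder broker URL
logger = logging.getLogger(__name__)


def _log_top_allocations(interval=300, limit=10):
    # Compare successive snapshots so only the growth between intervals is reported.
    previous = tracemalloc.take_snapshot()
    while True:
        time.sleep(interval)
        current = tracemalloc.take_snapshot()
        for stat in current.compare_to(previous, 'lineno')[:limit]:
            logger.warning('tracemalloc: %s', stat)
        previous = current


@worker_ready.connect
def start_memory_tracing(**kwargs):
    # worker_ready fires in the MainProcess, which is where the leak is observed.
    tracemalloc.start()
    threading.Thread(target=_log_top_allocations, daemon=True).start()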

@jxltom

jxltom commented Jun 25, 2018

@georgepsarakis I've tried using gdb and https://github.com/lmacken/pyrasite to inject into the leaking process and start debugging via tracemalloc. Here are the top 10 files with the highest memory usage.

I use resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 and the memory usage is indeed gradually increasing.

>>> import tracemalloc
>>> 
>>> tracemalloc.start()
>>> snapshot = tracemalloc.take_snapshot()
>>> top_stats = snapshot.statistics('lineno')
>>> for stat in top_stats[:10]:
...     print(stat)
... 
/app/.heroku/python/lib/python3.6/site-packages/kombu/utils/eventio.py:84: size=12.0 KiB, count=1, average=12.0 KiB
/app/.heroku/python/lib/python3.6/site-packages/celery/worker/heartbeat.py:47: size=3520 B, count=8, average=440 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py:166: size=3264 B, count=12, average=272 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:142: size=3060 B, count=10, average=306 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:157: size=2912 B, count=8, average=364 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py:50: size=2912 B, count=8, average=364 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:181: size=2816 B, count=12, average=235 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:203: size=2816 B, count=8, average=352 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:199: size=2672 B, count=6, average=445 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py:1734: size=2592 B, count=8, average=324 B

Here is the difference between two snapshots after around 5 minutes.

>>> snapshot2 = tracemalloc.take_snapshot()
>>> top_stats = snapshot2.compare_to(snapshot, 'lineno')
>>> print("[ Top 10 differences ]")
[ Top 10 differences ]

>>> for stat in top_stats[:10]:
...     print(stat)
... 
/app/.heroku/python/lib/python3.6/site-packages/celery/worker/heartbeat.py:47: size=220 KiB (+216 KiB), count=513 (+505), average=439 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:142: size=211 KiB (+208 KiB), count=758 (+748), average=285 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py:166: size=210 KiB (+206 KiB), count=789 (+777), average=272 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:157: size=190 KiB (+187 KiB), count=530 (+522), average=366 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py:50: size=186 KiB (+183 KiB), count=524 (+516), average=363 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:199: size=185 KiB (+182 KiB), count=490 (+484), average=386 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:203: size=182 KiB (+179 KiB), count=528 (+520), average=353 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:181: size=179 KiB (+176 KiB), count=786 (+774), average=233 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py:1734: size=165 KiB (+163 KiB), count=525 (+517), average=323 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/async/hub.py:293: size=157 KiB (+155 KiB), count=255 (+251), average=632 B

@jxltom

jxltom commented Jun 25, 2018

Any suggestions for how to continue debugging this? I have no clue how to proceed. Thanks.

@marvelph
Author

marvelph commented Jun 26, 2018

@georgepsarakis

I need a little time to extract a project for reproduction.

These are the Celery settings:

BROKER_URL = [
    'amqp://xxxxxxxx:yyyyyyyy@aaa.bbb.ccc.ddd:5672/zzzzzzzz'
]
BROKER_TRANSPORT_OPTIONS = {}

The scheduler has the following settings.

CELERYBEAT_SCHEDULE = {
    'aaaaaaaa_bbbbbbbb': {
        'task': 'aaaa.bbbbbbbb_cccccccc',
        'schedule': celery.schedules.crontab(minute=0),
    },
    'dddddddd_eeeeeeee': {
        'task': 'dddd.eeeeeeee_ffffffff',
        'schedule': celery.schedules.crontab(minute=0),
    },
}

On EC2, I am using supervisord to run it.

@marvelph
Author

@georgepsarakis
Since my test environment can tolerate performance degradation, we can use tracemalloc there.
Can you make a patched Celery that dumps memory usage?

@dmitry-kostin

dmitry-kostin commented Jun 26, 2018

@jxltom I bet tracemalloc over 5 minutes won't help locate the problem.
For example, I have 5 nodes and only 3 of them have had this problem over the last 4 days, while 2 worked fine the whole time, so it will be very tricky to locate the problem.
I feel like there is some toggle that switches on and then memory starts to grow; until that switch, memory consumption looks perfectly fine.

@marvelph
Author

I tried to find out whether similar problems occur in other running systems.
The frequency of occurrence varies, but a memory leak has occurred on three systems using Celery 4.x, and it has not happened on one system.
The systems with the memory leak run Python 3.5.x, and the system with no memory leak runs Python 2.7.x.

@jxltom

jxltom commented Jun 26, 2018

@dmitry-kostin What's the difference from the other two normal nodes? Are they all using the same RabbitMQ as the broker?

Since our discussion mentioned it may be related to RabbitMQ, I started another new node with the same configuration except that it uses Redis instead. So far, this node has no memory leak after running for 24 hours. I will post here if it leaks later.

@jxltom

jxltom commented Jun 26, 2018

@marvelph So do you mean that the three systems with the memory leak are using Python 3 while the one which is fine is using Python 2?

@dmitry-kostin

dmitry-kostin commented Jun 26, 2018

@jxltom No difference at all, and yes, they are on Python 3 with RabbitMQ as broker and Redis as backend.
I made a test example to reproduce this; if it succeeds in a couple of days I will give credentials for these servers to somebody who knows how to locate this bug.

@marvelph
Author

@jxltom
Yes.
As far as my environment is concerned, the problem does not occur in Python 2.

@pawl
Contributor

pawl commented Dec 24, 2021

The pull request I made today with the fix for the Redis broker leaking memory (when connections to the broker fail) was just merged.

I'm not aware of any other ways to reproduce memory leaks for #4843 at the moment.

Here's a summary of the fixes so far:

These fixes should completely prevent leaks due to disconnected connections to the broker:

And, if there are still some scenarios where that doesn't work... there are also these fixes that make Connections and Transports use ~150 KB less memory each (making some potential leaks much less severe):

Thank you @auvipy for all the feedback and help with getting this stuff reviewed and merged.

@auvipy
Member

auvipy commented Dec 24, 2021

@pawl thanks to you and your teammates for the great collaboration and contributions. I will push point releases with the other merged changes next Sunday if not swallowed by family/holiday vibes, but next week for sure.

@caleb15

caleb15 commented Jan 6, 2022

@auvipy Just to double-check, version 5.2.3 of celery that you pushed recently has the memory leak fixes, right?

@pawl
Contributor

pawl commented Jan 6, 2022

@caleb15 Celery 5.2.3 does have a minor leak fix I didn't mention in my comment above: #7187. But I'm not sure that one is the main one generating the complaints in this thread.

I think the main leak fixes are going to come from upgrading kombu to 5.2.3 (if you're using the redis broker) and py-amqp to 5.0.9 (if you're using py-amqp for connecting to rabbitmq).

For more details, see: #4843 (comment)

You may also want to check out this new section of the docs about handling memory leaks: https://docs.celeryproject.org/en/stable/userguide/optimizing.html#memory-usage
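For reference, a minimal sketch of the two worker-recycling safeguards that the linked optimizing guide covers; the 200-task and 120 MB thresholds are arbitrary example values, not recommendations:

# Hypothetical celeryconfig.py snippet.
# Recycle a pool child process after it has executed this many tasks.
worker_max_tasks_per_child = 200
# Recycle a pool child process once its resident memory exceeds this limit (in KiB).
worker_max_memory_per_child = 120 * 1024  # 120 MB

# Equivalent command-line flags:
#   celery -A proj worker --max-tasks-per-child=200 --max-memory-per-child=122880

Note that both options recycle the pool's child processes, so they bound per-task leaks rather than the MainProcess growth discussed earlier in this thread.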

@Kludex
Contributor

Kludex commented Oct 3, 2022

@auvipy Were you able to confirm that the issue was solved? If you don't know, I'll spend time checking.

Please let me know. 🙏

@auvipy
Member

auvipy commented Oct 15, 2022

@auvipy Were you able to confirm that the issue was solved? If you don't know, I'll spend time checking.

Please let me know. 🙏

It was partially fixed, but another attempt to fix or figure out the remaining leaks would be very helpful. Sorry for the late reply, I took a week break.

@Kludex
Contributor

Kludex commented Oct 17, 2022

I've created this repository: https://github.com/Kludex/celery-leak

From my observations, the memory grows until a certain point and then remains constant. It took around 2k tasks to reach the point where it stays constant.

Can someone point out to me how to reproduce it, or what I should try in order to reproduce it?

@harshita01398

harshita01398 commented Jan 16, 2023

Seeing this on Celery-4.3.1, Kombu-4.6.11, Redis-4.1.2

Below is the average memory chart. The available memory increases when the service is restarted during deployment, which happens twice a day (Mon-Fri).

During weekends, available memory keeps decreasing until the service is restarted.

[attached: memory chart]

@auvipy Any suggestion/fix for this? Does upgrading resolve this issue?

@auvipy
Member

auvipy commented Jan 16, 2023

First of all, we really can't tell much of anything about an unsupported version, which was released almost 5 years ago. Using the latest version usually provides more stability in general, and if any issues are raised, they are generally easier to reproduce and fix.

@norbertcyran
Contributor

In our case, what we thought was a memory leak actually turned out to be ETA tasks accumulating in the workers. Over a period of a few days, our RAM usage increased by 30 GB. I hope this might be useful for some of you.

More info:

@oleks-popovych

I'm experiencing a memory leak in a forked worker. Essentially, not all memory is freed after consecutive task executions.
What kinds of approaches could I use to minimize or fix the memory leak, other than limiting the number of tasks or the amount of memory a worker is allowed to consume?

@some1ataplace

some1ataplace commented Mar 31, 2023

Here are some general tips and guidance on how to approach fixing memory leaks in Python, which can be applied to the Celery project.

  1. Identify the leak source: Use memory profiling tools like memory_profiler or objgraph to identify the objects that are causing the memory leak. This will help you pinpoint the part of the code that needs fixing.
from memory_profiler import profile

@profile
def your_function():
    # Your code here
    pass
  2. Use weak references: If the memory leak is caused by circular references between objects, you can use Python's weakref module to create weak references that don't prevent garbage collection.
import weakref

class MyClass:
    def __init__(self, other_instance=None):
        self.other_instance = weakref.ref(other_instance) if other_instance else None

instance1 = MyClass()
instance2 = MyClass(instance1)
instance1.other_instance = weakref.ref(instance2)

Another example:

import weakref
from celery import Celery

app = Celery('tasks', broker='pyamqp://guest@localhost//')

class ResourceHolder:
    def __init__(self, data):
        self.data = data

# Create a weak reference dictionary for resources
resources = weakref.WeakValueDictionary()

@app.task
def process_resource(resource_id):
    resource_holder = resources.get(resource_id)
    if resource_holder is not None:
        # Process your resource_holder.data here
        pass

def main():
    # Load all resources
    for resource_data in load_resources():
        resource_holder = ResourceHolder(resource_data)
        resources[id(resource_holder)] = resource_holder
        process_resource.apply_async((id(resource_holder),))

if __name__ == "__main__":
    main()

This example assumes that you have resources that need to be processed. Instead of passing the actual resource object to the Celery task, you maintain a weak reference dictionary, and only pass the id. This way, once the resource is no longer needed, it can be garbage collected, preventing a memory leak.

  3. Properly close resources: Ensure that you're properly closing resources like file handles, sockets, and database connections. Use context managers (with statement) whenever possible.
with open('file.txt', 'r') as f:
    content = f.read()
  4. Clear caches and buffers: If you're using caches or buffers, make sure to clear them periodically or when they're no longer needed.

cache.clear()

  5. Use garbage collection: In some cases, you may need to manually call Python's garbage collector to clean up unused objects. Be cautious when using this approach, as it can impact performance.
import gc

gc.collect()
  6. Optimize data structures: Sometimes, memory leaks can be caused by inefficient data structures. Consider using more memory-efficient data structures like array.array, slots, or namedtuple, depending on your use case.
from collections import namedtuple

MyTuple = namedtuple('MyTuple', ['field1', 'field2'])
  7. Limit task results: In the case of Celery, you may want to limit the number of task results stored in the backend by setting the task result expiration time.

app.conf.update(CELERY_TASK_RESULT_EXPIRES=3600)

  8. Monitor and profile: Continuously monitor the memory usage of your application and profile it regularly to identify any potential memory leaks early on.

@KyeRussell

That...kind of reads like a ChatGPT answer.

@FabriQuinteros

I have a memory leak in a process that sends emails. There are 50 Celery tasks executed in parallel at scheduled times (eta); that is, one sending task does not have to finish before the next one starts. I do this with Celery's group().

What this process mainly does is open and close the connection to the mail server many times (to send the mails) and generate records in the database (around 1000 records in 45 minutes), and there comes a time when my memory fills up to the maximum available. I suppose there is a memory leak and the memory is never recovered, so no matter how long after the function ends, that memory will not be reclaimed until the worker is restarted. What can you recommend I do to avoid this leak?

django 3.2.18
celery 5.2.7
vine 5.0.0
kombu 5.2.3

@norbertcyran
Contributor

@FabriQuinteros if you use eta tasks, you might find this comment useful: #4843 (comment)

@FabriQuinteros

@norbertcyran I checked it, but my problem is short term, not long term. I have many other tasks scheduled besides these. The problem is when they start to run, not at the moment when I add them to the task queue.

@hadpro24

hadpro24 commented Jul 16, 2023

Hi guys, I advise you to use jemalloc. It has helped us considerably reduce memory consumption.

Here's my Dockerfile configuration


FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1

RUN groupadd -r app && useradd -r -g app app

RUN apt-get update
RUN apt-get install -y --no-install-recommends \
build-essential gcc libpq-dev libc-dev libmagic1 libpq5
RUN apt-get install -y --no-install-recommends libjemalloc2 && rm -rf /var/lib/apt/lists/*

ENV LD_PRELOAD /usr/lib/x86_64-linux-gnu/libjemalloc.so.2

WORKDIR /app
COPY requirements.txt .
RUN pip install --upgrade Cython && pip install -r requirements.txt

COPY . .
RUN chown -R app:app /app
USER app

CMD ["/bin/bash", "./entrypoint.sh"]

https://github.com/jemalloc/jemalloc

@adalyuf

adalyuf commented Aug 5, 2023

For anyone running into this on Django, this helped my memory leak.

Most answers online mention setting CELERYD_MAX_TASKS_PER_CHILD - this is the right idea but the lingo needs to be updated for new django/celery projects.

Celery has switched the naming of certain configuration options.
You would expect CELERYD_MAX_TASKS_PER_CHILD to become worker_max_tasks_per_child; however, this is not what should be used in a Django settings file. For use in Django settings, we need to uppercase it and prefix it with CELERY_.

Celery has a command to make this conversion easy:
celery upgrade settings <project>/settings.py --django

This then will change CELERYD_MAX_TASKS_PER_CHILD to CELERY_WORKER_MAX_TASKS_PER_CHILD

To troubleshoot whether this is working or not, run flower and on the Flower -> Pool tab you should see
Max tasks per child | 200

If this approach doesn't work, you can add it to the worker invocation as
celery ... worker ... --max-tasks-per-child=200
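A minimal sketch of how this ends up looking in a Django project, assuming the usual namespace='CELERY' setup from the Celery/Django integration docs (the value 200 is just the example above, and the module names are hypothetical):

# settings.py: recycle each pool child process after 200 tasks.
CELERY_WORKER_MAX_TASKS_PER_CHILD = 200

# celery.py: the setting only takes effect if the Celery app reads Django
# settings with the CELERY_ namespace, for example:
#   app.config_from_object('django.conf:settings', namespace='CELERY')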

@Robin528919

[attached image]
With -P eventlet, memory keeps rising steadily. Is there a solution?

