Memory Leak on Unhandled Exceptions #8882

Open

ralphie0112358 opened this issue Feb 29, 2024 · 0 comments

Checklist

  • I have verified that the issue exists against the main branch of Celery.
  • This has already been asked to the discussions forum first.
  • I have read the relevant section in the
    contribution guide
    on reporting bugs.
  • I have checked the issues list
    for similar or identical bug reports.
  • I have checked the pull requests list
    for existing proposed fixes.
  • I have checked the commit log
    to find out if the bug was already fixed in the main branch.
  • I have included all related issues and possible duplicate issues
    in this issue (If there are none, check this box anyway).

Mandatory Debugging Information

  • I have included the output of celery -A proj report in the issue.
    (if you are not able to do this, then at least specify the Celery
    version affected).
  • I have verified that the issue exists against the main branch of Celery.
  • I have included the contents of pip freeze in the issue.
  • I have included all the versions of all the external dependencies required
    to reproduce this bug.

Optional Debugging Information

  • I have tried reproducing the issue on more than one Python version
    and/or implementation.
  • I have tried reproducing the issue on more than one message broker and/or
    result backend.
  • I have tried reproducing the issue on more than one version of the message
    broker and/or result backend.
  • I have tried reproducing the issue on more than one operating system.
  • I have tried reproducing the issue on more than one workers pool.
  • I have tried reproducing the issue with autoscaling, retries,
    ETA/Countdown & rate limits disabled.
  • I have tried reproducing the issue after downgrading
    and/or upgrading Celery and its dependencies.

Related Issues and Possible Duplicates

Related Issues

  • None

Possible Duplicates

  • None

Environment & Settings

Celery version: 5.3.6 (emerald-rush)

celery report Output:

software -> celery:5.3.6 (emerald-rush) kombu:5.3.5 py:3.11.8
            billiard:4.2.0 py-amqp:5.2.0
platform -> system:Linux arch:64bit, ELF
            kernel version:6.2.0-1017-aws imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:pyamqp results:disabled

broker_url: 'amqp://guest:********@localhost:5672//'
deprecated_settings: None

Steps to Reproduce

Required Dependencies

  • Minimal Python Version: N/A or Unknown
  • Minimal Celery Version: N/A or Unknown
  • Minimal Kombu Version: N/A or Unknown
  • Minimal Broker Version: N/A or Unknown
  • Minimal Result Backend Version: N/A or Unknown
  • Minimal OS and/or Kernel Version: N/A or Unknown
  • Minimal Broker Client Version: N/A or Unknown
  • Minimal Result Backend Client Version: N/A or Unknown

Python Packages

pip freeze Output:

amqp==5.2.0
billiard==4.2.0
celery==5.3.6
click==8.1.7
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
kombu==5.3.5
prompt-toolkit==3.0.43
psutil==5.9.8
python-dateutil==2.8.2
six==1.16.0
tzdata==2024.1
vine==5.1.0
wcwidth==0.2.13

Other Dependencies

N/A

Minimally Reproducible Test Case

Expected Behavior

Tasks raising unhandled exceptions should not consume excessive worker memory.

Actual Behavior

I first observed this issue in our production environment, where unhandled task exceptions would appear to "leak" memory in the worker process. Our production environment is AWS/SQS with a Django backend. I was able to isolate the behavior in a very minimal, out-of-the-box Celery example against a RabbitMQ broker (exactly as shown in the getting-started guide).

Here is my tasks.py

from celery import Celery

app = Celery('tasks', broker='pyamqp://guest@localhost//')


@app.task
def ok():
    # Control task: succeeds without doing any work.
    pass


@app.task
def bad():
    # Task that always fails with an unhandled exception.
    raise RuntimeError("err")


@app.task(bind=True)
def again(self):
    # Task that retries until max_retries is exhausted.
    if self.request.retries < self.max_retries:
        raise self.retry(countdown=0.1)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("task")
    parser.add_argument("--count", type=int, default=1)
    args = parser.parse_args()

    task = app.tasks.get(f"tasks.{args.task}")
    for _ in range(args.count):
        task.delay()

I start RabbitMQ per the example (only adding --rm):

docker run -d --rm -p 5672:5672 rabbitmq

In one shell I run the Celery worker:

env/bin/celery -A tasks worker --loglevel=INFO --concurrency=1

In another shell, monitor the worker's RSS memory. I leave the method to you, but ps fax | grep celery to find the worker PID and then top -p {worker_pid} works well enough. In top, the RSS is shown in the RES column. A scripted alternative is sketched below.
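If you prefer a scripted monitor, here is a minimal sketch using psutil (already in the pip freeze above); the script name is hypothetical, and it assumes the prefork layout where the --concurrency=1 pool child is the only child of the main celery process:

# monitor_rss.py -- hypothetical helper, not part of the repro.
# Usage: python monitor_rss.py <main_celery_pid>
import sys
import time

import psutil

parent = psutil.Process(int(sys.argv[1]))
child = parent.children()[0]  # the single --concurrency=1 pool worker

while True:
    rss_kb = child.memory_info().rss // 1024
    print(f"pool worker RSS: {rss_kb} KB")
    time.sleep(1)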

In another shell, I issue tasks via the small __main__ block above in tasks.py.

Use the ok task as a control to observe that the memory does not increase:

env/bin/python tasks.py ok --count 1000

At this point the process should be fully warmed up, and I observe the worker's RSS at 34896 KB.

Issuing a single tasks.bad generally consumes 200-300 KB of memory that is not returned. The first run consumes more, likely because it warms up Python code paths that had not run yet.

env/bin/python tasks.py bad

Now the worker's RSS is at 35792 KB.

Issuing 1000 tasks.bad in a tight loop allocates excessive memory:

env/bin/python tasks.py bad --count 1000

Now the worker's RSS is at 107168 KB.

In this environment, if I issue another batch of bad, the RSS tops out around 116 MB.

I know these types of Python issues (this could even be a Python-ism) are difficult to deal with, but after some internal discussion we decided to at least bring it to your attention and raise awareness for others.

Using code inspection, I theorized similar behavior for a Retry (see My Analysis below). You can run the same experiment again, this time using tasks.again instead of tasks.bad, and you should see a similar phenomenon, only to a lesser extent.

In my production environment we see much larger jumps in RSS per unhandled exception.

My Analysis

I did some sleuthing, and I believe the issue centers on billiard's serialization of the traceback, which I assume is returned to the parent process. This is the einfo object (from billiard.einfo import ExceptionInfo). I believe some Python 3.11 specific code aggravates the issue.

Here is where Celery builds the einfo object.

Here is some Python 3.11 specific billiard code which collects bytecode info for the traceback frames. My analysis suggests this accounts for the majority of the bytes allocated (I'm no expert here, but that's my quick understanding). I used memray along with the -P solo option to observe the memory allocation flame graph.
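For anyone who wants to reproduce the flame graph, something along these lines should work (the exact memray invocation is from memory, so treat it as an assumption; the solo pool keeps all allocations in a single process):

env/bin/python -m memray run -o worker.bin -m celery -A tasks worker --loglevel=INFO -P solo
env/bin/python -m memray flamegraph worker.bin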

This einfo object is propagated up through the stack. I can avoid the problem if, lower down the call stack, I remove the reference to the einfo, e.g. by setting R=None before the return at line 575. Once the object has propagated further up the stack, the same trick no longer works: for instance, setting R=None at the end of fast_trace_task (line 654) does not fix the issue.

The allocations seem proportional to the complexity of the traceback. In our production environment, the "leaks" are much bigger.

Warning: the following are hypotheses based on observation and a few days of recent study into this issue. Don't take this as truth, but it might point someone in the right direction.

I believe the object is "lost" to the gc's reference counter at some point in the call stack, so a more expensive gc collection becomes necessary to reclaim it. If I insert a gc.collect() somewhere in the call stack, it seems to mitigate the issue. I believe that because so many objects are allocated and freed quickly while this set of einfo components is allocated (and not freed), the einfo tends to be promoted to the gc's generation 2 management. Once these many small allocations have been made, the memory is not returned to the OS.

In our own deployment, if I put a gc.collect() in handle_success and handle_failure, then I do not observe the "leak". We are not internally convinced by this solution yet.
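For illustration, a rough equivalent of that workaround can be wired up with Celery's task_success/task_failure signals rather than patching handle_success/handle_failure directly; this is only a sketch of the mitigation we tried, not a proposed fix:

# gc_workaround.py -- sketch of the mitigation: force a full gc after every
# task outcome so the einfo/traceback cycles are reclaimed before they are
# promoted to generation 2. Wired via signals instead of patching
# handle_success/handle_failure.
import gc

from celery import Celery
from celery.signals import task_failure, task_success

app = Celery('tasks', broker='pyamqp://guest@localhost//')


@task_success.connect
def _collect_on_success(**kwargs):
    gc.collect()


@task_failure.connect
def _collect_on_failure(**kwargs):
    gc.collect()


@app.task
def bad():
    raise RuntimeError("err")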
