
Idle memory use increasing over time #1795

Open
bnaul opened this issue Mar 1, 2018 · 16 comments

bnaul (Contributor) commented Mar 1, 2018

In the process of debugging some memory issues I noticed that memory usage of a scheduler+worker with no client connection was steadily increasing over time.

Command: dask-scheduler --no-bokeh & dask-worker localhost:8786 --nthreads 1 --nprocs 120 --memory-limit 3.2GB --no-bokeh & (default config.yaml)
Result: after a few idle hours, total memory usage went from about 1GB at startup to 4GB (as reported by the Google Cloud dashboard).

I'm aware that there are a lot of subtleties around measuring memory usage on Linux, so I'm not sure whether this is a real issue or just an artifact of the measurement process, but it seemed like a lot of memory for totally inactive processes. Curious whether anyone has thoughts about what might be happening.
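
For anyone who wants to watch something like this on a laptop, here is a minimal sketch (not the exact CLI setup above): it starts an idle LocalCluster in-process and logs the combined RSS of the scheduler and worker processes once a minute. It assumes psutil is installed; the duration and interval are arbitrary.

import time

import psutil
from dask.distributed import LocalCluster

# Idle cluster: one single-threaded worker process, no client and no work submitted.
cluster = LocalCluster(n_workers=1, threads_per_worker=1, processes=True)
main = psutil.Process()  # with LocalCluster the scheduler lives in this process

for minute in range(120):  # watch for two hours
    # Re-list children each time in case the nanny restarts the worker.
    procs = [main] + main.children(recursive=True)
    rss_mb = sum(p.memory_info().rss for p in procs) / 1e6
    print(f"t={minute:4d} min  total RSS: {rss_mb:,.1f} MB", flush=True)
    time.sleep(60)
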

mrocklin (Member)

@bnaul did you end up finding any additional information on this issue?

bnaul (Contributor, Author) commented Mar 19, 2018

Nothing new. I also realized there's another layer of complexity: this was happening inside a Docker container, so there's other stuff going on that makes it even harder to diagnose. I would probably say close this, but I'll leave it up to you.

ameetshah1983 commented Apr 20, 2018

We are running the dask scheduler on a Windows VM, and memory utilization gradually increases until system memory usage reaches 98%. We then have to restart the scheduler, as otherwise we receive timeouts from workers trying to connect. This takes a few days; the VM has 16GB of memory allocated.

We are currently on distributed 1.21.3 and dask 0.17.1.

One thing to add: in our case the cluster is not completely idle but does have jobs running from time to time. Please let me know if this should be filed as a separate issue in that case.

mrocklin (Member) commented Apr 20, 2018 via email

ameetshah1983

Upgraded to dask 0.17.2 and distributed 1.21.6. Even with no jobs being run, the memory slowly keeps increasing. It does take time; currently it's at 5.5GB, but we have seen it grow to 14GB.

cpaulik commented Jun 14, 2018

I have the same problem. For me the memory keeps increasing until the machine crashes. Did you find a solution to this problem? How can I run dask-scheduler to debug this?

mrocklin (Member)

Help would be welcome from anyone who is able to provide more concrete detail about what causes any sort of memory leak. It would be especially valuable to find a minimal example that reliably produces the leak.

mrocklin (Member)

How can I run dask-scheduler to debug this?

Any normal mechanisms to track memory use in Python would be fine.
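
For example, one such mechanism is tracemalloc pointed at the scheduler from a client. A rough sketch follows; note that connecting a client does slightly change the "idle, no client" condition, and that tracemalloc only sees Python-level allocations, not RSS growth from allocator fragmentation. The two helper functions are defined here for illustration, not distributed APIs.

import tracemalloc
from dask.distributed import Client

client = Client("localhost:8786")

def start_tracing():
    tracemalloc.start()

def top_allocations(n=10):
    # Return the n largest allocation sites currently alive on the scheduler.
    snapshot = tracemalloc.take_snapshot()
    return [str(stat) for stat in snapshot.statistics("lineno")[:n]]

client.run_on_scheduler(start_tracing)
# ... leave the cluster idle for a while, then:
print("\n".join(client.run_on_scheduler(top_allocations)))
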

cpaulik commented Aug 2, 2018

We are still running into this issue, and I have not yet been able to find a minimal example.

There are two things that I noticed in our processing chain that might cause issues:

  1. We are running in Docker containers on Google Cloud. Our idle workers always show about 10% memory consumption. From "Significant CPU usage when cluster is idle on large machine" #2079 I understood that this should not be the case, so it might be related to some container settings.

  2. Our processing uses compiled extensions that do not release the GIL. Could that be a cause for a memory increase?

I will still try to put together a minimal example, but I thought this information might help narrow things down a little.

mrocklin (Member) commented Aug 2, 2018 via email

cpaulik commented Aug 2, 2018

Yes, but the scheduler will only import e.g. numpy or pandas once, right? Our workers never crash and don't seem to have any memory leakage; it is just the scheduler for us.

I'll keep looking for a minimal example that I can reproduce locally.

mpeaton commented Oct 2, 2018

Workers crash due to OOM when calling ddf.to_parquet() on large files. Also of note: client.cancel() of futures containing the aforementioned call fails to free memory in the workers.
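
One check worth trying after the cancel (a sketch; assumes psutil is available on the workers): force a garbage-collection pass on every worker and compare RSS before and after, to see whether the memory is still referenced by Python objects or simply not being returned to the OS by the allocator.

import gc
import os

import psutil
from dask.distributed import Client

client = Client("localhost:8786")

def rss_mb():
    # Resident set size of the worker process, in MB.
    return psutil.Process(os.getpid()).memory_info().rss / 1e6

print("before:", client.run(rss_mb))
client.run(gc.collect)  # run a full collection on every worker
print("after: ", client.run(rss_mb))
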

chinmaychandak

@mrocklin Any update on this?

In the process of debugging some memory issues I noticed that memory usage of a scheduler+worker with no client connection was steadily increasing over time.

I am seeing a similar memory increase. Even with no client connection, the scheduler + worker memory keeps increasing. Is this because of the heartbeat/connection/some other logs being kept in system memory and not being cleaned out?

I was debugging some memory leak issues and found that restarting the workers every hour or so (I need Dask workers to run for longer periods of time) helped by wiping the workers' memory slate clean (though this bug remains), but this issue seems orthogonal to it.

mrocklin (Member)

@mrocklin Any update on this?

The last update I see on this issue is in 2018, so I'm guessing not.

Is this because of the heartbeat/connection/some other logs being kept in system memory and not being cleaned out?

I doubt it, but if anyone wants to investigate this and report back that would be welcome.

If people want to help resolve this issue then I think that the best thing to do is to provide a minimal reproducible example, preferably something that people can observe on their laptop.

I was debugging some memory leak issues and found that restarting the workers every hour or so (I need Dask workers to run for longer periods of time) helped by wiping the workers' memory slate clean (though this bug remains), but this issue seems orthogonal to it.

You might want to check out the --lifetime-* options in the dask-worker CLI for automatic help with this.
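
For reference, an invocation along these lines would restart each worker roughly every hour (check dask-worker --help on your version for the exact flag names and defaults):

dask-worker localhost:8786 --lifetime 1h --lifetime-stagger 5m --lifetime-restart
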

chinmaychandak

If people want to help resolve this issue then I think that the best thing to do is to provide a minimal reproducible example, preferably something that people can observe on their laptop.

Ok, I will try to create a minimal reproducer and post it here.

amohar2 commented Nov 2, 2021

I am seeing this issue on my end, with dask and distributed versions 2021.9.1.
The issue is easy to reproduce from the CLI: all I need to do is start dask-scheduler in one terminal and a dask-worker (with --nprocs 1 --nthreads 1) in another. Here is the output of top:
$ top | grep scheduler
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4888 user 20 0 1123128 91400 27876 S 0.7 0.0 0:02.51 dask-scheduler
The resident set size (RES) steadily increases; for example, after 20-30 minutes the output of top is:
$ top | grep scheduler
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4888 user 20 0 1130736 98708 27876 S 0.8 0.0 0:22.88 dask-scheduler
(RES went from 91400 to 98708)

The same steady increase can be seen on the worker as well:
$ top | grep python
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28549 user 20 0 1421252 94220 28192 S 0.7 0.0 0:02.60 python3.8
(after a few minutes)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28549 user 20 0 1425604 98640 28192 S 0.7 0.0 0:07.32 python3.8
(RES went from 94220 to 98640)

Even when starting ONLY dask-scheduler (i.e., no dask workers at all), I can still see this slow but steady increase in the RES value of the dask-scheduler process.
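
To help tell whether that growth is Python objects accumulating or just allocator-level behavior, here is a sketch that runs a bare scheduler in-process and logs its RSS alongside two Python-level counters. It assumes psutil is installed; the async Scheduler usage follows the pattern in the distributed docs and may need adjusting for your version.

import asyncio
import gc
import sys

import psutil
from distributed import Scheduler

async def main(minutes=120):
    async with Scheduler() as scheduler:  # bare scheduler, no workers
        proc = psutil.Process()
        for _ in range(minutes):
            rss_mb = proc.memory_info().rss / 1e6
            print(f"rss={rss_mb:8.1f} MB  "
                  f"py_objects={len(gc.get_objects()):9d}  "
                  f"alloc_blocks={sys.getallocatedblocks():10d}", flush=True)
            await asyncio.sleep(60)

asyncio.run(main())

If the object and block counts stay flat while RSS climbs, the growth is more likely fragmentation or an allocation outside the Python heap than a Python-level leak.
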
