
Idle memory use increasing over time #1795

Open
bnaul opened this issue Mar 1, 2018 · 16 comments

bnaul (Contributor) commented Mar 1, 2018

In the process of debugging some memory issues I noticed that memory usage of a scheduler+worker with no client connection was steadily increasing over time.

Command: dask-scheduler --no-bokeh & dask-worker localhost:8786 --nthreads 1 --nprocs 120 --memory-limit 3.2GB --no-bokeh & (default config.yaml)
Result: after a few idle hours, total memory usage went from about 1GB at startup to 4GB (as reported by the Google Cloud dashboard).

I'm aware that there are a lot of subtleties around measuring memory usage on Linux, so I'm not sure whether this is a real issue or just an artifact of the measurement process, but it seemed like a lot of memory for totally inactive processes. Curious whether anyone has thoughts about what might be happening.
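
For anyone who wants to watch something like this on a laptop, here is a minimal sketch (not the exact CLI setup above): it starts an idle LocalCluster in-process and logs the combined RSS of the scheduler and worker processes once a minute. It assumes psutil is installed; the duration and interval are arbitrary.

import time

import psutil
from dask.distributed import LocalCluster

# Idle cluster: one single-threaded worker process, no client and no work submitted.
cluster = LocalCluster(n_workers=1, threads_per_worker=1, processes=True)
main = psutil.Process()  # with LocalCluster the scheduler lives in this process

for minute in range(120):  # watch for two hours
    # Re-list children each time in case the nanny restarts the worker.
    procs = [main] + main.children(recursive=True)
    rss_mb = sum(p.memory_info().rss for p in procs) / 1e6
    print(f"t={minute:4d} min  total RSS: {rss_mb:,.1f} MB", flush=True)
    time.sleep(60)
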

mrocklin (Member)

@bnaul did you end up finding any additional information on this issue?

bnaul (Contributor, Author) commented Mar 19, 2018

Nothing new. I also realized there's another layer of complexity: this was happening inside a Docker container, so there's other stuff going on that makes it even harder to diagnose. I would probably say close this, but I'll leave it up to you.

ameetshah1983 commented Apr 20, 2018

We are running the dask scheduler on a Windows VM, and memory utilization gradually increases until system memory usage reaches 98%. We then have to restart the scheduler, as otherwise we receive timeouts from workers trying to connect. This takes a few days; the VM has 16GB of memory allocated.

We are currently on distributed 1.21.3 and dask 0.17.1.

One thing to add: in our case the cluster is not completely idle but does have jobs running from time to time. Please let me know if this should be filed as a separate issue in that case.

mrocklin (Member) commented Apr 20, 2018 via email

ameetshah1983

Upgraded to dask 0.17.2 and distributed 1.21.6. Even with no jobs being run, the memory slowly keeps increasing. It does take time; currently it's at 5.5GB, but we have seen it grow to 14GB.

cpaulik commented Jun 14, 2018

I have the same problem. For me the memory keeps increasing until the machine crashes. Did you find a solution to this problem? How can I run dask-scheduler to debug this?

mrocklin (Member)

Help would be welcome from anyone who is able to provide more concrete detail about what causes any sort of memory leak. It would be especially valuable to find a minimal example that reliably produces the leak.

mrocklin (Member)

How can I run dask-scheduler to debug this?

Any normal mechanisms to track memory use in Python would be fine.
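
For example, one such mechanism is tracemalloc pointed at the scheduler from a client. A rough sketch follows; note that connecting a client does slightly change the "idle, no client" condition, and that tracemalloc only sees Python-level allocations, not RSS growth from allocator fragmentation. The two helper functions are defined here for illustration, not distributed APIs.

import tracemalloc
from dask.distributed import Client

client = Client("localhost:8786")

def start_tracing():
    tracemalloc.start()

def top_allocations(n=10):
    # Return the n largest allocation sites currently alive on the scheduler.
    snapshot = tracemalloc.take_snapshot()
    return [str(stat) for stat in snapshot.statistics("lineno")[:n]]

client.run_on_scheduler(start_tracing)
# ... leave the cluster idle for a while, then:
print("\n".join(client.run_on_scheduler(top_allocations)))
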

cpaulik commented Aug 2, 2018

We are still running into this issue, and I have not yet been able to find a minimal example.

There are two things that I noticed in our processing chain that might cause issues:

  1. We are running in Docker containers on Google Cloud. Our idle workers always show about 10% memory consumption. From "Significant CPU usage when cluster is idle on large machine" #2079 I understood that this should not be the case, so it might be related to some container settings.

  2. Our processing uses compiled extensions that do not release the GIL. Could that be a cause for a memory increase?

I will still try to put together a minimal example, but I thought this information might help narrow things down a little.

mrocklin (Member) commented Aug 2, 2018 via email

cpaulik commented Aug 2, 2018

Yes, but the scheduler will only import e.g. numpy or pandas once, right? Our workers never crash and don't seem to have any memory leakage; it is just the scheduler for us.

I'll keep looking for a minimal example that I can reproduce locally.

mpeaton commented Oct 2, 2018

Workers crash due to OOM when calling ddf.to_parquet() on large files. Also of note: client.cancel() of futures containing the aforementioned call fails to free memory in the workers.
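
One check worth trying after the cancel (a sketch; assumes psutil is available on the workers): force a garbage-collection pass on every worker and compare RSS before and after, to see whether the memory is still referenced by Python objects or simply not being returned to the OS by the allocator.

import gc
import os

import psutil
from dask.distributed import Client

client = Client("localhost:8786")

def rss_mb():
    # Resident set size of the worker process, in MB.
    return psutil.Process(os.getpid()).memory_info().rss / 1e6

print("before:", client.run(rss_mb))
client.run(gc.collect)  # run a full collection on every worker
print("after: ", client.run(rss_mb))
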

chinmaychandak

@mrocklin Any update on this?

In the process of debugging some memory issues I noticed that memory usage of a scheduler+worker with no client connection was steadily increasing over time.

I am seeing a similar memory increase. Even with no client connection, the scheduler + worker memory keeps increasing. Is this because of the heartbeat/connection/some other logs being kept in system memory and not being cleaned out?

I was debugging some memory leak issues and found that restarting the workers every hour or so (I need Dask workers to run for longer periods of time) helped by wiping the workers' memory slate clean (though this bug remains), but this issue seems orthogonal to it.

mrocklin (Member)

@mrocklin Any update on this?

The last update I see on this issue is in 2018, so I'm guessing not.

Is this because of the heartbeat/connection/some other logs being kept in system memory and not being cleaned out?

I doubt it, but if anyone wants to investigate this and report back that would be welcome.

If people want to help resolve this issue then I think that the best thing to do is to provide a minimal reproducible example, preferably something that people can observe on their laptop.

I was debugging some memory leak issues and found that restarting the workers every hour or so (I need Dask workers to run for longer periods of time) helped by wiping the workers' memory slate clean (though this bug remains), but this issue seems orthogonal to it.

You might want to check out the --lifetime-* options in the dask-worker CLI for automatic help with this.
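
For reference, an invocation along these lines would restart each worker roughly every hour (check dask-worker --help on your version for the exact flag names and defaults):

dask-worker localhost:8786 --lifetime 1h --lifetime-stagger 5m --lifetime-restart
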

chinmaychandak

If people want to help resolve this issue then I think that the best thing to do is to provide a minimal reproducible example, preferably something that people can observe on their laptop.

Ok, I will try to create a minimal reproducer and post it here.

amohar2 commented Nov 2, 2021

I am seeing this issue on my end, with dask and distributed versions 2021.9.1.
The issue is easy to reproduce from the CLI: all I need to do is start dask-scheduler in one terminal and a dask-worker (with --nprocs 1 --nthreads 1) in another. Here is the output of top:
$ top | grep scheduler
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4888 user 20 0 1123128 91400 27876 S 0.7 0.0 0:02.51 dask-scheduler
The resident set size (RES) steadily increases; for example, after 20-30 minutes the output of top is:
$ top | grep scheduler
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4888 user 20 0 1130736 98708 27876 S 0.8 0.0 0:22.88 dask-scheduler
(RES went from 91400 to 98708)

The same steady increase can be seen on the worker as well:
$ top | grep python
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28549 user 20 0 1421252 94220 28192 S 0.7 0.0 0:02.60 python3.8
(after a few minutes)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28549 user 20 0 1425604 98640 28192 S 0.7 0.0 0:07.32 python3.8
(RES went from 94220 to 98640)

Even when starting ONLY dask-scheduler (i.e., no dask workers at all), I can still see this slow but steady increase in the RES value of the dask-scheduler process.
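
To help tell whether that growth is Python objects accumulating or just allocator-level behavior, here is a sketch that runs a bare scheduler in-process and logs its RSS alongside two Python-level counters. It assumes psutil is installed; the async Scheduler usage follows the pattern in the distributed docs and may need adjusting for your version.

import asyncio
import gc
import sys

import psutil
from distributed import Scheduler

async def main(minutes=120):
    async with Scheduler() as scheduler:  # bare scheduler, no workers
        proc = psutil.Process()
        for _ in range(minutes):
            rss_mb = proc.memory_info().rss / 1e6
            print(f"rss={rss_mb:8.1f} MB  "
                  f"py_objects={len(gc.get_objects()):9d}  "
                  f"alloc_blocks={sys.getallocatedblocks():10d}", flush=True)
            await asyncio.sleep(60)

asyncio.run(main())

If the object and block counts stay flat while RSS climbs, the growth is more likely fragmentation or an allocation outside the Python heap than a Python-level leak.
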
