Idle memory use increasing over time #1795
@bnaul did you end up finding any additional information on this issue? |
Nothing new; I also realized that there's another layer of complexity since this was happening inside a Docker container, so there's other stuff going on that makes it even harder to diagnose. I would probably say close this but I'll leave it up to you. |
We are running the dask scheduler on a Windows VM, and memory utilization gradually increases until system memory usage reaches 98%. We then have to restart the scheduler, as otherwise we receive timeouts from workers trying to connect. This does take a few days, and the allocated memory for the VM is 16GB. We are currently on distributed '1.21.3' and dask '0.17.1'. Sorry, one thing to add: in our case the grid is not completely idle but does have jobs running from time to time. Please let me know if this should be listed as a separate issue in that case. |
Can you update to the latest release of distributed and see if this problem persists? |
Upgraded to dask 0.17.2 and distributed 1.21.6. Even with no job being run, the memory slowly keeps increasing. It does take time to increase. Currently it's at 5.5GB, but I have seen it grow to 14GB. |
I have the same problem. For me the memory keeps increasing until the machine crashes. Did you find a solution to this problem? How can I run dask-scheduler to debug this? |
Help would be welcome from anyone who is able to provide more concrete detail about what causes any sort of memory leak. It would be especially valuable to find a minimal example that reliably produces the leak. |
Any normal mechanisms to track memory use in Python would be fine. |
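(For anyone picking this up, here is a minimal sketch of two such mechanisms: psutil for the OS-level RSS and tracemalloc for Python-level allocations. psutil and the use of client.run_on_scheduler are assumptions for illustration, not something prescribed in this thread; the sampling points are arbitrary.)

```python
# Sketch: track memory use of a long-running Python process, e.g. by sending
# this to the scheduler via client.run_on_scheduler. Assumes psutil is installed.
import tracemalloc

import psutil


def rss_mb():
    """Resident set size of the current process in MiB (OS-level view)."""
    return psutil.Process().memory_info().rss / 2**20


tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# ... let the process sit idle (or do work) for a while ...

current = tracemalloc.take_snapshot()
print(f"RSS: {rss_mb():.1f} MiB")
for stat in current.compare_to(baseline, "lineno")[:10]:
    # Largest Python-level allocation growth since the baseline snapshot.
    print(stat)
```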
We are still running into this issue, and I have not yet been able to find a minimal example. There are two things I noticed in our processing chain that might cause issues:
1. We are running in Docker containers on Google Cloud. Our idle workers always have 10% memory consumption. From #2079 I understood that this should not be the case, so it might be related to some container settings.
2. Our processing uses compiled extensions that do not release the GIL. Could that be a cause for a memory increase?
I will try to get a minimal example still, but I thought this information might help narrow it down a little. |
Memory consumption might just be things like importing numpy and pandas, which are substantial. To the best of my knowledge, GIL-bound functions shouldn't have any effect on memory use. A minimal example probably remains the best way to make progress on this problem. |
Yes, but the scheduler will only import e.g. numpy or pandas once? Our workers never crashed and don't seem to have any memory leakage. It is just the scheduler for us. I'll keep looking for a minimal example that I can reproduce locally. |
Workers crash due to OOM when running ddf.to_parquet() on large files. Also of note: client.cancel() of futures containing the aforementioned method fails to free memory in the workers. |
@mrocklin Any update on this?
I am seeing a similar memory increase. Even with no client connection, the scheduler + worker memory keeps increasing. Is this because of heartbeat/connection/other logs being kept in system memory and not being cleaned out? I was debugging some memory leak issues and found that restarting the workers every hour or so (I require Dask workers to run for longer periods of time) helped by giving the workers a clean memory slate (although this bug was still present), but this issue seems orthogonal to it. |
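(A rough sketch of the restart-on-a-timer workaround described above; the scheduler address and the one-hour interval are made up for illustration, and client.restart() also drops any in-flight task state.)

```python
# Periodically restart all workers to clear their memory, as described above.
import time

from distributed import Client

client = Client("tcp://scheduler-address:8786")  # hypothetical scheduler address

while True:
    time.sleep(3600)   # wait roughly an hour between restarts
    client.restart()   # restarts worker processes, wiping their memory
```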
The last update I see on this issue is in 2018, so I'm guessing not.
I doubt it, but if anyone wants to investigate this and report back that would be welcome. If people want to help resolve this issue then I think that the best thing to do is to provide a minimal reproducible example, preferably something that people can observe on their laptop.
You might want to check out the |
Ok, I will try to create a minimal reproducer and post it here. |
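(A possible shape for such a reproducer, sketched under assumptions not taken from this thread: an idle LocalCluster, psutil installed, and an arbitrary cluster size, sampling interval, and duration. It simply logs the scheduler's RSS over time with no work submitted.)

```python
# Sketch of a minimal reproducer: start an idle cluster and log scheduler RSS.
import time

import psutil
from distributed import Client, LocalCluster


def scheduler_rss():
    # Runs inside the scheduler process; returns its RSS in MiB.
    return psutil.Process().memory_info().rss / 2**20


if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    # Submit no work at all; just watch the scheduler's memory while idle.
    for minute in range(0, 12 * 60, 10):
        rss = client.run_on_scheduler(scheduler_rss)
        print(f"t={minute:4d} min  scheduler RSS = {rss:.1f} MiB", flush=True)
        time.sleep(600)
```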
I am seeing this issue on my end, with dask and distributed versions 2021.9.1. The same steady increase can be seen on the worker as well. Even with ONLY starting dask-scheduler (i.e., no dask workers being started) I can still see this slow but steady increase in the RES value of the dask-scheduler process. |
In the process of debugging some memory issues I noticed that memory usage of a scheduler+worker with no client connection was steadily increasing over time.
Command:
dask-scheduler --no-bokeh & dask-worker localhost:8786 --nthreads 1 --nprocs 120 --memory-limit 3.2GB --no-bokeh &
(default config.yaml)
Result: after a few idle hours, total memory usage went from about 1GB at startup to 4GB (as reported by the Google Cloud dashboard).
I'm aware that there are a lot of subtleties around measuring memory usage on Linux so I'm not sure if this is a real issue or maybe an artifact of the measurement process, but it seemed like a lot of memory for totally inactive processes. Curious if anyone has any thoughts about what might be happening.