[Bug] Worker stuck in "closing-gracefully" state #3018
Comments
Thank you for the detailed bug report and analysis @AnesBenmerzoug. This looks like it might be the same as #1930? The replicate code is not great, but has been useful enough to find its way into various code paths. I'm actually a little surprised that issues like yours are as rare as they are. I think that there are two paths forward here:
|
@mrocklin It took some time, but I managed to recreate the issue locally. I noticed that it happens when a worker that is being closed holds many keys in memory.

Failing example:

import logging
import time
from random import randint

import dask.bag as db
import distributed
from loguru import logger


class InterceptHandler(logging.Handler):
    def emit(self, record):
        # Retrieve context where the logging call occurred, this happens to be in the 6th frame upward
        logger_opt = logger.opt(depth=6, exception=record.exc_info)
        logger_opt.log(logging.getLevelName(record.levelno), record.getMessage())


logging.getLogger("distributed").handlers = [InterceptHandler()]


def busy_work(x):
    time.sleep(randint(1, 10))
    x = list(map(lambda a: a + 1, x))
    return x


def main():
    cluster = distributed.LocalCluster(
        n_workers=4,
        threads_per_worker=4,
        lifetime="5s",
        lifetime_stagger="2s",
        lifetime_restart=True,
        silence_logs=logging.DEBUG,
    )
    with distributed.Client(cluster) as client:
        numbers = list(range(10000))
        numbers_bag = db.from_sequence(numbers, npartitions=len(numbers))
        numbers_bag = numbers_bag.remove(lambda x: x % 2 == 0)
        numbers_bag = numbers_bag.repartition(npartitions=len(numbers) // 10)
        numbers_bag = numbers_bag.map_partitions(busy_work)
        numbers_bag = numbers_bag.map_partitions(busy_work)
        delayed_number = numbers_bag.to_delayed()
        futures = client.compute(delayed_number)
        distributed.wait(futures)


if __name__ == "__main__":
    main()

Requirements:

This snippet requires loguru in addition to dask and distributed.
loguru is just a convenient logging package that has good defaults and that can be used to intercept unexpected exceptions.

While trying to recreate the error in question I stumbled into another error (see the stack trace below). So maybe changing the check in close_gracefully to add an extra guard for None would make sense, e.g.:

    if self.status and self.status.startswith("clos"):
        return

Stack trace:

2019-09-03 13:35:23.705 | INFO | distributed.worker:_register_with_scheduler:780 - -------------------------------------------------
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f1a58e82748>>, <Task finished coro=<Worker.close_gracefully() done, defined at /opt/venv/lib/python3.6/site-packages/distributed/worker.py:1104> exception=AttributeError("'NoneType' object has no attribute 'startswith'",)>)
Traceback (most recent call last):
File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/opt/venv/lib/python3.6/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/opt/venv/lib/python3.6/site-packages/distributed/worker.py", line 1110, in close_gracefully
if self.status.startswith("closing"):
AttributeError: 'NoneType' object has no attribute 'startswith' |
I'm encountering the same problem: unsure what the underlying cause is, but self is

distributed/distributed/worker.py, lines 1173 to 1188 at 1fe50c2
|
@mrocklin Has there been any update on this? I'm getting a similar error pretty consistently. My use case requires workers to run for longer periods of time (streaming jobs, for example, wherein Dask workers keep processing batches of data every minute or so), and I'm periodically restarting the workers using the lifetime settings.

FYI: I have over-provisioned my Dask cluster to ensure that there are enough active workers available to process data while a set of other workers is restarting. Can someone please help? |
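For readers following along, here is a minimal sketch of the lifetime-based rolling-restart setup being described, using the same LocalCluster keywords as the reproducer earlier in this thread; the durations and worker counts are placeholder values, not recommendations:

import logging

import distributed

# Each worker closes itself gracefully after roughly `lifetime`, randomized by
# `lifetime_stagger` so workers don't all restart at once; with
# lifetime_restart=True the nanny then starts a fresh worker in its place.
cluster = distributed.LocalCluster(
    n_workers=4,                   # placeholder
    threads_per_worker=2,          # placeholder
    lifetime="30 minutes",         # placeholder restart interval
    lifetime_stagger="5 minutes",  # placeholder stagger window
    lifetime_restart=True,
    silence_logs=logging.ERROR,
)
client = distributed.Client(cluster)
# ... submit the long-running / streaming work as usual ...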
FWIW, I never managed to get the |
@chinmaychandak the updates on this are mostly as you see them on the issue. I think the phrasing (though most likely not intended) adds stress to the maintainers of dask. What would be more helpful in pushing this issue forward is a minimal reproducible example |
@SultanOrazbayev My understanding is that

My stack trace:
|
Apologies if the phrasing was inappropriate! Definitely didn't intend to offend/stress out anyone. I saw a minimal reproducer posted already, so I thought this issue had gotten overlooked or something. But I will definitely try to create my own reproducer, and will post it here. |
Oh yeah, you're right. I guess the next thing for someone to do is to try to figure out why it's failing. Maybe that someone is you?

I think that what people may not understand is that it's no one's job to fix these issues. The people who do so are often doing so for free as volunteers, or because they need to fix them to solve some problem that they're having at work. Unfortunately, people sometimes treat these community GitHub issue trackers as a place where they go to ask people to do free work for them. They look a lot like the other issue trackers they use in their workplace to ask other teams in their company to do work for them, which is reasonable given that those teams are paid.

Instead, I encourage you to think of these issue trackers as a place to collaborate on work. @AnesBenmerzoug was kind enough to make a reproducer (at significant personal cost, it sounds like). Great, so who can take up the torch and work from there? Alternatively, if there are people paid by your company to fix these problems then maybe you can point them here and they can do this work. People like me volunteer our time to help shepherd this process along, but we're not here to fix everyone's problems for free. There are too many problems to fix, unfortunately, and we tend to be pretty busy fixing the problems that people pay us to fix.
If the reproducer provided by @AnesBenmerzoug matches your situation then great. I just ran it but after a few minutes I'm not quite sure what I'm looking at, so I'm probably going to move on. Maybe you can help investigate here? |
I definitely agree with everything here, and I again sincerely apologize for the inappropriate phrasing. I never intended to ask people to fix my problem, or to stress you or any of the other maintainers out. I think it's brilliant that so many important features are getting merged into Dask, and into open-source projects in general! :)
Yes, I am going to try to investigate this soon. Will post findings here. |
Hi @chinmaychandak, did you have some success in debugging the issue? Is there some workaround, e.g. force-restarting the worker, or does this also destroy the futures residing on it? |
Hey @Hoeze, I wasn't able to figure out how to fix the issue, but I did a couple of things as a workaround:
|
Thanks a lot for your tips @chinmaychandak :) |
Hi all, I'm encountering a similar issue with dask/distributed 2021.10. I'm also using lifetime and stagger to periodically clean up RAM.
This results in tasks being held up indefinitely in that "zombie" worker. |
We're currently working on making graceful downscaling much more robust, which should avoid the above replicate assertion error. See #5381 for the current WIP. We're struggling with a few flaky tests but are hoping to merge soon.
I have a cluster composed of 8 workers distributed across Kubernetes pods in AWS.
I noticed that sometimes one of the workers gets stuck in the "closing-gracefully" state because of an assertion error in the worker's close_gracefully() method.
For some reason the scheduler still tries to send tasks to this worker, which just fail after some time because the worker does not actually execute them.
Manually closing the worker using the client's retire_workers() method works, and I'm currently using it as a workaround.
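As a rough illustration (the addresses below are placeholders, not values from this report), the retire_workers() workaround can look like this:

from distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Inspect the cluster to find the worker stuck in "closing-gracefully" ...
workers = client.scheduler_info()["workers"]
print(list(workers))

# ... then retire it explicitly so the scheduler stops assigning tasks to it.
stuck = "tcp://10.0.0.5:40331"  # placeholder: address of the stuck worker
client.retire_workers(workers=[stuck])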
After digging around the code base a bit, I found that the part of the code responsible for this behaviour is the scheduler's replicate() method. The failing assertion, which I did not completely understand, is not handled properly, which leaves the worker unable to close cleanly.
From analyzing the expression I could conclude the following:

- n_missing is greater than 0, otherwise the method would have returned
- branching_factor's default value is used, which is 2

From those two points it seems that len(ts.who_has) is 0 (see the sketch below).

Unfortunately, I did not yet find a minimal example to reproduce this issue.
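To spell that inference out, here is a tiny sketch. It assumes the failing assertion has roughly the shape count = min(n_missing, branching_factor * len(ts.who_has)) followed by assert count > 0; that shape is an assumption made for illustration, not a quote of the replicate() source:

# Hypothetical illustration of why len(ts.who_has) must be 0 for the
# assertion to fail, given n_missing > 0 and a branching_factor of 2.
branching_factor = 2  # default value mentioned above
n_missing = 3         # any value > 0; replicate() would have returned otherwise

for len_who_has in (2, 1, 0):
    count = min(n_missing, branching_factor * len_who_has)
    print(f"len(ts.who_has)={len_who_has} -> count={count}")

# Only len(ts.who_has) == 0 yields count == 0, i.e. the key held by the closing
# worker has no other replica anywhere, which is what trips the assertion.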
Stack Trace
Workers status
Manually Closing the Worker
Scheduler Info