Deadlock - task not running #5366

Open

chrisroat opened this issue Sep 28, 2021 · 23 comments

@chrisroat
Contributor

What happened:

When running on an auto-scaling GKE cluster using dask-gateway, I sometimes find computation halting mid-graph. One or more workers will have tasks, but are not actually doing any work. Often the logs contain a traceback from some sort of failure.

@fjetter

What you expected to happen:

The graph to run to completion.

Minimal Complete Verifiable Example:

I don't have a reproducible example, as the graph can sometimes succeed and will often succeed if I kill the worker(s) that are stuck. I am including scheduler and worker info as per @fjetter's script (#5068), as well as GKE logs showing a traceback earlier in the process.
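
For reference, here is a minimal sketch of the kind of state collection such a script performs. This is not the actual script from #5068, just an illustration using Client.run and Client.run_on_scheduler; the scheduler address is a placeholder.

import gzip
import pickle
from dask.distributed import Client

client = Client("tls://scheduler-address:8786")  # placeholder address

def scheduler_state(dask_scheduler):
    # Runs on the scheduler: record each task's scheduler-side state.
    return {key: str(ts.state) for key, ts in dask_scheduler.tasks.items()}

def worker_state(dask_worker):
    # Runs on every worker: record each task's worker-side state.
    return {key: str(ts.state) for key, ts in dask_worker.tasks.items()}

with gzip.open("scheduler.pkl.gz", "wb") as f:
    pickle.dump(client.run_on_scheduler(scheduler_state), f)

with gzip.open("worker.pkl.gz", "wb") as f:
    pickle.dump(client.run(worker_state), f)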

Anything else we need to know?:

K8s log
distributed.worker - ERROR - failed during get data with tls://10.16.132.2:38819 -> tls://10.16.95.2:33407

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 198, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1527, in get_data
    response = await comm.read(deserializers=serializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 214, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in : Stream is closed

distributed.core - INFO - Lost connection to 'tls://10.16.95.2:44406'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 198, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 495, in handle_comm
    result = await result
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1527, in get_data
    response = await comm.read(deserializers=serializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 214, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in : Stream is closed

Environment:

  • Dask version: 2021.09.1+9.g9f587507
  • Distributed version: 2021.09.1+14.gef281377
  • Python version: python3.8
  • Operating System: ubuntu 20.04
  • Install method (conda, pip, source): pip

scheduler.pkl.gz
worker.pkl.gz

@chrisroat
Contributor Author

FYI -- if I go to the "Call Stack" for the task in the Dask UI, I find that it says

"Task not actively running. It may be finished or not yet started"

@chrisroat
Contributor Author

FWIW, if I turn off autoscaling and keep the clusters at full size, I haven't seen deadlocks. Could the extra communication from autoscaling be interfering? I will investigate whether I can adjust the rate at which the autoscaling operates.

@fjetter
Member

fjetter commented Sep 29, 2021

Thanks for the report and the cluster dump. After a brief look at the dump, this looks like the issue I'm currently investigating (reported offline). This is very helpful! I'll let you know once there is a fix available.

How reliably can you reproduce this? Does it happen on every run or only every ~100th?

@chrisroat
Contributor Author

I'm glad it is useful. It seems to happen fairly regularly if I let the cluster fluctuate in size.

Another data point: I did a test with 2021.9.1 with autoscaling set with min=max so the cluster would remain a constant size, and I didn't see any deadlocking. I thought this might disambiguate whether the problem was the autoscaling communication versus workers being added and removed. But if you already have a root cause, that's even better news.
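
For context, the two configurations being compared here look roughly like this with dask-gateway (the worker counts are illustrative, and connection/auth options are omitted):

from dask_gateway import Gateway

gateway = Gateway()  # connection and auth options are environment-specific
cluster = gateway.new_cluster()

# Autoscaling setup that exhibits the deadlocks:
cluster.adapt(minimum=10, maximum=50)

# Constant-size test described above (min == max), which did not deadlock:
cluster.adapt(minimum=50, maximum=50)
# equivalently, a fixed size without adaptive scaling:
cluster.scale(50)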

@jrbourbeau
Member

@chrisroat if you get a chance could you try again with the latest main branch of distributed? We had hoped that #5379 would fix this deadlock, but wanted to confirm whether or not this is the case

@chrisroat
Contributor Author

I will try it out sometime this week, before your planned release on Friday.

@chrisroat
Contributor Author

I'm sorry to report that I'm still seeing deadlocking, using the repos as of a few hours ago. I see deadlocks using 2021.09.1 as well. My clusters autoscale, and I find I can mitigate deadlocking by making my clusters full size and turning off autoscaling (at the expense of running a lot of resources if there are straggler tasks).

dask: 2021.09.1+50.g7b1f3069
distributed: 2021.09.1+49.ga8151a69

I'm attaching the worker/scheduler dumps that @fjetter's script writes out.

worker_20211019.pkl.gz
scheduler_20211019.pkl.gz

@chrisroat
Contributor Author

FYI, I am using a dask-gateway setup when I'm talking about autoscaling. I'm not sure how the various dask distributed parameters interact with similarly named ones in the gateway, or whether there is anything I could tweak in either set that might affect the problems I'm seeing.
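
One way to at least see which distributed settings actually end up in effect on the deployed workers is a sketch like the following (the scheduler address is a placeholder; the gateway's server-side options are configured separately and not shown here):

import dask
from dask.distributed import Client

client = Client("tls://scheduler-address:8786")  # placeholder address

def worker_config(dask_worker):
    # Runs on each worker and returns its effective distributed.worker config.
    return dask.config.get("distributed.worker")

print(dask.config.get("distributed.worker"))  # client-side view
print(client.run(worker_config))              # per-worker view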

@fjetter
Member

fjetter commented Oct 20, 2021

We do not expect current main to be stable. There is a fix available which is not merged yet; see #5426.

I'll have a look at the dump you provided to see if I can relate it somehow

@fjetter
Member

fjetter commented Oct 20, 2021

I just realized that you've been using the branch of #5426 already. I can see a lot of exceptions popping up because the scheduler instructs a worker to release a key too soon. That causes a KeyError and likely gets that worker stuck. I'll need a bit more time to understand why this key is released too soon, but the cluster dump is helpful!

@fjetter
Member

fjetter commented Oct 20, 2021

From the cluster dump, I would assume that you're not stuck but the computations fail hard. Am I wrong about this?

@chrisroat
Contributor Author

The workflow is actually stuck. A few workers have many tasks, and many workers have no tasks. But there is no movement on the task graph.

I believe I am using main, so not sure I have anything related to #5426.

@jrbourbeau
Member

I believe I am using main, so not sure I have anything related to #5426.

In that case, would you mind trying #5426 out? There was another user who reported similar-sounding deadlock behavior which #5426 fixes (see #5426 (comment)).

@fjetter
Member

fjetter commented Oct 21, 2021

I believe I am using main, so not sure I have anything related to #5426.

Right, I saw a change in the logs but that was merged in main already.


I'm unfortunately struggling to make sense of the cluster dump you provided. For instance, worker dask-worker-b4fc9974f78346e89a89c52f8eed31fb-nkcnm is likely stuck because there was an exception while trying to schedule a task A. That exception occurred because task A was trying to load its dependency B, which is no longer in memory, causing a KeyError. So far, so good. Trying to find out why B is no longer in memory, I investigated the scheduler, and it doesn't know anything about either A or B.

Has this worker been reused? Was the scheduler restarted, the same worker connected to a new scheduler, and a new computation scheduled on the new scheduler? From the logs, it doesn't look like it; at least the scheduler hostname + IP remained the same, but that could just as well indicate a proxy between scheduler and worker. I'm not familiar with how dask-gateway works.

@chrisroat
Contributor Author

OK, I re-ran using your branch. I am still seeing deadlocking. On the dashboard, I see a lot of tasks piling up on a few workers. I had two clusters running, using workers with 6 threads. The clusters start with 10 workers and can autoscale up to 50. A full scale-up is triggered immediately based on the workload, but it can take a few minutes for GCP to spin up the nodes.

The first cluster hit a KilledWorker on its first graph. This happens sometimes because work assigned to the initial 10 nodes does not redistribute well when nodes are added. (I could add more workers and/or reduce memory consumption.) This first cluster then ran a second graph. Since it was "warm" and nodes were already provisioned, it distributed work nicely from the start. But it still deadlocked, and one worker is sitting with 500 tasks.

Task graph of first cluster:
[Screenshot attached: Screen Shot 2021-10-21 at 1.11.18 PM]

Debug info from first cluster
scheduler_20211021.pkl.gz
worker_20211021.pkl.gz

The 2nd cluster ran its first graph up until it deadlocked. Its debug info:

scheduler_20211021-2.pkl.gz
worker_20211021-2.pkl.gz

[Note that worker logs are on GDrive, as I think I was hitting max file size limits on GitHub]

@chrisroat
Contributor Author

@fjetter Is there anything I can do to help/explore the issue here?

@chrisroat
Contributor Author

@fjetter - Could this issue be somehow specific to dask-gateway setups?

@fjetter
Member

fjetter commented Nov 9, 2021

Sorry for the delay. Parsing these logs typically requires a bit of time. I had a peek at the 2nd cluster you provided logs for and could see a few interesting exceptions.

I am working on that over in #5507, but am still missing a reproducer, so the "fix" might not be sufficient yet.

xref #5481, #5457

@chrisroat
Contributor Author

Thanks for taking a look. Do you have anywhere on the web where I can tip you a few coffees for your work on this?

Let me know if I can help test in some way (e.g. by merging #5507 and #5457?).

@fjetter
Member

fjetter commented Nov 10, 2021

Do you have anywhere on the web where I can tip you a few coffees for your work on this?

Keep that thought in case we ever meet at a conference ;) In the meantime, don't worry about it. I'm not doing this on my own time!

I'll ping here again once I have something of substance

@chrisroat
Contributor Author

chrisroat commented Dec 15, 2021

I just ran 11 of the graphs with 96d4fd4 (2021.12.0+), and haven't seen a deadlock. This is great! Typically, well over half of these graphs get stuck/deadlocked.

While it's hard to declare victory in cluster situations, since the background conditions evolve based on availability, pre-emption, etc., I am hopeful!

I do appreciate all the work @fjetter is doing! Kudos.

I will be running more workloads over the coming days, and will report back.

@chrisroat
Contributor Author

Happy New Year!

On a sadder note, I wanted to provide an update that I do have a graph that just hit a deadlock. It has 36k tasks, and deadlocked for a couple of hours with 1 task processing and 4 waiting. (Being near the end, I thought maybe it was an I/O task that got wedged talking to GCS. Alas, it is a computational step.)

The worker logs have

2022-01-01 13:11:32.141 PST distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 488, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 417, in deserialize
    return loads(header, frames)
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 56, in dask_loads
    loads = dask_deserialize.dispatch(typ)
  File "/opt/conda/lib/python3.8/site-packages/dask/utils.py", line 614, in dispatch
    raise TypeError(f"No dispatch for {cls}")
TypeError: No dispatch for <class 'numpy.ndarray'>

2022-01-01 13:11:32.143 PST distributed.worker - ERROR - No dispatch for <class 'numpy.ndarray'>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 681, in log_errors
    yield
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2971, in gather_dep
    response = await get_data_from_worker(
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 4259, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 4239, in _get_data
    response = await send_recv(
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 663, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 233, in read
    msg = await from_frames(
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/utils.py", line 76, in from_frames
    res = await offload(_from_frames)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 1317, in offload
    return await loop.run_in_executor(
  File "/opt/conda/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 1318, in <lambda>
    _offload_executor, lambda: context.run(fn, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default

I'm also attaching dumps from scheduler and worker:
worker_20220101.pkl.gz
scheduler_20220101.pkl.gz
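
The "No dispatch for <class 'numpy.ndarray'>" above is raised by dask's Dispatch registry (dask/utils.py in the traceback) when no deserializer is registered for a type. A toy illustration of that failure mode, using a dummy class rather than distributed's real registry:

from dask.utils import Dispatch

toy_deserialize = Dispatch("toy_deserialize")

class Foo:
    pass

try:
    # No handler was ever registered for Foo, so dispatch raises, mirroring
    # the "No dispatch for ..." TypeError in the worker log above.
    toy_deserialize.dispatch(Foo)
except TypeError as exc:
    print(exc)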

@chrisroat
Contributor Author

chrisroat commented Jan 5, 2022

I have now had several deadlocks occur, so I don't think we are out of the woods yet. In addition to the one in the previous comment, here is another. This one does not have the dispatch errors. The scheduler just has some messages like:

WARNING - Listener on 'tls://0.0.0.0:8786': TLS handshake failed with remote 'tls://10.16.129.2:56772': EOF occurred in violation of protocol (_ssl.c:1131)
INFO - Connection from tls://10.16.0.15:54470 closed before handshake completed

dump_ba2351b92f024e2ca986819e8a4d34af_20220105.msgpack.gz
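
For anyone who wants to poke at the attached dump locally, a minimal sketch (assuming it is a gzip-compressed msgpack mapping, as the file name suggests):

import gzip
import msgpack

with gzip.open("dump_ba2351b92f024e2ca986819e8a4d34af_20220105.msgpack.gz", "rb") as f:
    state = msgpack.unpack(f, raw=False)

# Top-level sections vary with the dump tooling; inspect before digging in.
print(list(state.keys()))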

I was able to get the graph to complete by deleting the pods of two workers that were processing tasks.
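
An alternative to deleting the pods directly, assuming the scheduler is still responsive, is to retire the stuck workers through the client (the scheduler and worker addresses below are placeholders):

from dask.distributed import Client

client = Client("tls://scheduler-address:8786")  # placeholder address

# Ask the scheduler to retire the two stuck workers; close_workers=True shuts
# down the worker processes, after which the pods can be reclaimed.
stuck = ["tls://10.16.0.10:40000", "tls://10.16.0.11:40000"]  # placeholder worker addresses
client.retire_workers(workers=stuck, close_workers=True)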
