Deadlock - task not running #5366

Open

chrisroat opened this issue Sep 28, 2021 · 23 comments

@chrisroat
Contributor

What happened:

When running on an auto-scaling GKE cluster using dask-gateway, I sometimes find computation halting mid-graph. One or more workers will have tasks, but are not actually doing any work. Often the logs contain a traceback from some sort of failure.

@fjetter

What you expected to happen:

The graph to run to completion.

Minimal Complete Verifiable Example:

I don't have a reproducible example, as the graph can sometimes succeed and will often succeed if I kill the worker(s) that are stuck. I am including scheduler and worker info as per @fjetter's script (#5068), as well as GKE logs showing a traceback earlier in the process.
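
For reference, here is a minimal sketch of the kind of state collection such a script performs. This is not the actual script from #5068, just an illustration using Client.run and Client.run_on_scheduler; the scheduler address is a placeholder.

import gzip
import pickle
from dask.distributed import Client

client = Client("tls://scheduler-address:8786")  # placeholder address

def scheduler_state(dask_scheduler):
    # Runs on the scheduler: record each task's scheduler-side state.
    return {key: str(ts.state) for key, ts in dask_scheduler.tasks.items()}

def worker_state(dask_worker):
    # Runs on every worker: record each task's worker-side state.
    return {key: str(ts.state) for key, ts in dask_worker.tasks.items()}

with gzip.open("scheduler.pkl.gz", "wb") as f:
    pickle.dump(client.run_on_scheduler(scheduler_state), f)

with gzip.open("worker.pkl.gz", "wb") as f:
    pickle.dump(client.run(worker_state), f)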

Anything else we need to know?:

K8s log
distributed.worker - ERROR - failed during get data with tls://10.16.132.2:38819 -> tls://10.16.95.2:33407

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 198, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1527, in get_data
    response = await comm.read(deserializers=serializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 214, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in : Stream is closed

distributed.core - INFO - Lost connection to 'tls://10.16.95.2:44406'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 198, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 495, in handle_comm
    result = await result
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1527, in get_data
    response = await comm.read(deserializers=serializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 214, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in : Stream is closed

Environment:

  • Dask version: 2021.09.1+9.g9f587507
  • Distributed version: 2021.09.1+14.gef281377
  • Python version: python3.8
  • Operating System: ubuntu 20.04
  • Install method (conda, pip, source): pip

scheduler.pkl.gz
worker.pkl.gz

@chrisroat
Contributor Author

FYI -- if I go to the "Call Stack" for the task in the Dask UI, I find that it says

"Task not actively running. It may be finished or not yet started"

@chrisroat
Contributor Author

FWIW, if I turn off autoscaling and keep the clusters at full size, I haven't seen deadlocks. Could the extra communication from autoscaling be interfering? I will investigate whether I can adjust the rate at which the autoscaling operates.

@fjetter
Member

fjetter commented Sep 29, 2021

Thanks for the report and the cluster dump. After a brief look at the dump, this looks like the issue I'm currently investigating (reported offline). This is very helpful! I'll let you know once there is a fix available.

How reliably can you reproduce this? Does it happen on every run or only every ~100th?

@chrisroat
Contributor Author

I'm glad it is useful. It seems to happen fairly regularly if I let the cluster fluctuate in size.

Another data point: I did a test with 2021.9.1 with autoscaling set with min=max so the cluster would remain a constant size, and I didn't see any deadlocking. I thought this might disambiguate whether the problem was the autoscaling communication versus workers being added and removed. But if you already have a root cause, that's even better news.
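
For context, the two configurations being compared here look roughly like this with dask-gateway (the worker counts are illustrative, and connection/auth options are omitted):

from dask_gateway import Gateway

gateway = Gateway()  # connection and auth options are environment-specific
cluster = gateway.new_cluster()

# Autoscaling setup that exhibits the deadlocks:
cluster.adapt(minimum=10, maximum=50)

# Constant-size test described above (min == max), which did not deadlock:
cluster.adapt(minimum=50, maximum=50)
# equivalently, a fixed size without adaptive scaling:
cluster.scale(50)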

@jrbourbeau
Member

@chrisroat if you get a chance could you try again with the latest main branch of distributed? We had hoped that #5379 would fix this deadlock, but wanted to confirm whether or not this is the case

@chrisroat
Contributor Author

I will try it out sometime this week, before your planned release on Friday.

@chrisroat
Contributor Author

I'm sorry to report that I'm still seeing deadlocking, using the repos as of a few hours ago. I see deadlocks using 2021.09.1 as well. My clusters autoscale, and I find I can mitigate deadlocking by making my clusters full size and turning off autoscaling (at the expense of running a lot of resources if there are straggler tasks).

dask: 2021.09.1+50.g7b1f3069
distributed: 2021.09.1+49.ga8151a69

I'm attaching the worker/scheduler dumps that @fjetter's script writes out.

worker_20211019.pkl.gz
scheduler_20211019.pkl.gz

@chrisroat
Contributor Author

FYI, I am using a dask-gateway setup when I'm talking about autoscaling. I'm not sure how the various dask distributed parameters interact with similarly named ones in the gateway, or whether there is anything I could tweak in either set that might affect the problems I'm seeing.
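
One way to at least see which distributed settings actually end up in effect on the deployed workers is a sketch like the following (the scheduler address is a placeholder; the gateway's server-side options are configured separately and not shown here):

import dask
from dask.distributed import Client

client = Client("tls://scheduler-address:8786")  # placeholder address

def worker_config(dask_worker):
    # Runs on each worker and returns its effective distributed.worker config.
    return dask.config.get("distributed.worker")

print(dask.config.get("distributed.worker"))  # client-side view
print(client.run(worker_config))              # per-worker view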

@fjetter
Member

fjetter commented Oct 20, 2021

We do not expect current main to be stable. There is a fix available which is not merged yet; see #5426.

I'll have a look at the dump you provided to see if I can relate it somehow

@fjetter
Member

fjetter commented Oct 20, 2021

I just realized that you've been using the branch of #5426 already. I can see a lot of exceptions popping up because the scheduler instructs a worker to release a key too soon. That causes a KeyError and likely gets that worker stuck. I'll need a bit more time to understand why this key is released too soon, but the cluster dump is helpful!

@fjetter
Member

fjetter commented Oct 20, 2021

From the cluster dump, I would assume that you're not stuck but the computations fail hard. Am I wrong about this?

@chrisroat
Contributor Author

The workflow is actually stuck. A few workers have many tasks, and many workers have no tasks. But there is no movement on the task graph.

I believe I am using main, so not sure I have anything related to #5426.

@jrbourbeau
Member

I believe I am using main, so not sure I have anything related to #5426.

In that case, would you mind trying #5426 out? There was another user who reported similar-sounding deadlock behavior which #5426 fixes (see #5426 (comment)).

@fjetter
Member

fjetter commented Oct 21, 2021

I believe I am using main, so not sure I have anything related to #5426.

Right, I saw a change in the logs but that was merged in main already.


I'm unfortunately struggling to make sense of the cluster dump you provided. For instance, worker dask-worker-b4fc9974f78346e89a89c52f8eed31fb-nkcnm is likely stuck because there was an exception while trying to schedule a task A. That exception occurred because task A was trying to load its dependency B, which is no longer in memory, causing a KeyError. So far, so good. Trying to find out why B is no longer in memory, I investigated the scheduler, and it doesn't know anything about either A or B.

Has this worker been reused? Was the scheduler restarted, the same worker connected to a new scheduler, and a new computation scheduled on the new scheduler? From the logs, it doesn't look like it; at least the scheduler hostname + IP remained the same, but that could just as well indicate a proxy between scheduler and worker. I'm not familiar with how dask-gateway works.

@chrisroat
Contributor Author

OK, I re-ran using your branch. I am still seeing deadlocking. On the dashboard, I see a lot of tasks piling up on a few workers. I had two clusters running, using workers with 6 threads. The clusters start with 10 workers and can autoscale up to 50. A full scale-up is triggered immediately based on the workload, but it can take a few minutes for GCP to spin up the nodes.

The first cluster hit a KilledWorker on its first graph. This happens sometimes because work assigned to the initial 10 nodes does not redistribute well when nodes are added. (I could add more workers and/or reduce memory consumption.) This first cluster then ran a second graph. Since it was "warm" and nodes were already provisioned, it distributed work nicely from the start. But it still deadlocked, and one worker is sitting with 500 tasks.

Task graph of first cluster:
[Screenshot attached: Screen Shot 2021-10-21 at 1.11.18 PM]

Debug info from first cluster
scheduler_20211021.pkl.gz
worker_20211021.pkl.gz

The 2nd cluster ran its first graph up until it deadlocked. Its debug info:

scheduler_20211021-2.pkl.gz
worker_20211021-2.pkl.gz

[Note that worker logs are on GDrive, as I think I was hitting max file size limits on GitHub]

@chrisroat
Contributor Author

@fjetter Is there anything I can do to help/explore the issue here?

@chrisroat
Contributor Author

@fjetter - Could this issue be somehow specific to dask-gateway setups?

@fjetter
Member

fjetter commented Nov 9, 2021

Sorry for the delay. Parsing these logs typically requires a bit of time. I had a peek at the 2nd cluster you provided logs for and could see a few interesting exceptions.

I am working on that over in #5507, but am still missing a reproducer, so the "fix" might not be sufficient yet.

xref #5481, #5457

@chrisroat
Contributor Author

Thanks for taking a look. Do you have anywhere on the web where I can tip you a few coffees for your work on this?

Let me know if I can help test in some way (e.g. by merging #5507 and #5457?).

@fjetter
Member

fjetter commented Nov 10, 2021

Do you have anywhere on the web where I can tip you a few coffees for your work on this?

Keep that thought in case we ever meet at a conference ;) In the meantime, don't worry about it. I'm not doing this on my own time!

I'll ping here again once I have something of substance

@chrisroat
Contributor Author

chrisroat commented Dec 15, 2021

I just ran 11 of the graphs with 96d4fd4 (2021.12.0+), and haven't seen a deadlock. This is great! Typically, well over half of these graphs get stuck/deadlocked.

While it's hard to declare victory in cluster situations, since the background conditions evolve based on availability, pre-emption, etc., I am hopeful!

I do appreciate all the work @fjetter is doing! Kudos.

I will be running more workloads over the coming days, and will report back.

@chrisroat
Contributor Author

Happy New Year!

On a sadder note, I wanted to provide an update that I do have a graph that just hit a deadlock. It has 36k tasks, and deadlocked for a couple of hours with 1 task processing and 4 waiting. (Being near the end, I thought maybe it was an I/O task that got wedged talking to GCS. Alas, it is a computational step.)

The worker logs have

2022-01-01 13:11:32.141 PST distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 488, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 417, in deserialize
    return loads(header, frames)
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 56, in dask_loads
    loads = dask_deserialize.dispatch(typ)
  File "/opt/conda/lib/python3.8/site-packages/dask/utils.py", line 614, in dispatch
    raise TypeError(f"No dispatch for {cls}")
TypeError: No dispatch for <class 'numpy.ndarray'>

2022-01-01 13:11:32.143 PST distributed.worker - ERROR - No dispatch for <class 'numpy.ndarray'>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 681, in log_errors
    yield
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2971, in gather_dep
    response = await get_data_from_worker(
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 4259, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 4239, in _get_data
    response = await send_recv(
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 663, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 233, in read
    msg = await from_frames(
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/utils.py", line 76, in from_frames
    res = await offload(_from_frames)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 1317, in offload
    return await loop.run_in_executor(
  File "/opt/conda/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 1318, in <lambda>
    _offload_executor, lambda: context.run(fn, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 111, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 103, in _decode_default

I'm also attaching dumps from scheduler and worker:
worker_20220101.pkl.gz
scheduler_20220101.pkl.gz
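
The "No dispatch for <class 'numpy.ndarray'>" above is raised by dask's Dispatch registry (dask/utils.py in the traceback) when no deserializer is registered for a type. A toy illustration of that failure mode, using a dummy class rather than distributed's real registry:

from dask.utils import Dispatch

toy_deserialize = Dispatch("toy_deserialize")

class Foo:
    pass

try:
    # No handler was ever registered for Foo, so dispatch raises, mirroring
    # the "No dispatch for ..." TypeError in the worker log above.
    toy_deserialize.dispatch(Foo)
except TypeError as exc:
    print(exc)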

@chrisroat
Contributor Author

chrisroat commented Jan 5, 2022

I have now had several deadlocks occur, so I don't think we are out of the woods yet. In addition to the one in the previous comment, here is another. This one does not have the dispatch errors. The scheduler just has some messages like:

WARNING - Listener on 'tls://0.0.0.0:8786': TLS handshake failed with remote 'tls://10.16.129.2:56772': EOF occurred in violation of protocol (_ssl.c:1131)
INFO - Connection from tls://10.16.0.15:54470 closed before handshake completed

dump_ba2351b92f024e2ca986819e8a4d34af_20220105.msgpack.gz
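
For anyone who wants to poke at the attached dump locally, a minimal sketch (assuming it is a gzip-compressed msgpack mapping, as the file name suggests):

import gzip
import msgpack

with gzip.open("dump_ba2351b92f024e2ca986819e8a4d34af_20220105.msgpack.gz", "rb") as f:
    state = msgpack.unpack(f, raw=False)

# Top-level sections vary with the dump tooling; inspect before digging in.
print(list(state.keys()))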

I was able to get the graph to complete by deleting the pods of two workers that were processing tasks.
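
An alternative to deleting the pods directly, assuming the scheduler is still responsive, is to retire the stuck workers through the client (the scheduler and worker addresses below are placeholders):

from dask.distributed import Client

client = Client("tls://scheduler-address:8786")  # placeholder address

# Ask the scheduler to retire the two stuck workers; close_workers=True shuts
# down the worker processes, after which the pods can be reclaimed.
stuck = ["tls://10.16.0.10:40000", "tls://10.16.0.11:40000"]  # placeholder worker addresses
client.retire_workers(workers=stuck, close_workers=True)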
