
P2P: Unresponsive worker causes P2P shuffle to transition to fail #8595

hendrikmakait opened this issue Mar 20, 2024 · 0 comments
hendrikmakait commented Mar 20, 2024

I've seen a P2P shuffle fail during TPC-H benchmarking on Coiled because a worker became unresponsive. While the built-in retry mechanism for P2P data transfer kicked in as intended, the broadcast within the P2P barrier timed out and caused the barrier task to fail, which in turn caused the computation to fail.

Scheduler logs:

2024-03-20 13:11:00.572000 distributed.shuffle._scheduler_plugin - WARNING - Shuffle beea560b8bd40fc4de6e6b669d7ed149 initialized by task ('hash-join-transfer-beea560b8bd40fc4de6e6b669d7ed149', 945) executed on worker tls://10.0.17.70:36095
2024-03-20 13:12:00.360000 distributed.scheduler - ERROR - broadcast to tls://10.0.25.92:36145 failed: OSError: Timed out trying to connect to tls://10.0.25.92:36145 after 30 s
2024-03-20 13:12:00.362000 distributed.core - ERROR - Exception while handling op shuffle_barrier
  Traceback (most recent call last):
    File "/opt/coiled/env/lib/python3.11/site-packages/distributed/utils.py", line 1935, in wait_for
      return await fut
             ^^^^^^^^^
    File "/opt/coiled/env/lib/python3.11/site-packages/distributed/comm/tcp.py", line 546, in connect
      stream = await self.client.connect(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/coiled/env/lib/python3.11/site-packages/tornado/tcpclient.py", line 292, in connect
      stream = await stream.start_tls(
               ^^^^^^^^^^^^^^^^^^^^^^^
  asyncio.exceptions.CancelledError
  The above exception was the direct cause of the following exception:
  Traceback (most recent call last):
    File "/opt/coiled/env/lib/python3.11/site-packages/distributed/comm/core.py", line 342, in connect
      comm = await wait_for(
             ^^^^^^^^^^^^^^^
    File "/opt/coiled/env/lib/python3.11/site-packages/distributed/utils.py", line 1934, in wait_for
      async with asyncio.timeout(timeout):
    File "/opt/coiled/env/lib/python3.11/asyncio/timeouts.py", line 115, in __aexit__
      raise TimeoutError from exc_val
  TimeoutError
  The above exception was the direct cause of the following exception:
  Traceback (most recent call last):
    File "/opt/coiled/env/lib/python3.11/site-packages/distributed/core.py", line 970, in _handle_comm
      result = await result
               ^^^^^^^^^^^^
    File "/opt/coiled/env/lib/python3.11/site-packages/distributed/shuffle/_scheduler_plugin.py", line 96, in barrier
      await self.scheduler.broadcast(
    File "/opt/coiled/env/lib/python3.11/site-packages/distributed/scheduler.py", line 6389, in broadcast
      results = await All()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/coiled/env/lib/python3.11/site-packages/distributed/utils.py", line 251, in All
      result = await tasks.next()
               ^^^^^^^^^^^^^^^^^^
    File "/opt/coiled/env/lib/python3.11/site-packages/distributed/scheduler.py", line 6364, in send_message
      comm = await self.rpc.connect(addr)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/coiled/env/lib/python3.11/site-packages/distributed/core.py", line 1620, in connect
      return await self._connect(addr=addr, timeout=timeout)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/coiled/env/lib/python3.11/site-packages/distributed/core.py", line 1564, in _connect
      comm = await connect(
             ^^^^^^^^^^^^^^
    File "/opt/coiled/env/lib/python3.11/site-packages/distributed/comm/core.py", line 368, in connect
      raise OSError(
  OSError: Timed out trying to connect to tls://10.0.25.92:36145 after 30 s
2024-03-20 13:12:00.487000 distributed.shuffle._scheduler_plugin - WARNING - Shuffle beea560b8bd40fc4de6e6b669d7ed149 deactivated due to stimulus 'task-erred-1710940320.4399164'
2024-03-20 13:12:00.501000 distributed.shuffle._scheduler_plugin - WARNING - Shuffle 17f21dfad0311f33b21dbe81197ac0be deactivated due to stimulus 'task-erred-1710940320.4399164'
2024-03-20 13:12:00.504000 distributed.shuffle._scheduler_plugin - WARNING - Shuffle 95a044ea89a359cb067e08d3c664323c deactivated due to stimulus 'task-erred-1710940320.4399164'
2024-03-20 13:12:00.701000 distributed.shuffle._scheduler_plugin - WARNING - Shuffle d93a2ed9984df2d8be33c6540a84cf6c deactivated due to stimulus 'client-releases-keys-1710940320.5152714'

Instead of failing hard in this scenario, P2P should probably retry the broadcast once or twice and, if that still fails, restart the entire shuffle without the struggling worker.
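A minimal sketch of what that first option could look like around the `scheduler.broadcast(...)` call from the traceback above. The helper name, attempt count, and backoff are made up; only the broadcast call and the `OSError` it raises come from the logs:

```python
import asyncio


async def _broadcast_with_retry(scheduler, msg, workers, attempts=3, delay=1.0):
    # Hypothetical helper: wrap the barrier broadcast so a single
    # unresponsive worker does not immediately fail the barrier task.
    for attempt in range(1, attempts + 1):
        try:
            return await scheduler.broadcast(msg=msg, workers=workers)
        except OSError:
            # OSError("Timed out trying to connect ...") is what the
            # scheduler raised in the logs above.
            if attempt == attempts:
                raise
            await asyncio.sleep(delay * attempt)  # simple linear backoff
```

Once the retries are exhausted, the plugin could restart the shuffle with the unresponsive worker excluded from the participant set instead of letting the barrier task err.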

Alternatively, P2P could retry until the worker-ttl mechanism kicks in and drops the unresponsive worker. This would be less intrusive but might still fail for straggling workers that are not completely unresponsive.
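A rough sketch of the second option, retrying until worker-ttl has had a chance to remove the worker. The deadline handling, sleep interval, and helper name are assumptions; the `distributed.scheduler.worker-ttl` config key and the scheduler's `workers` mapping exist in distributed:

```python
import asyncio

import dask
from dask.utils import parse_timedelta


async def _broadcast_until_worker_ttl(scheduler, msg, workers, poll_interval=5.0):
    # Hypothetical helper: keep retrying the barrier broadcast roughly
    # until worker-ttl has expired, dropping workers the scheduler has
    # already removed before each attempt.
    ttl = parse_timedelta(dask.config.get("distributed.scheduler.worker-ttl")) or 300
    deadline = asyncio.get_running_loop().time() + ttl
    while True:
        alive = [w for w in workers if w in scheduler.workers]
        try:
            return await scheduler.broadcast(msg=msg, workers=alive)
        except OSError:
            if asyncio.get_running_loop().time() >= deadline:
                raise
            await asyncio.sleep(poll_interval)
```

The trade-off is the one noted above: a straggler that still heartbeats occasionally is never dropped by worker-ttl, so the broadcast could still time out at the deadline.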
