I've seen a P2P shuffle fail during TPC-H benchmarking on Coiled because a worker became unresponsive. While the built-in retry mechanism for P2P data transfer kicked in as intended, the broadcast within the P2P barrier timed out and caused the barrier task to fail, which in turn caused the computation to fail.
Scheduler logs:
Instead of failing hard in this scenario, P2P should probably retry the broadcast once or twice and restart the entire shuffle without the struggling worker.
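To make the proposal concrete, here is a minimal sketch of "retry the broadcast once or twice, then report which workers to exclude". It is not the `distributed` implementation: `broadcast_once`, `barrier_with_retry`, and the `send` callable are hypothetical names, and the sketch assumes the barrier message is idempotent, so only workers that have not acknowledged yet need to be contacted again.

```python
import asyncio


async def broadcast_once(workers, send, timeout):
    """Send to every worker concurrently; return the set that did not acknowledge.

    `send` is a hypothetical coroutine function taking a worker address.
    """

    async def one(worker):
        try:
            await asyncio.wait_for(send(worker), timeout)
            return None  # acknowledged in time
        except (asyncio.TimeoutError, OSError):
            return worker  # timed out or connection failed

    results = await asyncio.gather(*(one(w) for w in workers))
    return {w for w in results if w is not None}


async def barrier_with_retry(workers, send, timeout=1.0, retries=2):
    """Return the workers to exclude from a restarted shuffle (empty on success).

    Assumption: the message is idempotent, so each retry only targets
    the workers that failed in the previous attempt.
    """
    pending = set(workers)
    for _ in range(retries + 1):  # one initial attempt plus `retries` retries
        pending = await broadcast_once(pending, send, timeout)
        if not pending:
            return set()
    return pending
```

A caller would restart the shuffle on `set(workers) - excluded` whenever the returned set is non-empty, instead of failing the barrier task outright.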
Alternatively, P2P could retry until the worker-ttl mechanism kicks in and drops the unresponsive worker. This would be less intrusive but might still fail for straggling workers that are not completely unresponsive.
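The alternative can be modelled in the same toy fashion. The sketch below re-reads the set of live workers before every round, so the broadcast succeeds as soon as some external mechanism (standing in for worker-ttl here) has dropped the unresponsive worker. `barrier_until_dropped`, `live_workers`, and `send` are hypothetical names, not `distributed` APIs, and what happens to the shuffle after the worker is dropped is outside the sketch.

```python
import asyncio


async def _acked(send, worker, timeout):
    """Return True if `worker` acknowledged the message within `timeout`."""
    try:
        await asyncio.wait_for(send(worker), timeout)
        return True
    except (asyncio.TimeoutError, OSError):
        return False


async def barrier_until_dropped(live_workers, send, timeout, max_rounds):
    """Retry the broadcast against the *current* worker set each round.

    `live_workers` is a hypothetical callable returning the workers the
    scheduler still considers alive. Returns the set that acknowledged,
    or raises TimeoutError once `max_rounds` rounds have failed.
    """
    for _ in range(max_rounds):
        workers = set(live_workers())  # may shrink once a worker is dropped
        results = await asyncio.gather(
            *(_acked(send, w, timeout) for w in workers)
        )
        if all(results):
            return workers
    raise TimeoutError("barrier broadcast did not complete")
```

The caveat from the text shows up directly: a straggler that keeps heartbeating never leaves `live_workers()`, so every round fails and the loop ends in `TimeoutError`.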