Add back worker reconnection #6391

Open
gjoseph92 opened this issue May 20, 2022 · 0 comments


If the network connection between a worker and the scheduler was broken, the worker used to try to reconnect and negotiate its state with the scheduler.

It turned out that the logic around re-establishing the network connection (#5481), re-negotiating the state (#6341), and handling the disconnect on the scheduler side (#6354) was all buggy and a source of deadlocks. Though the change is disruptive, we opted to remove the reconnection option entirely (#6350) for the sake of short-term stability.

However, in the long term, we do want workers to be resilient to temporary network failures. We'll want to add worker reconnection back in once contracts around BatchedSend and worker disconnection are tightened up.
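For illustration only, here is a minimal sketch of the general shape a retry loop for temporary network failures could take. It is not the removed implementation and does not use any real `distributed` API: `reconnect_with_backoff` and the `connect` callable are hypothetical names, and `connect` stands in for whatever would re-establish the connection and re-negotiate state with the scheduler.

```python
import asyncio
import random


async def reconnect_with_backoff(connect, *, attempts=5, base_delay=0.1, max_delay=5.0):
    """Await the hypothetical ``connect`` coroutine function until it succeeds.

    ``connect`` is an assumption made for this sketch: a zero-argument
    coroutine function that raises ``OSError`` (or a subclass such as
    ``ConnectionError``) when the network is unavailable.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return await connect()
        except OSError:
            if attempt == attempts:
                # Out of retries: re-raise so the caller can decide to shut down.
                raise
            # Sleep with jitter so many workers do not retry in lockstep.
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
            # Exponential backoff, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

The loop bounds the number of attempts and re-raises the last error, so a worker that cannot reach the scheduler still fails loudly rather than hanging. The hard parts described above (state re-negotiation and scheduler-side disconnect handling) are deliberately outside this sketch.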

Requires:

Note that I'm intentionally not tracking this in #6384, since those are only meant to be short-term tasks. This is likely not something we'll tackle for a bit.
