Properly support restarting `BatchedSend` #5481

gjoseph92 · 2021-10-29T19:11:48Z

In #5480 we found that when `Worker.batched_stream` is restarted after a broken connection to the scheduler, it enters a broken state where `send` succeeds but no data is actually sent (it just sits in the buffer forever).

This is probably fixed by #5457, but I think it may deserve a more thorough fix. If we want BatchedSend to be restartable, it should have a clear interface and tests for this.

xref #4133 #4163 #5377

fjetter · 2021-11-02T13:36:52Z

> If we want BatchedSend to be restartable, it should have a clear interface [...]

What would that look like, and what value would it deliver? So far, the restart on reconnect starts the BatchedComm again, and the BatchedComm takes care of the rest. I'm wondering what value a "clear" interface would yield. It feels like every change to this interface would make the code using it more complex.

> [...] and tests for this.

I personally consider the tests added in #5457 to be sufficient and if something is missing, let's discuss it over there.

- Test for the worker disconnecting and restarting the batched comm. It ensures the payload is not lost and is submitted in the correct order after reconnect: `test_worker.py::test_dont_lose_payload_reconnect`
- The batched comm itself is ensured to retain any potentially lost payload if disconnected. Upon reconnect, the payload is properly submitted: `test_batched.py::test_retain_buffer_commclosed`

What is missing?
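For readers unfamiliar with the behavior those two tests describe, here is a toy, synchronous sketch of "retain the payload on disconnect, submit it in order after reconnect". It is not the code from #5457 or from either test: `FlakyComm` and `RetainingBuffer` are hypothetical names invented for this illustration, and the real `BatchedSend` is asynchronous.

```python
class FlakyComm:
    """Hypothetical comm: writes fail once ``closed`` is set."""

    def __init__(self):
        self.closed = False
        self.received = []

    def write(self, payload):
        if self.closed:
            raise ConnectionError("comm closed")
        self.received.extend(payload)


class RetainingBuffer:
    """Hypothetical sender that keeps unsent messages when a write fails."""

    def __init__(self, comm):
        self.comm = comm
        self.buffer = []

    def send(self, msg):
        # Only queues the message; nothing is written until flush().
        self.buffer.append(msg)

    def flush(self):
        payload, self.buffer = self.buffer, []
        try:
            self.comm.write(payload)
        except ConnectionError:
            # Put the unsent payload back in front of anything queued since,
            # so ordering is preserved for the next attempt.
            self.buffer = payload + self.buffer
            return False
        return True


old_comm = FlakyComm()
sender = RetainingBuffer(old_comm)
sender.send("a")
old_comm.closed = True            # connection breaks before the flush
flushed_before = sender.flush()   # write fails, "a" stays buffered

sender.send("b")                  # queued while disconnected
new_comm = FlakyComm()
sender.comm = new_comm            # "reconnect" with a fresh comm
flushed_after = sender.flush()    # both messages go out, in order
```

The point of the sketch is only the `except` branch: a failed write must not drop the payload, and the retained payload must go ahead of newer messages.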

gjoseph92 · 2022-05-05T15:40:54Z

Just for historical understanding: I believe this was sort of introduced in #3493. Before that, we were always creating a new BatchedSend instead of restarting the existing one:

distributed/distributed/worker.py

Lines 866 to 867 in 2a05299

    
    self.batched_stream = BatchedSend(interval="2ms", loop=self.loop)
    self.batched_stream.start(comm)

Basic assert statements and documentation in BatchedSend.start would have made it clear to future developers that BatchedSend.start could not be called multiple times and prevented this bug. That's what I meant by a clear interface. Then that PR wouldn't have been able to so easily misuse a 5-year-old API and introduce this subtle error condition.
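As a rough illustration of the kind of guard and documentation meant here, consider the minimal sketch below. `OneShotSender` is a hypothetical stand-in, not the real `BatchedSend`; its attributes and error messages are assumptions made up for the example.

```python
class OneShotSender:
    """Hypothetical stand-in for a batched sender (not distributed's BatchedSend).

    ``start`` may be called at most once per instance. To use a new
    connection, create a new instance instead of restarting this one.
    """

    def __init__(self):
        self.comm = None
        self.buffer = []
        self._started = False

    def start(self, comm):
        # Fail loudly on misuse rather than entering a silently broken state.
        if self._started:
            raise RuntimeError(
                "start() was already called; create a new instance to reconnect"
            )
        self._started = True
        self.comm = comm

    def send(self, msg):
        if not self._started:
            raise RuntimeError("send() called before start()")
        self.buffer.append(msg)


sender = OneShotSender()
sender.start(comm="first-connection")
sender.send({"op": "ping"})

try:
    sender.start(comm="second-connection")  # misuse: second start
except RuntimeError as exc:
    restart_error = str(exc)
```

The sketch raises `RuntimeError` instead of using a bare `assert`, since assertions are skipped when Python runs with `-O`; either way, the contract is stated in the docstring and enforced at the call site.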

gjoseph92 · 2022-05-24T20:00:56Z

With worker reconnection removed, this is no longer necessary #6361. The API and internal validation should still be tightened up though: #6389.

This was referenced Oct 29, 2021

Worker reconnection deadlock #5480

Closed

Is it intended that any error from a handler makes Server.handle_stream close the comm? #5483

Open

This was referenced Nov 3, 2021

Do not drop BatchedSend payload if worker reconnects #5457

Closed

KeyError in Worker.handle_compute_task (causes deadlock) #5482

Closed

jcrist mentioned this issue Nov 8, 2021

Use asyncio for TCP/TLS comms #5450

Merged

fjetter mentioned this issue Nov 9, 2021

Deadlock - task not running #5366

Open

gjoseph92 mentioned this issue Jan 20, 2022

Scheduler stops itself due to idle timeout, even though workers should still be working #5675

Open

fjetter mentioned this issue Jan 21, 2022

Conditions under which a TCP connection may fail / close? #5678

Closed

gjoseph92 mentioned this issue Apr 12, 2022

Include BatchedSend state in cluster dumps #6114

Open

This was referenced Apr 26, 2022

[Discussion] Structured concurrency #6201

Open

Add fail_hard decorator for worker methods #6210

Merged

This was referenced May 5, 2022

Fix batchedsend restart #6272

Closed

Make BatchedSend restartable #6329

Closed

gjoseph92 mentioned this issue May 20, 2022

Add back worker reconnection #6391

Open

gjoseph92 closed this as completed May 24, 2022

gjoseph92 closed this as not planned May 24, 2022
