
Do not drop BatchedSend payload if worker reconnects #5457

Closed

Conversation

fjetter (Member) commented Oct 22, 2021

If the scheduler<->worker stream closes during a background send, or a background send is attempted although the connection was already closed, we'd lose any payload data that was being submitted during that iteration.

I could trace the flakiness of test_worker_reconnects_mid_compute_multiple_states_on_scheduler back to this condition.

By simply prepending the payload to the buffer upon failure, the worker has a chance to resubmit it upon reconnect, since the BatchedSend instance is the same.

Closes #5377

  • Dedicated unit test

fjetter (Member Author) commented Oct 22, 2021

I spoke too soon, it seems. CI is still failing for test_worker_reconnects_mid_compute_multiple_states_on_scheduler even though I can no longer reproduce the timeout with this fix locally :(

https://github.com/dask/distributed/runs/3976598171?check_suite_focus=true

EDIT: I didn't see it again. I'm inclined to ignore this instance, since other deadlock-related fixes were merged on main and are part of this branch now.

@fjetter fjetter force-pushed the dropped_payload_messages_worker_reconnect branch from b4f2855 to bad077e on October 25, 2021 08:23
distributed/batched.py (review thread resolved)
distributed/tests/test_worker.py (review thread outdated, resolved)
Comment on lines +3225 to +3227
# Payload that couldn't be submitted is prepended
assert len(after) >= len(before)
assert after[: len(before)] == before
Collaborator

If everything went through before the close, then you're not testing anything useful.
From what I can read here, it might as well be what happens every time. The opposite might also be true: everything goes through AFTER the close.

Could you change to

assert before
assert len(after) > len(before)

while possibly drastically increasing the number of messages in the pipeline to avoid flakiness?

fjetter (Member Author) — Oct 29, 2021

  1. assert before is effectively already asserted, as assert len(w.batched_stream.buffer) == 100 in L3218.
  2. Increasing the number of messages sent should not affect anything, since the batched send is implemented as all-or-nothing.
  3. Actually, the most probable scenario is after == before, since you only get a difference if something in the scheduler sent another message while the background send was awaiting the write to the closed comm, which is an incredibly tiny time window. At first I had only the equality assert, but that caused this test to fail every now and then, so I relaxed it to greater-or-equal to avoid flakiness. Equal is fine; after > before is fine iff the first len(before) messages are the same as before.

If the payload had been submitted before closing, the buffer would be empty before I try to send.

If we absolutely want >, the only way I see is to schedule futures that try to send something concurrently with the await background_send in L 3220, but that would make the buffer assertion impossible, since we could no longer say for sure what the state before actually was.

Collaborator

Can you explicitly call batched_stream.send(..) after the comm has been closed, and test that it has the correct behavior? (As I mentioned in another comment I don't know what this behavior should be; raise an error or just add to the buffer?)

fjetter (Member Author)

That's been done as part of test_send_before_close. I didn't change the behaviour but extended the test a bit. The behaviour is as follows

  1. If a send is successful, the message is appended to the buffer. The BatchedSend is now responsible for delivery. The responsibility of the server is to check periodically and restart by providing a new comm if necessary.
  2. If an exception is raised, the code should raise immediately and not append to the buffer. Whatever logic was trying to send a message can then decide how the failure should be handled.

To me, that's sensible behaviour.
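A minimal sketch of that contract, just to make it concrete (illustrative only, not the code in this PR; CommClosedError is stubbed locally so the snippet stands alone):

    class CommClosedError(Exception):
        """Stand-in for the real CommClosedError from distributed.comm."""

    class BatchedSendSketch:
        def __init__(self):
            self.comm = None   # set by start(); may go stale when the stream dies
            self.buffer = []

        def send(self, msg):
            # Case 2: the comm is known to be closed -> raise immediately and do
            # NOT append, so whatever tried to send can decide how to react.
            if self.comm is not None and self.comm.closed():
                raise CommClosedError(f"Comm {self.comm!r} already closed.")
            # Case 1: accept the message; from here on the BatchedSend (plus the
            # server's periodic check-and-restart with a new comm) owns delivery.
            self.buffer.append(msg)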

gjoseph92 (Collaborator) left a comment

I believe this PR will close #5480 as it stands, which is great. However I do still think some cleanup of BatchedSend would be worthwhile to ensure other usage errors like it don't come up in the future.

What I meant about a "clear interface" in #5481 was just that the function contracts and invariants around how BatchedSend should behave are not well documented, and in most cases not validated by the functions themselves. This leads to effectively undefined behavior. Before this PR, calling start multiple times could be called undefined behavior. Now, it has the behavior we want, and there are tests for it, but I'd still say it's not that well defined, since there isn't yet:
a. documentation for it
b. validation for it within the class, to prevent future developers from misusing this behavior somehow
c. tests for the validation

In these comments I'm noting some things about BatchedSend that I think should be documented and tightened up. Most of them aren't part of the changes you've made here, but hopefully explain what I mean by wanting a clearly defined interface.

@@ -58,6 +58,7 @@ def __init__(self, interval, loop=None, serializers=None):

def start(self, comm):
self.comm = comm
self.please_stop = False
self.loop.add_callback(self._background_send)

def closed(self):
Collaborator

please_stop and stopped should probably factor into this as well

Collaborator

I still think this check is insufficient for telling whether we're closed or not. While we're in the process of closing (please_stop is True, or self.stopped is set, or _background_send is no longer running, etc.), it may still return True.

From the way this is used though, in both start and write, we probably want to treat "in the process of closing" as closed, not as running. Restarting a BatchedSend that's closing should be an error. If writing to a closed BatchedSend is an error, then so should be writing to one that's in the process of closing.
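A rough sketch of the stricter check argued for here, treating "in the process of closing" as closed; it assumes the existing comm / please_stop / stopped attributes and is not code from this PR:

    def closed(self):
        # Closed if the comm is gone or dead, if close() has been requested,
        # or if the background coroutine has already exited.
        return (
            self.comm is None
            or self.comm.closed()
            or self.please_stop
            or self.stopped.is_set()
        )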

payload = None # lose ref
# If anything failed we should not lose payload. If a new comm
# is provided we can still resubmit messages
self.buffer = payload + self.buffer
Collaborator

Are you sure we should retain payload? Couldn't those messages have been successfully sent, and the error occurred after? Then we might be duplicating them when we reconnect.

Though maybe we'd rather duplicate messages than drop them. In that case, let's add a note saying so.

fjetter (Member Author)

At this point I am assuming a certain behaviour of the comm: it writes either everything or nothing. That's likely not always true, but I believe we cannot do much about it at this level of abstraction. IMHO, that guarantee should be implemented by our protocol and/or the Comm interface.

Either way, I'm happy if you have any suggestions to improve this

Collaborator

After reading through TCP comm code, I don't think there's anywhere for an exception to happen after all of the payload has been sent. An exception could still happen when part has been sent and part hasn't, but either way, since we have to assume something here, I think it's more useful to assume that the payload hasn't been sent.

@jcrist and I were discussing this, and given the way the BatchedSend interface works, it actually needs to implement some sort of protocol with an ack from the receiver for each sent batch to guarantee messages can't be dropped. Since send on a BatchedSend is nonblocking, the caller is basically handing off full responsibility for the message to BatchedSend. If the message fails to send, it's too late to raise an error and let the caller figure out what to do about it—once we've been given a message, we have to ensure it's delivered. So the logical thing to do would be for the receiving side to ack each message, and only when ack'd does the sender drop the payload (with some deduplication of course).

OTOH there are lots of protocols out there for doing things like this, more performantly, robustly, and with better testing than we'll ever have. As with other things (framing in serialization), maybe the better solution is to stop duplicating functionality at the application level that should be the transport layer's job.

Could we get rid of BatchedSend entirely with some well-tuned TCP buffering settings + a serialization scheme that was more efficient for many small messages?
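Purely to illustrate the ack idea sketched above (nothing like this exists in the PR; the names pending, seq, handle_ack and the message format are all hypothetical):

    import itertools

    class AckedBatchedSender:
        """Keeps every batch until the receiver acknowledges its sequence number."""

        def __init__(self, comm):
            self.comm = comm
            self._seq = itertools.count()
            self.pending = {}  # seq -> batch, dropped only once ack'd

        async def send_batch(self, batch):
            seq = next(self._seq)
            self.pending[seq] = batch
            await self.comm.write({"seq": seq, "msgs": batch})

        def handle_ack(self, seq):
            # The receiver confirmed delivery; only now may the payload be dropped.
            self.pending.pop(seq, None)

        async def resend_pending(self, new_comm):
            # After a reconnect, replay everything unacknowledged; the receiver
            # must deduplicate by seq.
            self.comm = new_comm
            for seq, batch in sorted(self.pending.items()):
                await self.comm.write({"seq": seq, "msgs": batch})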

@@ -152,6 +155,7 @@ def close(self, timeout=None):
self.please_stop = True
self.waker.set()
yield self.stopped.wait(timeout=timeout)
payload = []
Collaborator

could move this below the if not self.comm.closed():

@@ -58,6 +58,7 @@ def __init__(self, interval, loop=None, serializers=None):

def start(self, comm):
self.comm = comm
self.please_stop = False
Collaborator

Suggested change:
- self.please_stop = False
+ self.please_stop = False
+ self.stopped.clear()
+ self.waker.clear()

Collaborator

What I mean by a "clear interface" is that I'd like a docstring for start saying the contract for what it can/can't do. For example, that start on a closed BatchedSend will restart it with a new comm. And that start on a currently-running BatchedSend raises an error, etc. What are the invariants for this internal state? How do all the methods ensure they are upheld?
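One possible shape for such a docstring (the wording is a suggestion, not text from the PR):

    def start(self, comm):
        """Attach a (new) comm and (re)start the background send loop.

        Contract:
        - ``comm`` must be open; passing an already-closed comm raises
          ``RuntimeError``.
        - Calling ``start`` on a closed BatchedSend restarts it; buffered
          messages are kept and will be flushed over the new comm.
        - Calling ``start`` while the BatchedSend is running, or still in the
          process of closing, is an error: at most one ``_background_send``
          may run at a time.
        """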

@@ -1238,6 +1238,7 @@ async def handle_scheduler(self, comm):
comm, every_cycle=[self.ensure_communicating, self.ensure_computing]
)
except Exception as e:
self.batched_stream.please_stop = True
Collaborator

Could we not mutate internal state of BatchedSend? This feels brittle. I see this is going to do something slightly different from batched_stream.close(). If we need the ability to do both things, let's have a method for doing whatever this is as well (request_stop?).

Though I'd prefer having just one close() function. If the behavior of the current close isn't useful to calling code, perhaps we should change its behavior to something that is useful.

fjetter (Member Author)

That was actually not even necessary. The BatchedSend infers automatically that the Comm is closed, so there is no need to close anything explicitly.

pass
# If we're closing and there is an error there is little we
# can do about this to recover.
logger.error("Lost %i payload messages.", len(payload))
yield self.comm.close()

def abort(self):
Collaborator

Should abort be part of the public interface? When/how should callers use abort vs close? What can and cannot be done with a BatchedSend after abort (or close) has been called? These are all things I'd like to see documented.

@@ -121,7 +125,6 @@ def _background_send(self):
# there was an exception when using `comm`.
# We can't close gracefully via `.close()` since we can't send messages.
# So we just abort.
# This means that any messages in our buffer our lost.
# To propagate exceptions, we rely on subsequent `BatchedSend.send`
# calls to raise CommClosedErrors.
self.stopped.set()
Collaborator

Can't comment there, but this is a comment for send:

        if self.comm is not None and self.comm.closed():
            raise CommClosedError(f"Comm {self.comm!r} already closed.")

This check feels insufficient. A more thorough check would have caught #5481. Should we also check not self.closed()? not self.please_stop? That the _background_send coroutine is still running? Things like that? It seems inconsistent to prohibit sends when the underlying comm is closed, but still allow them if the BatchedSend itself is effectively closed.

On the other hand, we may be relying on the fact that messages can be enqueued to a BatchedSend even when it's closed. If we expect the BatchedSend to be restarted and reused soon, perhaps there should be no restrictions on when send can be called.

I'd like to:

  1. Figure out exactly what the behavior is we actually want
  2. Write a docstring explaining the contract for when send can and cannot be called, how it behaves in the case of an open vs a closed underlying comm, an open vs closed BatchedSend, etc.
  3. Make the code match this contract
  4. Test that it does so
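As a concrete starting point for steps 1–3, here is one way a stricter send could look if a closing/closed BatchedSend is treated the same as a closed comm. Attribute names follow the existing class, but the policy itself is exactly what step 1 still has to decide, so this is a sketch rather than a proposal of record:

    def send(self, msg):
        # Reject writes whenever the BatchedSend is closed or closing, not only
        # when the underlying comm is already closed.
        if self.closed() or self.please_stop:
            raise CommClosedError(f"Cannot send to {self!r}: closed or closing.")
        self.message_count += 1
        self.buffer.append(msg)
        # Wake the background coroutine unless a flush is already scheduled.
        if self.next_deadline is None:
            self.waker.set()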

@@ -58,6 +58,7 @@ def __init__(self, interval, loop=None, serializers=None):

def start(self, comm):
self.comm = comm
self.please_stop = False
self.loop.add_callback(self._background_send)
Collaborator

One downside of using Tornado here is that we have no handle on this coroutine. If we did asyncio.create_task or similar, we could keep a Task object here.

  1. That might let us simplify the please_stop/stopped/abort logic by just letting us do task.cancel() and handling the CancelledError within _background_send. We would probably want to shield the comm.write call from cancellation if we did so.
  2. We could assert in send and elsewhere that the _background_send Task is not done(). If we're also checking please_stop, etc. this may be overkill, but it's a nice sanity check that something hasn't gone horribly wrong. The fact that the coroutine wasn't running was the cause of #5481 (Properly support restarting BatchedSend); a check for this would have made the bug much easier to find.
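A rough sketch of that alternative, assuming a hypothetical _background_task attribute initialised to None in __init__ (this is not a drop-in change; the rest of the class is Tornado-based today):

    import asyncio

    def start(self, comm):
        # Keeping a Task handle lets us cancel it and assert on task.done().
        if self._background_task is not None and not self._background_task.done():
            raise RuntimeError("BatchedSend is already running")
        self.comm = comm
        self._background_task = asyncio.create_task(self._background_send())

    async def close(self):
        if self._background_task is not None:
            self._background_task.cancel()
            try:
                # _background_send handles CancelledError itself and can shield
                # an in-flight comm.write from cancellation.
                await self._background_task
            except asyncio.CancelledError:
                pass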


assert len(after) >= len(before)
assert after[: len(before)] == before

await w.heartbeat()
Collaborator

I assume this is to make the worker reconnect to the scheduler. Is there a way we could wait for that to happen naturally, to test that it does so properly? Or do we have other good tests around worker reconnection that we're confident in?

fjetter (Member Author)

Yes, the heartbeat triggers a reconnect. There are test_worker_reconnects_mid_compute and test_worker_reconnects_mid_compute_multiple_states_on_scheduler, which test that the worker reconnects and corrects the scheduler state accordingly if any tasks are on the cluster. I consider testing the reconnect itself out of scope for this PR.

@fjetter fjetter force-pushed the dropped_payload_messages_worker_reconnect branch from 291de5d to 5795c5b on November 4, 2021 17:54
Comment on lines +295 to +303
await stream.write(b"")
# FIXME: How do I test this? Why is the stream closed _sometimes_?
# Diving into tornado, so far, I can only confirm that once the
# write future has been awaited, the entire buffer has been written
# to the socket. Not sure if one loop iteration is sufficient in
# general or just sufficient for the local tests I've been running
await asyncio.sleep(0)
if stream.closed():
raise StreamClosedError()
fjetter (Member Author)

Fixing this problem, reusing the buffer, etc. is relatively easy to do, assuming we can rely on Comm.write to raise if something is wrong. Tornado ensures that if we await BaseIOStream.write, the data is written to the socket; I dove into the code and can confirm this. To the best of my knowledge, that's working properly.
However, there is a time window where the data is written to the socket, the connection is closed, and Comm.write returns control, but the data was never actually delivered. I can catch these situations in my local unit tests by waiting a loop iteration, but I'm not sure what the mechanism behind this is and whether or not it is reliable.

cc @jcrist @gjoseph92

FWIW, I think even without this asyncio.sleep(0), this change should fix >99% of our problems. For a full fix we might need to reconsider the possibility for workers to reconnect statefully, or introduce another way to guarantee message delivery.

Collaborator

@fjetter to rephrase what you're saying:

  • awaiting the stream.write is a big improvement, because it means that comm.write doesn't return until the message has been successfully written to the socket buffer
  • That's still not enough to guarantee the message has actually been sent, because the socket could get closed after the message is handed off to the socket buffer, but before it's actually sent over the network.

"For a full fix we might need to reconsider the possibility for workers to reconnect statefully or introduce another way to guarantee message delivery."

I agree. Like I mentioned in another comment, I think we need a better protocol (either application-level or more careful use of TCP) to guarantee all messages are delivered through reconnects.

fjetter (Member Author)

That's what I figured but I was confused why I never encountered this for ordinary comms.

"""
if self.closed():
if comm.closed():
raise RuntimeError("Comm already closed.")
Collaborator

Suggested change:
- raise RuntimeError("Comm already closed.")
+ raise RuntimeError(f"Tried to start BatchedSend with an already-closed comm: {comm!r}.")

raise RuntimeError("Comm already closed.")
self.comm = comm
self.please_stop = False
self.loop.add_callback(self._background_send)
Collaborator

If self.closed is True, can we be certain this callback is not already running? Two _background_sends running at the same time might make a mess. I think there's currently a race condition:

  • _background_send is awaiting self.waker.wait
  • comm gets closed (nothing happens to BatchedSend state at this point; a closed comm is only detected once a write to it fails)
  • start is called with a new comm, launching a new _background_send coroutine
  • both coroutines are now blocking on self.waker.wait; it's a race condition which gets awakened first
  • both can run validly; from their perspective, nothing even happened (the comm was switched out under their noses while they were sleeping)

I'm not actually sure if, with the exact way the code is written now, two coroutines running at once can actually do something bad, but it still seems like a bad and brittle situation. "only one _background_send is running at once" feels like a key invariant of this class to me.

This is another way in which having an asyncio Task handle on the _background_send might make things easier to reason about (see the create_task comment earlier in this review).
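One way to uphold that invariant with the existing Tornado-style fields, sketched roughly (the PR does not do this): refuse to restart while the previous coroutine has not yet reported that it stopped.

    def start(self, comm):
        if comm.closed():
            raise RuntimeError("Comm already closed.")
        if self.comm is not None and not self.stopped.is_set():
            # A previous _background_send may still be parked on waker.wait()
            # with a comm that silently died; launching a second coroutine here
            # is exactly the race described above.
            raise RuntimeError("Previous background send loop still running")
        self.stopped.clear()
        self.comm = comm
        self.please_stop = False
        self.loop.add_callback(self._background_send)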

if comm.closed():
raise RuntimeError("Comm already closed.")
self.comm = comm
self.please_stop = False
Collaborator

Suggested change:
- self.please_stop = False
+ self.please_stop = False
+ self.stopped.clear()

although I also find the use of stopped a little odd, see the comment on cancel



if self.comm:
self.comm.abort()
yield self.close()
Collaborator

Why close and not abort?

  1. It's weird that calling close from user code will trigger the coroutine to call back into close
  2. Most of the things close does are already guaranteed to have happened by this point:
    • self.please_stop must be True to reach this line of code
    • self.stopped was just set a couple lines above
    • The comm was just aborted (not sure how that compares to await comm.close() though)

Not that abort is any better. The only thing it would do that hasn't already been done by this point is set self.comm = None.

I actually think we should move all shut-down/clean-up logic into the coroutine here, remove abort entirely, and have close() just be a function that tells the coroutine to please_stop and waits until it does. There's a lot of inconsistency from having shutdown logic in three different places.
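Sketching the consolidation proposed here (illustrative only): all teardown lives in the coroutine, and close() merely signals it and waits.

    async def close(self, timeout=None):
        self.please_stop = True
        self.waker.set()                          # wake the coroutine if it is waiting
        await self.stopped.wait(timeout=timeout)  # coroutine does the teardown

    async def _background_send(self):
        try:
            while not self.please_stop:
                ...  # wait for the interval / waker, then write one batch
        finally:
            # The single place for shutdown: close the comm and mark us stopped.
            if self.comm is not None and not self.comm.closed():
                await self.comm.close()
            self.comm = None
            self.stopped.set()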

self.buffer = []
self.please_stop = True
self.waker.set()
if not self.comm.closed():
self.comm.abort()
Collaborator

A comment for both here and close(): why should the BatchedSend close/abort the comm when the BatchedSend is closed? Does the caller expect that, by calling start, it has handed off lifecycle responsibility for the comm to the BatchedSend? If so, that should be documented ("once you call start, you must not directly use the comm object").

@@ -290,7 +292,15 @@ async def write(self, msg, serializers=None, on_error="message"):
stream._total_write_index += each_frame_nbytes

# start writing frames
stream.write(b"")
await stream.write(b"")
Collaborator

I'm confused, why were we not awaiting this before?!

I traversed through the blames and it looks like https://github.com/dask/distributed/pull/661/files#diff-581957c552b88dd04319efca1429c0c0827daff509a115542e70b661c4c04914R235 5 years ago is where we stopped awaiting this future. There was even some conversation about it: #661 (comment), xref #653.

Seems like there were some issues around concurrent writes to the same stream from multiple coroutines producing orphaned futures. I wonder if that's still the case?


@@ -333,6 +343,8 @@ def abort(self):
stream.close()

def closed(self):
if self.stream and self.stream.closed():
self.abort()
Collaborator

Seems odd to me that checking closed on a comm could have a side effect?

fjetter (Member Author)

Well, technically this sets self.stream = None and _closed = True. I wouldn't consider these side effects.

fjetter (Member Author) commented Nov 29, 2021

I will close this PR. I am no longer convinced that any payload should be reused; instead, I believe that the handshake upon reconnect should settle any state drift. Either way, this requires a bit more work upfront.
