
Conditions under which a TCP connection may fail / close? #5678

Closed
fjetter opened this issue Jan 21, 2022 · 3 comments
Labels
discussion Discussing a topic with no specific actions yet

Comments

@fjetter
Member

fjetter commented Jan 21, 2022

In a recent discussion around reconnecting clients, the question was raised of how reliable a TCP connection is and how reliably an unexpectedly closed connection can be interpreted as a dead remote.

In particular, the question is whether we need to (or even should) implement any stateful reconnect logic at all, or whether the network layer is reliable enough for us not to.

This question assumes that TCP keepalive (distributed config, linux docs) is configured and that the TCP User Timeout is sufficiently large that increased latencies, etc., can be effectively ignored.
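For concreteness, here is a minimal sketch (not distributed's actual code) of the socket options this assumption refers to. The numeric values are illustrative only and the tuning constants are Linux-specific.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Enable TCP keepalive so the kernel probes idle connections.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning: first probe after 10s idle, then every 2s,
# declare the peer dead after 10 unanswered probes.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 2)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 10)

# TCP user timeout (RFC 5482, Linux): fail writes whose data stays
# unacknowledged for longer than this many milliseconds.
if hasattr(socket, "TCP_USER_TIMEOUT"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 30_000)
```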

I am aware of situations where firewall rules close idle connections regardless of the TCP keepalive if only empty keepalive packets are sent (see #2524 / #2907 / application-level keep-alive).

Therefore, I would like to answer the following questions:

  1. Is an unexpectedly closed TCP connection always an indication of a dead remote?
  2. If so, is this guarantee true for our entire python stack, starting from the python socket, up to the distributed Comm?
  3. If so, is this also true for short-term network outages, e.g. in a cloud environment with a very brief outage (say, a broken switch) such that outage << timeout?
  4. Is it possible for the GIL to lock up an asyncio Python server in such a way that it appears dead to the remote even though it isn't, given that TCP keepalive is turned on? (See the sketch after this list.)
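Regarding question 4, here is a hedged, self-contained toy (not distributed code) showing why the TCP layer alone cannot distinguish a blocked process from a healthy one: the kernel ACKs data and answers keepalive probes even while the application never runs.

```python
import socket
import threading
import time

def starved_server(listener: socket.socket) -> None:
    conn, _ = listener.accept()
    time.sleep(30)   # simulate an event loop starved by the GIL / heavy work
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=starved_server, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname())
client.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
time.sleep(1)

# This send succeeds: the peer's *kernel* ACKs and buffers the bytes even
# though the peer application never calls recv().  Only an application-level
# heartbeat or timeout reveals that the peer is unresponsive.
client.sendall(b"ping")
print("send succeeded; at the TCP level the peer still looks alive")
```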

Answers to these questions will have a major impact on ongoing and future tickets. Below are a few references.

If all of the above is answered with "TCP is reliable enough, we should not worry", we might have to investigate whether it is something we do ourselves that makes our connections unreliable, but before diving into this I would like to get an answer about our network infrastructure.
E.g.

cc @crusaderky , @gjoseph92 , @graingert, @jcrist

@jcrist
Member

jcrist commented Jan 21, 2022

Is an unexpectedly closed TCP connection always an indication of a dead remote?

Or a network failure somewhere along the line (which do happen often enough that we'll want to be robust to them). This is why things like GRPC implement reconnect within the channel object (note that that reconnect isn't transparent, a network failure can still show up at the application layer during active RPCs).

Note that the inverse (a dead remote will always show up as a closed TCP connection) is definitely false - in some setups an idle connection can close on one side without the other side detecting it immediately - this is one reason why some active application-level heartbeating is necessary.
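To make that last point concrete, below is a hedged sketch of application-level heartbeating written against plain asyncio streams rather than distributed's Comm API; the interval, timeout, and b"ping"/b"pong" framing are made up for illustration.

```python
import asyncio

async def heartbeat(reader: asyncio.StreamReader,
                    writer: asyncio.StreamWriter,
                    interval: float = 5.0,
                    timeout: float = 10.0) -> None:
    """Declare the peer dead if it stops answering pings, even when the
    TCP connection itself never reports an error."""
    while True:
        writer.write(b"ping\n")
        await writer.drain()
        try:
            reply = await asyncio.wait_for(reader.readline(), timeout)
        except asyncio.TimeoutError:
            raise ConnectionError("peer missed a heartbeat; treating it as dead")
        if reply != b"pong\n":
            raise ConnectionError("unexpected heartbeat reply")
        await asyncio.sleep(interval)
```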

If so, is this guarantee true for our entire python stack, starting from the python socket, up to the distributed Comm?

TCP connection failures like this are detected at the OS level (you can even pause the process and keep the connection open). Python sockets are a pretty thin wrapper around the system socket, so the only real source of accidental connection closures would be at the distributed Comm level or above. I believe all errors in the comm classes themselves are handled and reported correctly, but I'm not sure about our RPC layer.
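As a small illustration of how little the Python layer adds here, this is roughly how a remote close surfaces through the socket wrapper (a sketch, not distributed code): a clean shutdown shows up as an empty read, an abortive close as an exception.

```python
import socket

def read_or_report(sock: socket.socket) -> None:
    try:
        data = sock.recv(4096)
    except ConnectionResetError:
        print("peer aborted the connection (RST)")
        return
    if not data:
        print("peer closed the connection cleanly (FIN)")
    else:
        print(f"received {len(data)} bytes")
```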

Is it possible for the GIL to lock up an asyncio Python server in such a way that it appears dead to the remote even though it isn't, given that TCP keepalive is turned on?

TCP keepalives are at the OS layer, not the application, so from the TCP level the connection will appear fine. However, it is possible that a sufficiently overloaded python process may hit application level timeouts in the comms for connect/respond/whatever in which case it may appear "dead" (depending on how the application level protocol is handled).

@fjetter
Member Author

fjetter commented Jan 21, 2022

Or a network failure somewhere along the line

Yes, that's what I had in mind when asking question 3. In a discussion we had yesterday, there was some uncertainty around whether something like this can actually happen.

This is why things like GRPC implement reconnect within the channel object (note that that reconnect isn't transparent, a network failure can still show up at the application layer during active RPCs).

It does sound like we should look into something similar. Do you have an opinion there? I would imagine that if our Comm objects (or the RPC layer, or whatever) dealt with this, we would need a lot less application code.
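For illustration, a hedged sketch of what "reconnect inside the channel" could look like over plain asyncio streams; this is not distributed's Comm or rpc API, and the class name, retry count, and backoff values are invented.

```python
import asyncio

class ReconnectingConnection:
    """Hypothetical wrapper that reconnects transparently between calls,
    similar in spirit to a GRPC channel."""

    def __init__(self, host: str, port: int, max_retries: int = 5):
        self.host, self.port, self.max_retries = host, port, max_retries
        self._reader = None
        self._writer = None

    async def _ensure_connected(self) -> None:
        if self._writer is not None and not self._writer.is_closing():
            return
        delay = 0.1
        for _ in range(self.max_retries):
            try:
                self._reader, self._writer = await asyncio.open_connection(
                    self.host, self.port
                )
                return
            except OSError:
                await asyncio.sleep(delay)
                delay = min(delay * 2, 5.0)  # exponential backoff, capped
        raise ConnectionError("could not (re)connect; remote presumed dead")

    async def send(self, payload: bytes) -> None:
        # As with GRPC, an in-flight write can still fail and that error is
        # surfaced to the caller; only the *next* call reconnects transparently.
        await self._ensure_connected()
        self._writer.write(payload)
        await self._writer.drain()
```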

Note that the inverse (a dead remote will always show up as a closed TCP connection) is definitely false - in some setups an idle connection can close on one side without the other side detecting it immediately -

Isn't this what TCP keepalive should catch without another application-level timeout? The keepalive mechanism sends (empty) probe packets when the connection has been idle for a certain amount of time. If, after N probes, no packet was acknowledged, the connection is declared dead.
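As a quick sanity check of that mechanism: the worst-case time to detect a dead peer on an otherwise idle connection is roughly the idle threshold plus the probe interval times the probe count. The values below are illustrative, not distributed's defaults (the Linux kernel defaults are 7200 s / 75 s / 9 probes).

```python
keepalive_idle = 10      # seconds of idleness before the first probe (TCP_KEEPIDLE)
keepalive_interval = 2   # seconds between probes (TCP_KEEPINTVL)
keepalive_probes = 10    # unanswered probes before the kernel gives up (TCP_KEEPCNT)

worst_case = keepalive_idle + keepalive_interval * keepalive_probes
print(f"an unreachable peer is declared dead after at most ~{worst_case}s")  # 30s here
```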

TCP keepalives are at the OS layer, not the application, so from the TCP level the connection will appear fine. However, it is possible that a sufficiently overloaded python process may hit application level timeouts in the comms for connect/respond/whatever in which case it may appear "dead" (depending on how the application level protocol is handled).

Right, I think we're handling these application-level timeouts properly. At least I am not aware of any TimeoutErrors that are interpreted as a CommClosedError.
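For reference, the handling being described boils down to keeping the two failure modes separate. A hedged sketch, where read_message is a hypothetical coroutine standing in for a Comm read:

```python
import asyncio

async def request(read_message, timeout: float = 30.0):
    try:
        return await asyncio.wait_for(read_message(), timeout)
    except asyncio.TimeoutError:
        # The peer may just be slow or overloaded (e.g. GIL-bound); this is
        # not evidence that it died, so don't report it as a closed comm.
        raise
    except (ConnectionResetError, BrokenPipeError, EOFError):
        # The transport actually failed; only here should the caller treat
        # the remote as gone (in distributed this is where CommClosedError
        # would be raised).
        raise
```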

This is why things like GRPC implement reconnect within the channel object (note that that reconnect isn't transparent, a network failure can still show up at the application layer during active RPCs).

It sounds like we need something like this as well?

@fjetter fjetter added the discussion Discussing a topic with no specific actions yet label Mar 23, 2022
@fjetter fjetter closed this as completed Mar 23, 2022
@mrocklin
Member

Historically we used to wait to see whether a worker would reconnect. What I found was that this introduced enough complexity that it made more sense to just assume that CommClosedError implied a dead worker. If the worker shows up a second later then great! We'll treat it as a new worker (or possibly a new worker that has some existing data).

This wasn't entirely true, but was safe, and resulted in a class of consistency errors going away.

Are there advantages to trying to keep things alive? The only advantage I see is that things might run a little faster in that case. If this is the only advantage then I'd suggest that we just let broken comms imply dead workers, and deal with the slowdown.
