
Conditions under which a TCP connection may fail / close? #5678

Closed
fjetter opened this issue Jan 21, 2022 · 3 comments
Labels
discussion Discussing a topic with no specific actions yet

Comments

@fjetter
Member

fjetter commented Jan 21, 2022

In a recent discussion around reconnecting clients, the question was raised of how reliable a TCP connection is and how reliably an unexpectedly closed connection can be interpreted as a dead remote.

In particular, the question is whether we need to (or even should) implement any stateful reconnect logic at all, or whether the network layer is reliable enough for us not to.

This question assumes that TCP keepalive (distributed config, linux docs) is configured and that the TCP User Timeout is sufficiently large that increased latencies, etc., can be effectively ignored.
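For concreteness, here is a minimal sketch (not distributed's actual code) of the socket options this assumption refers to. The numeric values are illustrative only and the tuning constants are Linux-specific.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Enable TCP keepalive so the kernel probes idle connections.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning: first probe after 10s idle, then every 2s,
# declare the peer dead after 10 unanswered probes.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 2)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 10)

# TCP user timeout (RFC 5482, Linux): fail writes whose data stays
# unacknowledged for longer than this many milliseconds.
if hasattr(socket, "TCP_USER_TIMEOUT"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 30_000)
```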

I am aware of situations where firewall rules close idle connections regardless of the TCP keepalive if only empty keepalive packets are sent (see #2524 / #2907 / application-level keep-alive).

Therefore, I would like to answer the following questions:

  1. Is an unexpectedly closed TCP connection always an indication of a dead remote?
  2. If so, is this guarantee true for our entire python stack, starting from the python socket, up to the distributed Comm?
  3. If so, is this also true for short-term network outages, e.g. in a cloud environment with a very brief outage (say, a broken switch) such that outage << timeout?
  4. Is it possible for the GIL to lock up an asyncio Python server in such a way that it appears dead to the remote even though it isn't, given that TCP keepalive is turned on? (See the sketch after this list.)
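Regarding question 4, here is a hedged, self-contained toy (not distributed code) showing why the TCP layer alone cannot distinguish a blocked process from a healthy one: the kernel ACKs data and answers keepalive probes even while the application never runs.

```python
import socket
import threading
import time

def starved_server(listener: socket.socket) -> None:
    conn, _ = listener.accept()
    time.sleep(30)   # simulate an event loop starved by the GIL / heavy work
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=starved_server, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname())
client.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
time.sleep(1)

# This send succeeds: the peer's *kernel* ACKs and buffers the bytes even
# though the peer application never calls recv().  Only an application-level
# heartbeat or timeout reveals that the peer is unresponsive.
client.sendall(b"ping")
print("send succeeded; at the TCP level the peer still looks alive")
```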

Answers to these questions will have a major impact on ongoing and future tickets. Below are a few references.

If all of the above is answered with "TCP is reliable enough, we should not worry", we might have to investigate whether it is something we do ourselves that makes our connections unreliable, but before diving into this I would like to get an answer about our network infrastructure.
E.g.

cc @crusaderky , @gjoseph92 , @graingert, @jcrist

@jcrist
Member

jcrist commented Jan 21, 2022

Is an unexpectedly closed TCP connection always an indication of a dead remote?

Or a network failure somewhere along the line (which do happen often enough that we'll want to be robust to them). This is why things like GRPC implement reconnect within the channel object (note that that reconnect isn't transparent, a network failure can still show up at the application layer during active RPCs).

Note that the inverse (a dead remote will always show up as a closed TCP connection) is definitely false - in some setups an idle connection can close on one side without the other side detecting it immediately - this is one reason why some active application-level heartbeating is necessary.
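To make that last point concrete, below is a hedged sketch of application-level heartbeating written against plain asyncio streams rather than distributed's Comm API; the interval, timeout, and b"ping"/b"pong" framing are made up for illustration.

```python
import asyncio

async def heartbeat(reader: asyncio.StreamReader,
                    writer: asyncio.StreamWriter,
                    interval: float = 5.0,
                    timeout: float = 10.0) -> None:
    """Declare the peer dead if it stops answering pings, even when the
    TCP connection itself never reports an error."""
    while True:
        writer.write(b"ping\n")
        await writer.drain()
        try:
            reply = await asyncio.wait_for(reader.readline(), timeout)
        except asyncio.TimeoutError:
            raise ConnectionError("peer missed a heartbeat; treating it as dead")
        if reply != b"pong\n":
            raise ConnectionError("unexpected heartbeat reply")
        await asyncio.sleep(interval)
```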

If so, is this guarantee true for our entire python stack, starting from the python socket, up to the distributed Comm?

TCP connection failures like this are detected at the OS level (you can even pause the process and keep the connection open). Python sockets are a pretty thin wrapper around the system socket, so the only real source of accidental connection closures would be at the distributed Comm level or above. I believe all errors in the comm classes themselves are handled and reported correctly, but I'm not sure about our RPC layer.
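As a small illustration of how little the Python layer adds here, this is roughly how a remote close surfaces through the socket wrapper (a sketch, not distributed code): a clean shutdown shows up as an empty read, an abortive close as an exception.

```python
import socket

def read_or_report(sock: socket.socket) -> None:
    try:
        data = sock.recv(4096)
    except ConnectionResetError:
        print("peer aborted the connection (RST)")
        return
    if not data:
        print("peer closed the connection cleanly (FIN)")
    else:
        print(f"received {len(data)} bytes")
```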

Is it possible for the GIL to lock up an asyncio Python server in such a way that it appears dead to the remote even though it isn't, given that TCP keepalive is turned on?

TCP keepalives are at the OS layer, not the application, so from the TCP level the connection will appear fine. However, it is possible that a sufficiently overloaded python process may hit application level timeouts in the comms for connect/respond/whatever in which case it may appear "dead" (depending on how the application level protocol is handled).

@fjetter
Member Author

fjetter commented Jan 21, 2022

Or a network failure somewhere along the line

Yes, that's what I had in mind when asking question 3. In a discussion we had yesterday, there was some uncertainty around whether something like this can actually happen.

This is why things like GRPC implement reconnect within the channel object (note that that reconnect isn't transparent, a network failure can still show up at the application layer during active RPCs).

It does sound like we should look into something similar. Do you have an opinion there? I would imagine that if our Comm objects (or the RPC layer, or whatever) dealt with this, we would need a lot less application code.
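For illustration, a hedged sketch of what "reconnect inside the channel" could look like over plain asyncio streams; this is not distributed's Comm or rpc API, and the class name, retry count, and backoff values are invented.

```python
import asyncio

class ReconnectingConnection:
    """Hypothetical wrapper that reconnects transparently between calls,
    similar in spirit to a GRPC channel."""

    def __init__(self, host: str, port: int, max_retries: int = 5):
        self.host, self.port, self.max_retries = host, port, max_retries
        self._reader = None
        self._writer = None

    async def _ensure_connected(self) -> None:
        if self._writer is not None and not self._writer.is_closing():
            return
        delay = 0.1
        for _ in range(self.max_retries):
            try:
                self._reader, self._writer = await asyncio.open_connection(
                    self.host, self.port
                )
                return
            except OSError:
                await asyncio.sleep(delay)
                delay = min(delay * 2, 5.0)  # exponential backoff, capped
        raise ConnectionError("could not (re)connect; remote presumed dead")

    async def send(self, payload: bytes) -> None:
        # As with GRPC, an in-flight write can still fail and that error is
        # surfaced to the caller; only the *next* call reconnects transparently.
        await self._ensure_connected()
        self._writer.write(payload)
        await self._writer.drain()
```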

Note that the inverse (a dead remote will always show up as a closed TCP connection) is definitely false - in some setups an idle connection can close on one side without the other side detecting it immediately -

Isn't this what TCP keepalive should catch without another application-level timeout? The keepalive mechanism sends (empty) probe packets when the connection has been idle for a certain amount of time. If, after N probes, no packet was acknowledged, the connection is declared dead.
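As a quick sanity check of that mechanism: the worst-case time to detect a dead peer on an otherwise idle connection is roughly the idle threshold plus the probe interval times the probe count. The values below are illustrative, not distributed's defaults (the Linux kernel defaults are 7200 s / 75 s / 9 probes).

```python
keepalive_idle = 10      # seconds of idleness before the first probe (TCP_KEEPIDLE)
keepalive_interval = 2   # seconds between probes (TCP_KEEPINTVL)
keepalive_probes = 10    # unanswered probes before the kernel gives up (TCP_KEEPCNT)

worst_case = keepalive_idle + keepalive_interval * keepalive_probes
print(f"an unreachable peer is declared dead after at most ~{worst_case}s")  # 30s here
```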

TCP keepalives are at the OS layer, not the application, so from the TCP level the connection will appear fine. However, it is possible that a sufficiently overloaded python process may hit application level timeouts in the comms for connect/respond/whatever in which case it may appear "dead" (depending on how the application level protocol is handled).

Right, I think we're handling these application-level timeouts properly. At least I am not aware of any TimeoutErrors that are interpreted as a CommClosedError.
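For reference, the handling being described boils down to keeping the two failure modes separate. A hedged sketch, where read_message is a hypothetical coroutine standing in for a Comm read:

```python
import asyncio

async def request(read_message, timeout: float = 30.0):
    try:
        return await asyncio.wait_for(read_message(), timeout)
    except asyncio.TimeoutError:
        # The peer may just be slow or overloaded (e.g. GIL-bound); this is
        # not evidence that it died, so don't report it as a closed comm.
        raise
    except (ConnectionResetError, BrokenPipeError, EOFError):
        # The transport actually failed; only here should the caller treat
        # the remote as gone (in distributed this is where CommClosedError
        # would be raised).
        raise
```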

This is why things like GRPC implement reconnect within the channel object (note that that reconnect isn't transparent, a network failure can still show up at the application layer during active RPCs).

It sounds like we need something like this as well?

@fjetter fjetter added the discussion Discussing a topic with no specific actions yet label Mar 23, 2022
@fjetter fjetter closed this as completed Mar 23, 2022
@mrocklin
Member

Historically we used to wait to see whether a worker would reconnect. What I found was that this introduced enough complexity that it made more sense to just assume that CommClosedError implied a dead worker. If the worker shows up a second later then great! We'll treat it as a new worker (or possibly a new worker that has some existing data).

This wasn't entirely true, but was safe, and resulted in a class of consistency errors going away.

Are there advantages to trying to keep things alive? The only advantage I see is that things might run a little faster in that case. If this is the only advantage then I'd suggest that we just let broken comms imply dead workers, and deal with the slowdown.
