
Peer closes connection before other peer has read all the data #5060

Open
ShahakShama opened this issue Jan 7, 2024 · 3 comments

@ShahakShama

Summary

Stream's poll_close function can return Ready on peer A before peer B has read all the data. If the connection on peer A then closes because it has no open streams, peer B never gets the event it should have received after reading the data (which event depends on the behaviour in use) and gets a ConnectionClosed event instead.

Note that this only occurs when sending a lot of data.

I've created an executable to reproduce this in https://github.com/ShahakShama/libp2p_bug_example.
The executable implements the Request Response protocol and when writing a response it writes a lot of garbage bytes and when reading a response it reads a lot of garbage bytes.
In order to run this executable, in one terminal run the command
cargo run --release -- -l /ip4/127.0.0.1/tcp/11111 -m 1000000 -t 10
and in another terminal run
cargo run --release -- -l /ip4/127.0.0.1/tcp/22222 -d /ip4/127.0.0.1/tcp/11111 -m 1000000 -s -t 10
You might need to increase -m depending on your hardware. Note that -m should be identical between both processes.

Expected behavior

I know that setting a longer timeout using with_idle_connection_timeout fixes the issue, but I don't think it's intended that data gets lost if you don't set this timeout to a large enough value (if I'm wrong, I'd love to hear an explanation why).

IMO the connection should stay alive until the other peer has received all the data we've sent to it (e.g. in TCP, until we've got an ACK for all the data we've sent).

Actual behavior

The behaviour I'm seeing is that once our peer has sent all the data, it may close the connection before the data has reached the other peer.

Relevant log output

No response

Possible Solution

I don't know the inner implementation of libp2p well enough, but I think there are a few areas in which the code could change:

  1. In Stream::poll_close, return Ready only once the other peer has received all the data (as in the TCP example above).
  2. Change the connection_keep_alive logic to check whether the other peer has received all the data.
  3. Add functionality to Stream for checking whether the other peer has received all the data. Then the behaviours that handle big messages can check whether the stream can be safely dropped before dropping it (this would require changing the Request Response behaviour alongside other behaviours).

Version

0.53.2

Would you like to work on fixing this bug?

Yes

@thomaseizinger
Contributor

Which transport are you using? Does this issue occur with QUIC? On TCP, we are layering many technologies together, like yamux on top of noise etc. It may be that "flushing" isn't correctly forwarded for some of them.

Note that we cannot directly achieve what you are suggesting because it is a two-generals problem: The message that we have read all the data may get lost, or more generally: The last ACK can always get lost.

The only thing we can, in theory, ensure is that all data has been flushed to the network, and it may be that this isn't true for a composed TCP stack.

This might be related in case you are using yamux: rust-lang/futures-rs#2746.

Your best chance is to use QUIC, as there are fewer layers at work there.

@ShahakShama
Author

I indeed used TCP. I swapped to QUIC and now it can process about 10 times more messages before this issue returns.
So raising the timeout and using QUIC are the only possible solutions for this issue? If yes, I think it should be documented somewhere that the connection may be closed before all the messages were flushed. Maybe here

@thomaseizinger
Contributor

> I indeed used TCP. I swapped to QUIC and now it can process about 10 times more messages before this issue returns. So raising the timeout and using QUIC are the only possible solutions for this issue? If yes, I think it should be documented somewhere that the connection may be closed before all the messages were flushed. Maybe here

I don't think it is right to describe bugs in the documentation :)

The issue is a combination of several complicated aspects:

  • As I documented above, it is a fundamental distributed systems problem that you can never be fully sure that the other party has received your message. See https://en.wikipedia.org/wiki/Two_Generals%27_Problem.
  • Because of the above, the request-response abstraction cannot know when the response was received.
  • There may actually be bugs in our forwarding of flushing calls.

I'd suggest one or more of the following:

  • A non-zero default for connection timeout can generally be useful (depends on your application). Also see Set a better default for idle_connection_timeout #4912.
  • Re-design your application protocols such that the essential information is in the request, then receipt of the response acts as an ACK that the request was received.
  • If receipt of the response needs to be acknowledged, don't use request-response but design a three or more message protocol. feat: introduce libp2p-stream #5027 may be useful for that.
