Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

protocols/identify: Fix race condition in discover_peer_after_disconnect #2744

Merged
merged 2 commits into from Jul 4, 2022

Conversation

mxinden
Copy link
Member

@mxinden mxinden commented Jul 2, 2022

Description

Summary of the plot of the discover_peer_after_disconnect test:

  1. swarm2 connects to swarm1.
  2. swarm2 requests an identify response from swarm1.
  3. swarm1 sends the response to swarm2.
  4. swarm2 disconnects from swarm1.
  5. swarm2 tries to disconnect.

Problem

libp2p-identify sets KeepAlive::No when it identified the remote. Thus swarm1 might
identify swarm2beforeswarm2identifiedswarm1. swarm1then setsKeepAlive::Noand thus closes the connection toswarm2beforeswarm2identifiedswarm1. In such case the unit test discover_peer_after_disconnect hangs indefinitely.

Solution

Add an initial delay to swarm1 requesting an identification from swarm2, thus ensuring swarm2
is always able to identify swarm1.

Links to any relevant issues

Fixes #2743

Open Questions

We could as well introduce libp2p-ping and set PingConfig::with_keep_alive. I think the fix here is sufficient though.

As another alternative, instead of setting KeepAlive::No once we identified the remote, we could as well set KeepAlive::Until(Instant::now() + Duration::from_secs(10)), thus giving the remote a chance to identify us.

diff --git a/protocols/identify/src/handler.rs b/protocols/identify/src/handler.rs
index af4d7f8c..10566847 100644
--- a/protocols/identify/src/handler.rs
+++ b/protocols/identify/src/handler.rs
@@ -168,7 +169,7 @@ impl ConnectionHandler for IdentifyHandler {
                 self.events.push(ConnectionHandlerEvent::Custom(
                     IdentifyHandlerEvent::Identified(remote_info),
                 ));
-                self.keep_alive = KeepAlive::No;
+                self.keep_alive = KeepAlive::Until(Instant::now() + Duration::from_secs(10));
             }
             EitherOutput::Second(()) => self.events.push(ConnectionHandlerEvent::Custom(
                 IdentifyHandlerEvent::IdentificationPushed,

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

**Summary** of the plot of the `discover_peer_after_disconnect` test:

1. `swarm2` connects to `swarm1`.
2. `swarm2` requests an identify response from `swarm1`.
3. `swarm1` sends the response to `swarm2`.
4. `swarm2` disconnects from `swarm1`.
5. `swarm2` tries to disconnect.

**Problem**

`libp2p-identify` sets `KeepAlive::No` when it identified the remote. Thus `swarm1` might
identify` `swarm2` before `swarm2` identified `swarm1`. `swarm1` then sets `KeepAlive::No` and thus closes the
connection to `swarm2` before `swarm2` identified `swarm1`. In such case the unit test
`discover_peer_after_disconnect hangs indefinitely.

**Solution**

Add an initial delay to `swarm1` requesting an identification from `swarm2`, thus ensuring `swarm2`
is always able to identify `swarm1`.
@mxinden mxinden requested a review from elenaf9 July 2, 2022 08:23
@elenaf9
Copy link
Contributor

elenaf9 commented Jul 2, 2022

Thanks @mxinden!

As another alternative, instead of setting KeepAlive::No once we identified the remote, we could as well set KeepAlive::Until(Instant::now() + Duration::from_secs(10)), thus giving the remote a chance to identify us.

Another idea:
If I understand it correctly, when two peers connect they both initiate their outbound substream request after the initial delay, which is usually the same for both peers. The current problem is just that the negotiation of one direction resolves already while the other one is still ongoing, and then the connection is closed, right?
What do you think about instead of having our keep_alive property we could have two flags is_inbound_ongoing and is_outbound_pending, and then implement

fn connection_keep_alive(&self) -> KeepAlive {
    if self.is_outbound_ongoing || self.is_inbound_ongoing {
        KeepAlive::Yes
    } else {
       KeepAlive::No
   }
}

(is_outbound_ongoing would have to be set to true initially so that we don't close the connection in the initial delay.)

That way in the above case when swarm1 identified the remote, it would still wait for their own inbound request to resolve before closing.
Note that this does not fix it if the initial delay is different, or if one side resolves before the other side even got the chance to init the outbound subtream request. Maybe not a problem in real-world systems where the negotiation process takes much longer, but for our tests it might make sense to still additionally add your above fix within the test.

Edit: I just noticed that my above idea does not make sense, so please ignore.
I don't think we need the 10s delay outside of our test case. In real-world system we won't ever run into this issue, right? Assuming that both peers usually initiate their outbound substream request at the same time, one side will always first send out their own outbound substream request before receiving the inbound one and responding to it. So it is quite unlikely that one peer will receive the response to their outbound request, before receiving the remote's request, right?

@mxinden
Copy link
Member Author

mxinden commented Jul 4, 2022

Edit: I just noticed that my above idea does not make sense, so please ignore.
I don't think we need the 10s delay outside of our test case. In real-world system we won't ever run into this issue, right? Assuming that both peers usually initiate their outbound substream request at the same time, one side will always first send out their own outbound substream request before receiving the inbound one and responding to it. So it is quite unlikely that one peer will receive the response to their outbound request, before receiving the remote's request, right?

I am assuming the same, and I have not seen an indication for this being a problem in the wild.

In case we ever do come across this, we can still introduce the KeepAlive::Until(10_SECONDS).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

protocols/identify: Flaky discover_peer_after_disconnect
2 participants