protocols/identify: Fix race condition in discover_peer_after_disconnect #2744

mxinden · 2022-07-02T08:23:54Z

Description

Summary of the plot of the discover_peer_after_disconnect test:

swarm2 connects to swarm1.
swarm2 requests an identify response from swarm1.
swarm1 sends the response to swarm2.
swarm2 disconnects from swarm1.
swarm2 tries to disconnect.

Problem

libp2p-identify sets KeepAlive::No when it identified the remote. Thus swarm1 might
identify swarm2beforeswarm2identifiedswarm1. swarm1then setsKeepAlive::Noand thus closes the connection toswarm2beforeswarm2identifiedswarm1. In such case the unit test discover_peer_after_disconnect hangs indefinitely.

Solution

Add an initial delay to swarm1 requesting an identification from swarm2, thus ensuring swarm2
is always able to identify swarm1.

Links to any relevant issues

Fixes #2743

Open Questions

We could as well introduce libp2p-ping and set PingConfig::with_keep_alive. I think the fix here is sufficient though.

As another alternative, instead of setting KeepAlive::No once we identified the remote, we could as well set KeepAlive::Until(Instant::now() + Duration::from_secs(10)), thus giving the remote a chance to identify us.

diff --git a/protocols/identify/src/handler.rs b/protocols/identify/src/handler.rs
index af4d7f8c..10566847 100644
--- a/protocols/identify/src/handler.rs
+++ b/protocols/identify/src/handler.rs
@@ -168,7 +169,7 @@ impl ConnectionHandler for IdentifyHandler {
                 self.events.push(ConnectionHandlerEvent::Custom(
                     IdentifyHandlerEvent::Identified(remote_info),
                 ));
-                self.keep_alive = KeepAlive::No;
+                self.keep_alive = KeepAlive::Until(Instant::now() + Duration::from_secs(10));
             }
             EitherOutput::Second(()) => self.events.push(ConnectionHandlerEvent::Custom(
                 IdentifyHandlerEvent::IdentificationPushed,

Change checklist

I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
A changelog entry has been made in the appropriate crates

**Summary** of the plot of the `discover_peer_after_disconnect` test: 1. `swarm2` connects to `swarm1`. 2. `swarm2` requests an identify response from `swarm1`. 3. `swarm1` sends the response to `swarm2`. 4. `swarm2` disconnects from `swarm1`. 5. `swarm2` tries to disconnect. **Problem** `libp2p-identify` sets `KeepAlive::No` when it identified the remote. Thus `swarm1` might identify` `swarm2` before `swarm2` identified `swarm1`. `swarm1` then sets `KeepAlive::No` and thus closes the connection to `swarm2` before `swarm2` identified `swarm1`. In such case the unit test `discover_peer_after_disconnect hangs indefinitely. **Solution** Add an initial delay to `swarm1` requesting an identification from `swarm2`, thus ensuring `swarm2` is always able to identify `swarm1`.

elenaf9 · 2022-07-02T10:05:36Z

Thanks @mxinden!

As another alternative, instead of setting KeepAlive::No once we identified the remote, we could as well set KeepAlive::Until(Instant::now() + Duration::from_secs(10)), thus giving the remote a chance to identify us.

Another idea:
If I understand it correctly, when two peers connect they both initiate their outbound substream request after the initial delay, which is usually the same for both peers. The current problem is just that the negotiation of one direction resolves already while the other one is still ongoing, and then the connection is closed, right?
What do you think about instead of having our keep_alive property we could have two flags is_inbound_ongoing and is_outbound_pending, and then implement

fn connection_keep_alive(&self) -> KeepAlive { if self.is_outbound_ongoing || self.is_inbound_ongoing { KeepAlive::Yes } else { KeepAlive::No } }

(is_outbound_ongoing would have to be set to true initially so that we don't close the connection in the initial delay.)

That way in the above case when swarm1 identified the remote, it would still wait for their own inbound request to resolve before closing.
Note that this does not fix it if the initial delay is different, or if one side resolves before the other side even got the chance to init the outbound subtream request. Maybe not a problem in real-world systems where the negotiation process takes much longer, but for our tests it might make sense to still additionally add your above fix within the test.

Edit: I just noticed that my above idea does not make sense, so please ignore.
I don't think we need the 10s delay outside of our test case. In real-world system we won't ever run into this issue, right? Assuming that both peers usually initiate their outbound substream request at the same time, one side will always first send out their own outbound substream request before receiving the inbound one and responding to it. So it is quite unlikely that one peer will receive the response to their outbound request, before receiving the remote's request, right?

mxinden · 2022-07-04T01:41:24Z

Edit: I just noticed that my above idea does not make sense, so please ignore.
I don't think we need the 10s delay outside of our test case. In real-world system we won't ever run into this issue, right? Assuming that both peers usually initiate their outbound substream request at the same time, one side will always first send out their own outbound substream request before receiving the inbound one and responding to it. So it is quite unlikely that one peer will receive the response to their outbound request, before receiving the remote's request, right?

I am assuming the same, and I have not seen an indication for this being a problem in the wild.

In case we ever do come across this, we can still introduce the KeepAlive::Until(10_SECONDS).

mxinden requested a review from elenaf9 July 2, 2022 08:23

elenaf9 approved these changes Jul 2, 2022

View reviewed changes

Merge branch 'master' into identify-flaky

116cdab

mxinden merged commit b28cdb3 into libp2p:master Jul 4, 2022

DieracDelta mentioned this pull request Jul 6, 2022

chore(deps): bump libp2p from 0.45.1 to 0.46.1 EspressoSystems/HotShot#278

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

protocols/identify: Fix race condition in discover_peer_after_disconnect #2744

protocols/identify: Fix race condition in discover_peer_after_disconnect #2744

mxinden commented Jul 2, 2022 •

edited

elenaf9 commented Jul 2, 2022 •

edited

mxinden commented Jul 4, 2022

protocols/identify: Fix race condition in discover_peer_after_disconnect #2744

protocols/identify: Fix race condition in discover_peer_after_disconnect #2744

Conversation

mxinden commented Jul 2, 2022 • edited

Description

Links to any relevant issues

Open Questions

Change checklist

elenaf9 commented Jul 2, 2022 • edited

mxinden commented Jul 4, 2022

mxinden commented Jul 2, 2022 •

edited

elenaf9 commented Jul 2, 2022 •

edited