Unstable connectivity and stream negotiation on test networks #1623

Closed
1 of 3 tasks
Wondertan opened this issue Jan 20, 2023 · 16 comments
Labels: area:p2p, bug (Something isn't working)

Comments

@Wondertan (Member) commented Jan 20, 2023

Context

We have two testing networks, Arabica and Mocha. Mocha has proven stable, while Arabica still has not. We have been fighting issues on the Arabica bootstrappers for about a month, and it's finally time to document the steps we took for our future selves.

Main issues

The instability mainly manifested as:

  • (1) failed to find any peer in table errors on node startup
  • (2) context.Deadline errors during Header Store initialization on node startup
  • (3) inability to sync headers when the previous two errors didn't appear.

Debug Sequence

Resource Manager

In the beginning, all the problems started with enabling the ResourceManager in the libp2p host. We presumed it was buggy on the version we were running (v0.20.1), given the libp2p discussions about broken auto-scaling in the resource manager, so we disabled it temporarily. Our logs were cluttered with unclear messages saying that streams were not established due to limits, and it wasn't clear where they were coming from until we updated our libp2p version, which included a fix making those logs less confusing.
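As a reference point, here is a minimal sketch of that temporary workaround, assuming the go-libp2p option names from the v0.20-v0.23 era; the surrounding constructor is hypothetical and not celestia-node's actual host setup:

```go
package p2p

import (
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
)

// newHostWithoutResourceManager builds a host with the resource manager
// swapped for a no-op implementation, so stream/connection limits can no
// longer reject inbound streams while we debug.
func newHostWithoutResourceManager(opts ...libp2p.Option) (host.Host, error) {
	opts = append(opts,
		// NullResourceManager accepts every resource reservation.
		libp2p.ResourceManager(&network.NullResourceManager{}),
	)
	return libp2p.New(opts...)
}
```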

Infra

Unfortunately, removing the resource manager and updating the libp2p version (with its less confusing resource manager errors) didn't resolve the instability, so we blamed the infrastructure, since Mocha was working fine. Still, we had doubts, as Mocha and Arabica were running different node versions (the diff was not too big, but still).

We presumed that Arabica, as a developer-focused network, should carry a higher load. @sysrex tried vertically scaling the hardware of the 3 bootstrappers to a massive size, which didn't help. Additionally, @sysrex analyzed resource usage on the machines, which was negligible. Finally, we compared the number of peers connected to bootstrappers on Mocha and Arabica: Mocha has 10x more peers (yes, we should have seen this earlier; our network observability is in a poor state at the moment) on humbler bootstrapper hardware, so the conclusion is that this is definitely not a hardware issue.

Software

Next, we brought both networks to full parity in setup and software version to see if something in the diff could explain it. However, Arabica was still unstable. Meanwhile, @Bidon15 and @Wondertan went into full debug mode and found that the reason behind (3) was a bug inside the new version of the header exchange, initially fixed by #1592 as an experiment and superseded by #1603.

Even with (3) fixed, we still have (1) and (2), which are very likely software issues as well and still need to be reproduced in our testing environments (Testground; yet to be reconfirmed on v0.6.2). Both of those issues come from the deeper internals of libp2p:

  • The (1) failed to find any peer in table error comes from the RoutedHost, which fails to reconnect after the initial connection failure.
  • The (2) context deadline error comes from deadlocked new stream creation, which is blocked on the identify protocol failing to negotiate with the remote side.

We think those two issues are interconnected and together describe the unstable connectivity. Version updates from v0.20.1 to v0.23.1 during debugging attempts could also have contributed. Additionally, we contacted the libp2p maintainers to support us in debugging this issue. It could be either a misconfiguration on our side or an actual bug inside libp2p (not yet found by other teams relying on it).

Back to Infra

Meanwhile, @Bidon15 also found two other issues that contributed to the above (we are still automating everything, and manual work is prone to human error):

  • Outdated images
  • One of the bootstrappers wasn't correctly connected to the app node, so it couldn't serve headers.

We also deployed an additional bootstrapper, Kaarina #1619

Current work

The current focus is on debugging libp2p and seeing whether version differences contribute to the issues we face, refining the bootstrapper infrastructure, and improving debuggability via log aggregation (by @sysrex).

Wondertan added the bug (Something isn't working) label Jan 20, 2023
@Wondertan (Member Author)

At this point, I could fully isolate my BN from libp2p peers and only connect it to the app node, with a local LN that is only connected to the BN. In this environment, I see a 50% chance of successful connectivity on each node start. Debugging further down the call stack.

@MSevey (Member) commented Jan 20, 2023

Mentioned in Slack, but copying here as well.

Let's add links to any relevant tests. If we haven't already, we should revisit these tests now that we have a more defined problem statement and see if there is a way to extend/expand them to replicate the issue.

@Bidon15 (Member) commented Jan 20, 2023

From v0.6.0 onwards (latest is v0.6.2), Testground tests have not been able to replicate the connectivity issue we are experiencing on the Arabica testnet.

I've added udp/quic-v1 in parallel with tcp to make the node config as close to prod as possible. PR: celestiaorg/test-infra#146

I checked by running big network tests that included:

  1. PFB from 1k Light nodes that are connected to 100 Bridge Nodes
  2. Block Reconstruction where we have 1412 Light Nodes connected to 10 Bridge Nodes and 2 Full Nodes

In both cases, no instability was observed in Testground k8s (otherwise the tests would obviously have failed miserably at startup).

@liamsi (Member) commented Jan 23, 2023

regarding:

(2) context.Deadline errors during Header Store initialization on node startup

What is that default deadline? Did we try to increase it? My suspicion: either it is too short and, depending on IO, might bubble up, or fx fires up several goroutines which sometimes lead to a deadlock-like scenario (which you also suspected above).

(3) inability to sync headers when the previous two errors didn't appear.

Is this box checked because it was fixed already?

@MSevey (Member) commented Jan 23, 2023

What unit or integration tests, local to the node repo, cover these bug areas?

@renaynay (Member)

I wanted to give some further context on the issue (and how it manifests for a user):

Case 1:
User connects to either network (mocha or arabica) with a fresh instance of a node (light or full) and receives a context deadline exceeded error reporting that the initStore hook failed, meaning the node was unable to fetch the block by the genesis hash from any of the hardcoded bootstrappers.

Case 2:
User connects to either network (mocha or arabica) with a node (light or full) that is NOT fresh (meaning it has already synced some blocks to its datastore) and receives a context deadline exceeded error reporting that the initStore hook failed. The error here is misleading because the failure actually occurs inside sync.Start, where the node is unable to fetch a new network head from any of the hardcoded bootstrappers.

Case 3:
User connects to either network (mocha or arabica) with a node (light or full), fresh or not, and fails to start due to no peers in routing table.

All three of the above cases occur even though we've queried the bootstrappers and they only have about 70-80 peers or so when this problem is observed IIRC.

@Wondertan (Member Author)

@liamsi

Is this box checked because it was fixed already?

Yes

what is that default deadline? Did we try to increase it? My suspicion: either this is too short and depending on IO might bubble up, or, fx fires up several go-routines which lead to a deadlock-like scenario sometimes (which you also suspected above).

10 seconds, but increasing it does not help. It can block for an hour.

@Wondertan (Member Author)

@MSevey

What unit or integration test, local to the node repo, cover these bug areas?

Every swamp test relies on successful connection and/or stream negotiation.

@MSevey (Member) commented Jan 24, 2023

@Wondertan

Every swamp test relies on successful connection and/or stream negotiation.

Yes, but are those connections made under similar conditions? Specifically, do we have a test with a cluster of nodes running that are connected and synced to some height, where a new node is then added and tries to sync and catch up?

@Wondertan (Member Author)

We have two tests with Swamp where a node is catching up, not with a cluster but with a single FN or BN. For such clustering we rely on Testground, which we can eventually get onto CI. We could add it to Swamp as well, but I am a bit skeptical that it will show us the root cause.

After we find the issue, we will definitely write a test to assert this issue never happens again.

@staheri14

It might be caused by the following libp2p issue, libp2p/go-libp2p#1987, in which the ConnectionGater on the bootstrap nodes refuses incoming connections in one of the methods implementing the ConnectionGater interface, specifically InterceptSecured(). This means the connection is actually refused by the server, yet the querying node assumes the connection is established, and its further queries time out, i.e., emit context deadline exceeded.
To verify whether this is the root cause, you could implement a wrapper around the libp2p BasicConnectionGater used in your code and additionally log the state of connections, as sketched below.
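For what it's worth, a minimal sketch of such a wrapper might look like the following; it assumes go-libp2p's connmgr.ConnectionGater interface and wraps whatever inner gater (e.g. conngater.BasicConnectionGater) is already in use, logging only refused InterceptSecured decisions:

```go
package p2p

import (
	"log"

	"github.com/libp2p/go-libp2p/core/connmgr"
	"github.com/libp2p/go-libp2p/core/control"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	ma "github.com/multiformats/go-multiaddr"
)

// loggingGater wraps an existing ConnectionGater and logs every refused
// InterceptSecured decision so silently dropped inbound connections show up.
type loggingGater struct {
	inner connmgr.ConnectionGater
}

func (g *loggingGater) InterceptPeerDial(p peer.ID) bool {
	return g.inner.InterceptPeerDial(p)
}

func (g *loggingGater) InterceptAddrDial(p peer.ID, addr ma.Multiaddr) bool {
	return g.inner.InterceptAddrDial(p, addr)
}

func (g *loggingGater) InterceptAccept(addrs network.ConnMultiaddrs) bool {
	return g.inner.InterceptAccept(addrs)
}

func (g *loggingGater) InterceptSecured(dir network.Direction, p peer.ID, addrs network.ConnMultiaddrs) bool {
	allow := g.inner.InterceptSecured(dir, p, addrs)
	if !allow {
		log.Printf("gater refused secured conn: dir=%v peer=%s remote=%s",
			dir, p, addrs.RemoteMultiaddr())
	}
	return allow
}

func (g *loggingGater) InterceptUpgraded(c network.Conn) (bool, control.DisconnectReason) {
	return g.inner.InterceptUpgraded(c)
}
```

The wrapper would then be passed to the host via libp2p.ConnectionGater in place of the bare gater.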

@Wondertan (Member Author) commented Jan 26, 2023

The core issue was with our discovery wrapper running on the Server, which has a loop that calls FindPeers and receives EvtPeerConnectednessChanged notifications.

However, the loop called FindPeers without any timeout on the context and blocked forever, preventing reads from the event subscription -> filling the subscription channel -> blocking event emission -> blocking the new connection handling logic. At that point, the connection is ready and only needs its start method to run to accept inbound streams, but it was never run.

All this explains why the Server could make an outbound stream to the client while the client could not. It also explains why this happens only when the Server has many connections (the subscription channel fills with events). Additionally, it probably explains why the issue happened only on the Arabica network: the network doesn't have enough full and bridge nodes, making FindPeers block forever and stalling the whole connection flow.

The fix is in #1639. However, tests show that it is not enough and doesn't entirely fix the connectivity, so we should continue our investigations. One of the ideas is to increase the buffer size of the subscription through options and to investigate other places where EvtConnectednessChanged is used (edit: implemented in the same fixing PR). A sketch of both changes is below.
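To make the two changes concrete, here is a hypothetical reduction of such a discovery loop, not the actual code in #1639: the FindPeers round gets a bounded context, and the EvtPeerConnectednessChanged subscription gets a larger buffer via eventbus.BufSize (the timeout and buffer values are illustrative):

```go
package discover

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/discovery"
	"github.com/libp2p/go-libp2p/core/event"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/p2p/host/eventbus"
)

// discoverLoop sketches a discovery loop that keeps draining connectedness
// events even while a FindPeers round is in flight.
func discoverLoop(ctx context.Context, h host.Host, disc discovery.Discoverer, ns string) error {
	// A larger buffer keeps a temporarily slow reader from filling the
	// channel and stalling event emission inside the swarm.
	sub, err := h.EventBus().Subscribe(
		new(event.EvtPeerConnectednessChanged),
		eventbus.BufSize(1024),
	)
	if err != nil {
		return err
	}
	defer sub.Close()

	for {
		// Bound each discovery round; the buggy loop passed an unbounded
		// context here, so FindPeers could block forever when too few
		// peers advertised the namespace.
		findCtx, cancel := context.WithTimeout(ctx, time.Minute)
		peers, err := disc.FindPeers(findCtx, ns)
		if err != nil {
			cancel()
			return err
		}

		for done := false; !done; {
			select {
			case p, ok := <-peers:
				if !ok {
					done = true // round finished or timed out
					break
				}
				_ = p // ... handle the discovered peer ...
			case e := <-sub.Out():
				_ = e // ... update peer state; must never block ...
			case <-ctx.Done():
				cancel()
				return ctx.Err()
			}
		}
		cancel()
	}
}
```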

@Wondertan (Member Author) commented Jan 28, 2023

So we were able to "stabilize" network connectivity via:

"Stabilize" because nodes are able to connect and initialize from a single start, but in quotes because we still observe that some bootstrappers are reluctant to accept connections.

@Wondertan (Member Author)

Also, as discussed with the libp2p maintainer, I've added a warning log for the case where events are not read out by the application, so that future developers can find this bug much quicker than we did.

@Wondertan (Member Author)

Note that issues (1) and (2) have the same root (the inability to finish a connection because of the block on IdentifyWait, as per #1623 (comment)) but manifest differently. The failed to find any peer in table error comes from the routed host here and happens when we fail to connect: the routed host then tries to find an address for the peer, but we are not connected at all and the routing table is empty, hence the error.

We may want to make a PR so that the routed host returns the original connection error instead, or both, making it clear that these are the same issue: the context deadline exceeded coming from IdentifyWait (a rough sketch of the "both" variant follows below).
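As an illustration of the "return both" idea (assuming Go 1.20+ for multi-error wrapping; the helper name is made up and not go-libp2p API):

```go
package routedhost

import (
	"errors"
	"fmt"
)

// joinConnectErrors shows how a patched routed host could surface both the
// routing-table failure and the original connect/identify error instead of
// hiding the latter.
func joinConnectErrors(routingErr, connectErr error) error {
	return errors.Join(
		fmt.Errorf("routing: failed to find peer in table: %w", routingErr),
		fmt.Errorf("original connect error: %w", connectErr),
	)
}
```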

Wondertan added a commit that referenced this issue Jan 31, 2023; renaynay pushed commits that referenced this issue Jan 31 and Feb 1, 2023. The commit message reads:

Anyway, we should have a retry mechanism for requests here (but long term, it should be better than this implementation), and also, this is helping with #1623 during the first initialization of a node.
renaynay added a commit that referenced this issue Feb 8, 2023
As #1684 resolves instability with our bootstrappers, we no longer need
to depend on @Wondertan's fork of libp2p

Related to #1623
renaynay added a commit that referenced this issue Sep 5, 2023
Provides `PIDStore` to header module so that it can be used in
`peerTracker` and replaces mem `peerstore.Peerstore` with on-disk
`peerstore.Peerstore` so that `peerTracker` can quickly bootstrap itself
with previously-seen peers and allow syncer to initialise its sync
target from tracked peers rather than trusted so long as it has a
subjective head within the trusting period.

Overrides #2133 

Closes #1851, mitigates issues resulting from #1623

Swamp integration tests to follow (tracked in #2506)

### Future note: 

This PR introduces a soon-to-be deprecated feature from libp2p (on-disk
peerstore). Once libp2p deprecates and removes this feature, the
PIDStore will have to become a PeerAddrStore such that it can save addr
info of good peers to disk instead of just their IDs.
@Wondertan (Member Author)

This was fixed a while ago. The issue was with a slow event bus reader that made the whole libp2p swarm stall.
