Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Libp2p Pubsub kicks all of my peers after upgrading from 0.8.0 to 0.9.3 #530

Open
Giulio2002 opened this issue Mar 27, 2023 · 14 comments
Open

Comments

@Giulio2002
Copy link

Copypasted from the go-libp2p issue

Hello, I am from the Erigon team, one of the Ethereum clients, I am trying to build a consensus client and I am having issues after upgrading libp2p-pubsub to 0.9.3. What seems to happen is that after the upgrade I lose all the peers I have. I found a work around that seems to somehow fix this as long as I do not do any request which is calling host.RemoveStreamHandler("/p2p/id/delta/1.0.0") . If I do that when I only subscribe and publish to gossip it seems to not give any issue, however if I repeatedly make requests to the network, I will get kicked out from the network (or at least lose all peers and be unable to reconnect).

I get kicked out after a while from the ETH 2.0 network after upgrading libp2p version.

I first posted this issue on the go-libp2p github but then I found out that disabling pubsub fixed the issue for seemingly no reason so now it is here. the original issue and discussion is at: libp2p/go-libp2p#2202 . @MarcoPolo

@vyzo
Copy link
Collaborator

vyzo commented Mar 27, 2023

Its a go-libp2p problem, not a pubsub bug.

@vyzo
Copy link
Collaborator

vyzo commented Mar 27, 2023

Hrm, it might be relates to the notifiee trying to open a stream?
Hard to tell what's going on here.

@Giulio2002
Copy link
Author

Giulio2002 commented Mar 27, 2023

Disabling pubsub does not make my peers get kicked out so I think it is pubsub or you mean something else? What I know is that after the upgrade with no change of config whatsoever, this issue arose. I would also like to note that disabling gossip publishing does NOT fix the issue.

@vyzo
Copy link
Collaborator

vyzo commented Mar 27, 2023

pubsub doesn't do anything out of the ordinary, so it is libp2p bug triggered by pubsub probably.

@Giulio2002
Copy link
Author

uhm, okay so I will try to get the debug logs for @MarcoPolo then, I think that is my only choice in light of this plot twist

@iulianpascalau
Copy link
Contributor

iulianpascalau commented Jun 20, 2023

I think this also happened to me also. Trying to bump from v0.8.1 to v0.9.3.
When comparing this v0.8.1...v0.9.3 I've stumbled across this
image
which I have trouble understanding.
Before, it was ok, if the read bytes could have not been able to be unmarshalled to a rpc.RPC then the peer was considered dead. But now we have _, err := s.Read([]byte{0}). This can actually happen in a normal case where the pubsub fetched 1 byte. err is nil, a log debug that nobody reads appeared but then the peer was considered dead.

@vyzo
Copy link
Collaborator

vyzo commented Jun 20, 2023

Pubsub streams are unidirectional.

Prevously it would read incoming data and ignore them, now it resets the stream, which makes it stricter.

Are you interoping with a faulty peer?

@iulianpascalau
Copy link
Contributor

iulianpascalau commented Jun 20, 2023

no, just a peer that uses the following libp2p libs versions:

github.com/libp2p/go-libp2p v0.22.0
github.com/libp2p/go-libp2p-kad-dht v0.18.0
github.com/libp2p/go-libp2p-kbucket v0.4.7
github.com/libp2p/go-libp2p-pubsub v0.8.1
github.com/multiformats/go-multiaddr v0.6.0

We have a go-libp2p-pubsub fork and now we are trying with the reverted code for the handlePeerDead function to track if that is the cause. I've suspected pubsub lib for the following reason:
The MultiversX node (ex Elrond Network) handles in 2 way the connected peers for various tasks:

  1. the connections returned by the host.Network().Conns() function are handled in the network sharding subsystem
  2. the pubsub.ListPeers() function is used to determine the correct connected peers to request data from (since we are a sharding protocol).
    With v0.9.3 we've seen some discrepancies between what the host.Network().Conns() function returned versus the pubsub.ListPeers() when running a network with half of the nodes on the lib versions from above (older libs) and half of the network on the latest libp2p libs versions and go compiler v1.20.3.

By discrepancies, I mean the host.Network().Conns() returns a list of connections the node has, pubsub.ListPeers() returns 0 connected peers on topics that normally should have peers.

@iulianpascalau
Copy link
Contributor

running with debug logs enabled, I've stumbled across this section when the discrepancies between the host's network and pubsub became evident:

2023-06-23T09:17:15.225Z	DEBUG	pubsub	go-libp2p-pubsub@v0.9.3/gossipsub.go:526	PEERUP: Add new peer 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN using /meshsub/1.1.0
2023-06-23T09:17:15.225Z	DEBUG	pubsub	go-libp2p-pubsub@v0.9.3/comm.go:184	writing message to 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN: stream reset
2023-06-23T09:17:15.225Z	DEBUG	pubsub	go-libp2p-pubsub@v0.9.3/gossipsub.go:555	PEERDOWN: Remove disconnected peer 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.234Z	DEBUG	basichost	basic/basic_host.go:737	host 16Uiu2HAmS9igZyR75wKEJ9GgpgtRGTKZqLxiSJmenqaRQ8HsgNfC dialing 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.241Z	DEBUG	pubsub	go-libp2p-pubsub@v0.9.3/pubsub.go:695	already have connection to peer: 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.246Z	DEBUG	pubsub	go-libp2p-pubsub@v0.9.3/pubsub.go:695	already have connection to peer: 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.252Z	DEBUG	pubsub	go-libp2p-pubsub@v0.9.3/pubsub.go:695	already have connection to peer: 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.304Z	DEBUG	pubsub	go-libp2p-pubsub@v0.9.3/validation.go:476	validation done; took 264.307µs

I think the PEERDOWN event shouldn't have happened 🤔

@vyzo
Copy link
Collaborator

vyzo commented Jun 23, 2023

Interesting, seems like an issue with multiple connections.
Maybe a bug with the reset handling, can you investigate further around there?

@iulianpascalau
Copy link
Contributor

Yes, it takes ~1h-2h for each test but we will pursue finding what is wrong.
I will add some more logs on our pubsub fork and we will post here the results. The work will be carried out in this branch multiversx#11

@iulianpascalau
Copy link
Contributor

I have this sequence on 2 peers that are both running the pubsub 0.9.3 version. Something went wrong on the muxer (please note that besides pubsub, we've implemented a custom direct-send protocol that makes use of separate streams):

PID 16Uiu2HAkuyCrWCEBtubkbhfUKMuw9BJtL7Fe799v1rUnazbcwkFG logs:

DEBUG[2023-06-23 16:03:48.496]   refreshed peers on topic                 topic = unsignedTransactions_1_2 connected peers =  
2023-06-23T16:03:48.734Z	DEBUG	pubsub	go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/gossipsub.go:526	PEERUP: Add new peer 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN using /meshsub/1.1.0
2023-06-23T16:03:48.734Z	DEBUG	pubsub	go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:170	trying to write message to 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN...
2023-06-23T16:03:48.734Z	DEBUG	pubsub	go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:151	peer dead 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN, error: stream reset, buff: 00
2023-06-23T16:03:48.734Z	DEBUG	pubsub	go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:172	message wrote to 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN, error: stream reset
2023-06-23T16:03:48.734Z	DEBUG	pubsub	go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:188	writing message to 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN: stream reset
2023-06-23T16:03:48.734Z	DEBUG	pubsub	go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/gossipsub.go:555	PEERDOWN: Remove disconnected peer 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN

PID 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN logs:

2023-06-23T16:03:48.736Z	DEBUG	pubsub	go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:115	opening new stream to peer: stream reset16Uiu2HAkuyCrWCEBtubkbhfUKMuw9BJtL7Fe799v1rUnazbcwkFG
2023-06-23T16:03:48.736Z	DEBUG	basichost	basic/basic_host.go:413	protocol mux failed: stream reset (took 1.052892ms)
2023-06-23T16:03:48.736Z	DEBUG	basichost	basic/basic_host.go:754	host 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN finished dialing 16Uiu2HAkuyCrWCEBtubkbhfUKMuw9BJtL7Fe799v1rUnazbcwkFG
2023-06-23T16:03:48.736Z	DEBUG	dht	go-libp2p-kad-dht@v0.23.0/query.go:554	connected. dial success.
2023-06-23T16:03:48.736Z	DEBUG	basichost	basic/basic_host.go:737	host 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN dialing 16Uiu2HAkuyCrWCEBtubkbhfUKMuw9BJtL7Fe799v1rUnazbcwkFG
2023-06-23T16:03:48.736Z	DEBUG	basichost	basic/basic_host.go:433	negotiated: /meshsub/1.1.0 (took 1.304288ms)

will try to dig further based on these logs

@iulianpascalau
Copy link
Contributor

After trying with the new go-libp2p version v0.28.1 everything worked fine: no more connectedness discrepancies between the pubsub and the host, or at least not detectable. Probably the streams do not reset after normal message write/read.
We also tested the backward compatibility with the pubsub v0.8.0 & go-libp2p 0.22.0 and we did not find any issues.
The final go.mod imports that we use and found that are working properly are these:

github.com/gogo/protobuf v1.3.2
github.com/ipfs/go-log v1.0.5
github.com/jbenet/goprocess v0.1.4
github.com/libp2p/go-libp2p v0.28.1
github.com/libp2p/go-libp2p-kad-dht v0.23.0
github.com/libp2p/go-libp2p-kbucket v0.6.3
github.com/libp2p/go-libp2p-pubsub v0.9.3
github.com/multiformats/go-multiaddr v0.9.0

@vyzo
Copy link
Collaborator

vyzo commented Jun 29, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants