
Lotus sync issue: libp2p 0.31.1 to 0.33.2 regression #2764

Open
Stebalien opened this issue Apr 11, 2024 · 18 comments

Comments

@Stebalien (Member) commented Apr 11, 2024

We've seen reports of a chain-sync regression between lotus 1.25 and 1.26. Notably:

  1. We updated go-libp2p from v0.31.1 to v0.33.2.
  2. I've seen reports of peers failing to resume sync after transient network issues.
  3. Users are reporting "low" peer counts.

We're not entirely sure what's going on, but I'm starting an issue here so we can track things.

@Stebalien (Member Author) commented Apr 11, 2024

My first guess, given (2), is libp2p/specs#573 (comment). This is unconfirmed, but high on my list.

  • Test: does disabling TCP reuseport fix this? (Sketch below.)
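
If anyone wants to test that, here is a minimal sketch of a host with the TCP transport's reuseport behavior turned off, using go-libp2p's tcp.DisableReuseport option; the rest of the wiring (listen address, etc.) is illustrative:

```go
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
	tcp "github.com/libp2p/go-libp2p/p2p/transport/tcp"
)

func main() {
	// Build a host whose TCP transport does NOT set SO_REUSEPORT, so
	// outbound dials use ephemeral source ports instead of reusing the
	// listen port.
	h, err := libp2p.New(
		libp2p.Transport(tcp.NewTCPTransport, tcp.DisableReuseport()),
		libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/4001"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()
	log.Println("host up:", h.ID(), h.Addrs())
}
```

If I recall correctly, go-libp2p also honors a LIBP2P_TCP_REUSEPORT environment variable, which would avoid rebuilding for the test.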

@Stebalien (Member Author)

My second guess is #2650. This wouldn't be the fault of libp2p, but TLS may be more impacted by the GFW? That seems unlikely...

@Stebalien (Member Author)

My third guess is something related to QUIC changes.

@MarcoPolo (Contributor)

Have you been able to reproduce (2) or (3) locally?

  • For the GFW theory, we could try connecting to peers over both TLS and Noise and see whether there's a difference (see the sketch after this list).
  • Can you run lotus 1.26 on the older version of go-libp2p and see if you still see any errors?
  • Is the transient network issue something that would affect my connectivity to everyone, or only to a subset of peers? E.g., is my internet down, or is my connection to a subset down?
  • For a typical well-behaved node, what's the breakdown of connection types (TCP+TLS, QUIC, TCP+Noise)? For a node seeing this regression, what is its breakdown?
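
On the TLS-vs-Noise point, one way to set up the comparison is to pin a test node to a single security transport. A minimal sketch, assuming standard go-libp2p options (libp2p.Security with the noise and tls packages); the actual dialing and measurement harness is left out:

```go
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	noise "github.com/libp2p/go-libp2p/p2p/security/noise"
	libp2ptls "github.com/libp2p/go-libp2p/p2p/security/tls"
)

// newHost builds a host that speaks exactly one security protocol, so any
// connectivity difference (e.g. interference with TLS) shows up as a
// difference between the TLS-only and Noise-only hosts.
func newHost(useTLS bool) (host.Host, error) {
	sec := libp2p.Security(noise.ID, noise.New)
	if useTLS {
		sec = libp2p.Security(libp2ptls.ID, libp2ptls.New)
	}
	return libp2p.New(sec, libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/0"))
}

func main() {
	for _, useTLS := range []bool{true, false} {
		h, err := newHost(useTLS)
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("tls=%v id=%s addrs=%v", useTLS, h.ID(), h.Addrs())
		h.Close()
	}
}
```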

@Stebalien (Member Author)

I can't repro this at the moment, unfortunately (not at home, node down). But I'll do some more digging later this week.

@Stebalien (Member Author)

Ok, I got one confirmation that disabling reuseport seems to fix the issue and one report that it makes no difference.

@Stebalien (Member Author)

Ok, that confirmation appeared to be a fluke. This doesn't appear to have been the issue.

@sukunrt (Member) commented Apr 25, 2024

From eyeballing the commits, the major changes apart from WebRTC are:

  • upgraded QUIC
  • implemented Happy Eyeballs for TCP
  • removed multistream simultaneous connect

Can we test this with a QUIC-only node and a TCP-only node to see whether the problem is with QUIC or TCP? (Sketch below.)
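
A minimal sketch of what those two test nodes could look like, assuming go-libp2p's single-transport options (quic.NewTransport and tcp.NewTCPTransport); everything else is left at defaults:

```go
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
	quic "github.com/libp2p/go-libp2p/p2p/transport/quic"
	tcp "github.com/libp2p/go-libp2p/p2p/transport/tcp"
)

func main() {
	// QUIC-only node: a single transport and a single listen address,
	// so every connection it makes or accepts is QUIC.
	quicHost, err := libp2p.New(
		libp2p.Transport(quic.NewTransport),
		libp2p.ListenAddrStrings("/ip4/0.0.0.0/udp/0/quic-v1"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer quicHost.Close()

	// TCP-only node, for the other side of the comparison.
	tcpHost, err := libp2p.New(
		libp2p.Transport(tcp.NewTCPTransport),
		libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/0"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer tcpHost.Close()

	log.Println("quic:", quicHost.Addrs(), "tcp:", tcpHost.Addrs())
}
```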

@Stebalien (Member Author)

I'll try. Unfortunately, the issue is hard to reproduce and tends to happen in production (it's hard to get people to run random patches). Right now we're waiting on goroutine dumps, hoping to get some idea of what might be stuck (e.g., it may not be libp2p).
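
For anyone else collecting dumps: the stock Go way to capture a goroutine dump from inside a running process is the runtime/pprof goroutine profile (Lotus has its own tooling around this; the sketch below is just the generic approach):

```go
package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes a full stack dump of every goroutine to the given
// file. debug=2 emits the same text format as a SIGQUIT crash dump, which
// is the most useful format when hunting for stuck goroutines.
func dumpGoroutines(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return pprof.Lookup("goroutine").WriteTo(f, 2)
}

func main() {
	if err := dumpGoroutines("goroutines.txt"); err != nil {
		panic(err)
	}
}
```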

@vyzo (Contributor) commented Apr 25, 2024

It might be the silently broken PX (peer exchange); see libp2p/go-libp2p-pubsub#555.

@vyzo (Contributor) commented Apr 25, 2024

I am almost certain this is the culprit, as bootstrap really relies on it.

@Stebalien (Member Author)

Ah... that would definitely explain it.

@MarcoPolo (Contributor)

I thought that could be it as well, but I was thrown off by the premise that this wasn't an issue in v0.31.1.

PX broke after this change: #2325, which was included in the v0.28.0 release. So v0.31.1 should have the same PX issue.
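
For readers following along: PX is gossipsub's peer exchange, where a peer that prunes you from its mesh hands you records of other peers so you can re-bootstrap connections. A minimal sketch of a node opting in via go-libp2p-pubsub's WithPeerExchange option (the rest of the setup is illustrative):

```go
package main

import (
	"context"
	"log"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func main() {
	ctx := context.Background()
	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()

	// Enable gossipsub peer exchange: on PRUNE, this node includes
	// records of other mesh members, which the pruned peer can use to
	// find new connections. If PX is silently broken (see
	// go-libp2p-pubsub#555), pruned nodes lose this bootstrap path.
	ps, err := pubsub.NewGossipSub(ctx, h, pubsub.WithPeerExchange(true))
	if err != nil {
		log.Fatal(err)
	}
	_ = ps
}
```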

@vyzo (Contributor) commented Apr 25, 2024

I can't imagine what else it could be.
Was there a recent "mandatory release" where everyone upgraded to the more recent libp2p?

@MarcoPolo (Contributor)

> Users are reporting "low" peer counts.

Are these low peer counts the number of peers in your gossipsub mesh, or the number of peers you are actually connected to?

@sukunrt (Member) commented Apr 25, 2024

Do we know if these nodes are running both QUIC and TCP? If yes, it's unlikely that the problem is with either transport; it's probably at a layer above the go-libp2p transports.

@rjan90 commented May 3, 2024

> Are these low peer counts the number of peers in your gossipsub mesh, or the number of peers you are actually connected to?

Just chiming in here from the Lotus side: it's the number of peers we are connected to. After upgrading to 0.33.2 the count is around:

    lotus info
    Network: mainnet
    Peers to: [publish messages 105] [publish blocks 106]

On the previous version (0.33.1), it was stable around the 200 range.

@MarcoPolo (Contributor)

> Just chiming in here from the Lotus side: it's the number of peers we are connected to. After upgrading to 0.33.2 the count is around:
>
>     lotus info
>     Network: mainnet
>     Peers to: [publish messages 105] [publish blocks 106]
>
> On the previous version (0.33.1), it was stable around the 200 range.

I think these are the number of peers in your gossipsub topic mesh, a subset of the peers you are actually connected to. Could you find the number of peers you are connected to, and compare that between versions? (Sketch of the two counts below.)
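
To make the two counts concrete, here is a minimal sketch of reading both via standard go-libp2p APIs; the topic name is a made-up example, not Lotus's actual topic:

```go
package main

import (
	"context"
	"log"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func main() {
	ctx := context.Background()
	h, err := libp2p.New()
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()

	ps, err := pubsub.NewGossipSub(ctx, h)
	if err != nil {
		log.Fatal(err)
	}

	const topic = "/example/blocks" // hypothetical topic name

	// Total swarm connections: every peer this host is connected to.
	connected := len(h.Network().Peers())

	// Topic peers: the subset of connected peers known to be subscribed
	// to this topic, which is roughly what per-topic counters report.
	topicPeers := len(ps.ListPeers(topic))

	log.Printf("connected=%d topic(%s)=%d", connected, topic, topicPeers)
}
```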
