Lotus sync issue: libp2p 0.31.1 to 0.33.2 regression #2764
My first guess, given (2), is libp2p/specs#573 (comment). This is unconfirmed, but high on my list.
My second guess is #2650. This wouldn't be the fault of libp2p, but TLS may be more impacted by the GFW? That seems unlikely...
My third guess is something related to QUIC changes.
Have you been able to repro 2 or 3 locally?
I can't repro this at the moment, unfortunately (not at home, node down). But I'll do some more digging later this week.
Ok, I got one confirmation that disabling reuseport seems to fix the issue and one report that it makes no difference.
Ok, that confirmation appeared to be a fluke. This doesn't appear to have been the issue.
From eyeballing the commits, I can see that the major changes apart from WebRTC are
Can we test this with a QUIC-only node and a TCP-only node to see if it's a problem with QUIC or TCP?
I'll try. Unfortunately, the issue is hard to reproduce and tends to happen in production (hard to get people to run random patches). Right now we're waiting on goroutine dumps, hoping to get a bit of an idea about what might be stuck (it may not even be libp2p).
It might be the silently broken PX -- see libp2p/go-libp2p-pubsub#555
I am almost certain this is the culprit, as the bootstrap really relies on it.
AH.. that would definitely explain it.
I thought that could be it as well, but I was thrown off by the premise that this wasn't an issue in v0.31.1. PX broke after this change: #2325, which was included in the v0.28.0 release. So v0.31.1 should have the same PX issue.
I can't imagine what else it could be.
Are these low peer counts the number of peers in your gossipsub mesh, or the number of peers you are actually connected to?
Do we know if these nodes are running both QUIC and TCP? If so, it's unlikely that the problem is with either transport; it's probably at a layer above the go-libp2p transports.
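One way to answer the QUIC-vs-TCP question from the outside is to tally the multiaddrs of a node's open connections by transport and compare the split between versions. The following is only a rough sketch using the Go standard library: `transportOf` and `countByTransport` are hypothetical helpers (not part of go-libp2p or Lotus), the sample addresses are made up, and it assumes you have already collected one multiaddr string per connection by whatever means your node offers.

```go
package main

import (
	"fmt"
	"strings"
)

// transportOf returns a coarse transport label for a multiaddr string.
// Hypothetical helper: it only looks at protocol names and does not
// validate the address. Anything layered on QUIC (e.g. WebTransport)
// is counted as "quic"; anything layered on TCP (e.g. ws) as "tcp".
func transportOf(addr string) string {
	parts := strings.Split(strings.Trim(addr, "/"), "/")
	has := func(name string) bool {
		for _, p := range parts {
			if p == name {
				return true
			}
		}
		return false
	}
	switch {
	case has("quic-v1") || has("quic"):
		return "quic"
	case has("tcp"):
		return "tcp"
	default:
		return "other"
	}
}

// countByTransport tallies addresses per transport label, skipping blanks.
func countByTransport(addrs []string) map[string]int {
	counts := make(map[string]int)
	for _, a := range addrs {
		a = strings.TrimSpace(a)
		if a == "" {
			continue
		}
		counts[transportOf(a)]++
	}
	return counts
}

func main() {
	// Made-up sample input (documentation IP ranges), one address per connection.
	addrs := []string{
		"/ip4/192.0.2.1/tcp/4001",
		"/ip4/192.0.2.1/udp/4001/quic-v1",
		"/ip4/198.51.100.2/tcp/4001",
	}
	fmt.Println(countByTransport(addrs))
}
```

If the connection count drops on only one transport after the upgrade, that points at the transport; if both drop together, a higher layer is the more likely suspect.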
Just chiming in here from the Lotus side: it's the number of peers we are connected to. After upgrading to 0.33.2, the count is around:
On the previous version (0.33.1), it was stable around the 200 range.
I think these are the number of peers in your gossipsub topic mesh, which is a subset of the peers you are actually connected to. Could you find the number of peers you are connected to, and compare that between versions?
We've seen reports of a chain-sync regression between lotus 1.25 and 1.26. Notably:
We're not entirely sure what's going on, but I'm starting an issue here so we can track things.