Intermittent connection drops with Edgevpn 0.10.0/libp2p 0.18.0-rc5 leaves disconnected peers #12

Closed
mudler opened this issue Feb 24, 2022 · 17 comments


@mudler
Owner

mudler commented Feb 24, 2022

After about 30 minutes of usage, I started to notice constant connection drops between peer nodes. The issue persists because connections don't seem to be re-established between nodes automatically, leaving peers disconnected. The only workaround is restarting the service.

This seems to be tied to the recent libp2p bump to 0.18.0-rc5. I'm not sure whether it's due to the rcmgr configuration or something else. I still can't trace it, but this is what I'm seeing at a behavioral level:

while opening multiple streams over a single connection, the connection eventually gets killed, and it seems the node can't recover and reconnect to it again.

This seems to be an issue even with small streams: where I was previously pushing GBs of traffic between nodes just fine, connections now don't hold even for simple HTTP requests.

@vyzo / @marten-seemann sorry to ping you directly again, I don't want to sound annoying either. I'm seeing weird issues with 0.18.0-rc5 here. I'm not sure whether it's due to the rcmgr configuration or something else; I still can't trace it down to useful debug information, but the effect is quite noticeable at a behavioral level.
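For context, the traffic pattern that triggers this is roughly: open several streams over one connection and push data through them. A minimal sketch of that pattern with the plain go-libp2p host API (the protocol ID, payload, and stream count are illustrative, not edgevpn's actual wiring):

```go
// Package example sketches the traffic pattern described above; it is not
// edgevpn's actual code. protoID and the payload are placeholders.
package example

import (
	"context"

	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/peer"
	"github.com/libp2p/go-libp2p-core/protocol"
)

const protoID = protocol.ID("/example/push/1.0.0")

// pushData opens several streams to the same peer over the already
// established connection (each stream is multiplexed by yamux/mplex)
// and writes a payload on each one.
func pushData(ctx context.Context, h host.Host, p peer.ID, payload []byte) error {
	for i := 0; i < 8; i++ {
		s, err := h.NewStream(ctx, p, protoID)
		if err != nil {
			return err
		}
		if _, err := s.Write(payload); err != nil {
			s.Reset()
			return err
		}
		if err := s.Close(); err != nil {
			return err
		}
	}
	return nil
}
```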

mudler added a commit that referenced this issue Feb 24, 2022
It seems there are issues with the new rc regarding connections.
While figuring out what's wrong, downgrade to the last known good
version.

See #12
@vyzo

vyzo commented Feb 24, 2022

Keep us in the loop, v0.18 is an important release and we want to iron out all the issues.

Are you using bitswap by any chance?

Another pointer: I suspect there might be a bug in yamux that makes it incapable of responding correctly to a refusal to increase the window, but that's still only a theory at this point.

@mudler
Owner Author

mudler commented Feb 24, 2022

Keep us in the loop, v0.18 is an important release and we want to iron out all the issues.

Sure will do 👍 , thanks!

Are you using bitswap by any chance?

Nope, things here are much simpler: we just send one block over to the nodes (no real PoW is implemented, it's just used as a sync mechanism) and there is no block syncing (yet?), so it's tied more to the libp2p core modules and a simple pub/sub mechanism that is just an extension of the libp2p examples (see the sketch at the end of this comment).

Another pointer: I suspect there might be a bug in yamux that makes it incapable of responding correctly to a refusal to increase the window, but that's still only a theory at this point.

I'll keep my eyes open there, thanks for the hint!
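For context, the pub/sub wiring mentioned above is essentially the stock gossipsub setup from the libp2p examples; a minimal sketch, assuming go-libp2p-pubsub (the topic name and function are illustrative, not edgevpn's actual code):

```go
package example

import (
	"context"

	"github.com/libp2p/go-libp2p-core/host"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// joinBlockTopic sets up gossipsub on an existing host and subscribes to a
// single topic used to distribute blocks between the nodes.
func joinBlockTopic(ctx context.Context, h host.Host) (*pubsub.Topic, *pubsub.Subscription, error) {
	gs, err := pubsub.NewGossipSub(ctx, h)
	if err != nil {
		return nil, nil, err
	}
	// "blockchain" is a placeholder topic name.
	topic, err := gs.Join("blockchain")
	if err != nil {
		return nil, nil, err
	}
	sub, err := topic.Subscribe()
	if err != nil {
		return nil, nil, err
	}
	return topic, sub, nil
}
```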

@mudler mudler changed the title Intermittent connection drops with Edgevpn 0.10.0/libp2p 0.18.0-rc5 Intermittent connection drops with Edgevpn 0.10.0/libp2p 0.18.0-rc5 leaves disconnected peers Feb 24, 2022
mudler added a commit to kairos-io/kairos that referenced this issue Feb 24, 2022
@vyzo

vyzo commented Feb 25, 2022

Can you also check whether mplex is involved? You probably don't need it at all; can you try limiting the muxer to just yamux?
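For reference, restricting go-libp2p to yamux only is typically done through the libp2p.Muxer option; a minimal sketch, assuming go-libp2p 0.18.x and go-libp2p-yamux (option names and constructor signatures may differ slightly across versions):

```go
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
	yamux "github.com/libp2p/go-libp2p-yamux"
)

func main() {
	// Register only the yamux multiplexer, leaving mplex out entirely.
	// Passing an explicit Muxer option overrides the default muxer set.
	// (Older go-libp2p releases take a context as the first argument to New.)
	h, err := libp2p.New(
		libp2p.Muxer("/yamux/1.0.0", yamux.DefaultTransport),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()
	log.Println("host started with yamux-only muxing:", h.ID())
}
```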

@mudler
Owner Author

mudler commented Feb 25, 2022

Can you also check whether mplex is involved? You probably don't need it at all; can you try limiting the muxer to just yamux?

I'll give it a shot and try to collect as much info as possible, thanks for the pointers! The fact that nodes can't re-establish a connection afterwards should help trace it. I'll capture logs from the libp2p components at debug log level and try to grab them at that exact moment to get a clearer picture of what's going on.

@vyzo

vyzo commented Feb 27, 2022

Can you try either disabling mplex, or running with libp2p/go-mplex#99?

@mudler
Owner Author

mudler commented Feb 27, 2022

Can you try either disabling mplex, or running with libp2p/go-mplex#99?

Going to try that, thanks! Although I can only test later in the day as I'm AFK now; I'll let you know as soon as I'm at it and keep you in the loop.

@mudler
Owner Author

mudler commented Mar 1, 2022

I'm sorry, I didn't have time to get back to this over the weekend. I still have to set up my test environment to reproduce the issue, as doing it manually is a time-consuming process (I observed this while setting up Kubernetes clusters on top of it, and that's the most straightforward way for me to reproduce it). I'll look at it during the week and keep you posted.

@mudler
Owner Author

mudler commented Mar 4, 2022

I'm following the discussions on the PRs; I'll cut a specific version with libp2p/go-libp2p#1350 later and check it out.

mudler added a commit that referenced this issue Mar 4, 2022
@mudler
Owner Author

mudler commented Mar 5, 2022

I'm trying to set up a small automated test running on GHA to be able to narrow it down. It seems the problem is still there (https://github.com/mudler/edgevpn/runs/5432147596?check_suite_focus=true); in that run I'm trying to send a 2GB file between two nodes.

I will enhance it to also collect pprof and libp2p debug logs, so as to have a better view of it. This could also have been something flaky; the test setup right now is really simplistic (at the moment it's just bash, so it's a bit hard to debug; I'll move it to Golang soon so I can build a more interesting scenario).

@vyzo

vyzo commented Mar 5, 2022

You can also get logs with GOLOG_LOG_LEVEL=debug; there should be some hints there.
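For reference, the same effect can also be obtained programmatically through go-log rather than the environment variable; a minimal sketch, assuming go-log/v2 (the subsystem names are examples, not an exhaustive list):

```go
package main

import (
	logging "github.com/ipfs/go-log/v2"
)

func main() {
	// Equivalent to GOLOG_LOG_LEVEL=debug: raise every registered
	// subsystem logger to debug level.
	logging.SetAllLoggers(logging.LevelDebug)

	// Or target specific subsystems to keep the output manageable.
	_ = logging.SetLogLevel("swarm2", "debug")
	_ = logging.SetLogLevel("rcmgr", "debug")
}
```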

@mudler
Owner Author

mudler commented Mar 16, 2022

I just did a test in my homelab with multiple VMs and everything seems good here! I'll cut a release and test it a bit more in a bigger scenario. So far the connections between the nodes seem stable again, with no drops at all! I'll keep you posted if I notice something strange.

@mudler mudler closed this as completed Mar 17, 2022
@mudler
Owner Author

mudler commented Mar 17, 2022

I've cut v0.11.0 with libp2p 0.18.0-rc6, thanks! I'll keep you in the loop if I spot something.

@vyzo

vyzo commented Mar 17, 2022

great, thank you!

@mudler mudler reopened this Mar 17, 2022
@mudler
Owner Author

mudler commented Mar 17, 2022

Alright, it seems that while testing on a bigger scale I'm seeing the same issues: nodes intermittently drop off and don't connect back again. I've also cut a release of c3os with it, where the issue can be observed too: https://github.com/c3os-io/c3os/releases/tag/v1.21.4-36

@mudler
Owner Author

mudler commented Mar 17, 2022

There is a slight difference, though: it seems to happen when I start sending over big chunks of data. It survives pings and other light traffic just fine.

mudler added a commit that referenced this issue Mar 17, 2022
Also reverts rcmgr configurations

See #12
@mudler
Owner Author

mudler commented Mar 18, 2022

OK, disabling the rcmgr makes everything work as usual, so it probably has to do with the default limits. I was using the rcmgr defaults in my first attempts, so maybe those were indeed too conservative.

I'm going to disable the rcmgr by default and play around with it until I get some good defaults by running benchmarks, and maybe reuse the same maxConns approach as in Lotus to see if those defaults suit my case as well (it might not fit very well on Pis, but we shall see :) ).
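For reference, opting out of the resource manager is typically done by passing the no-op implementation to the host constructor; a minimal sketch, assuming go-libp2p 0.18.x with go-libp2p-core (option and package paths may differ in other versions):

```go
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p-core/network"
)

func main() {
	// Swap the default resource manager for the no-op implementation,
	// effectively disabling all rcmgr limits (connections, streams, memory).
	h, err := libp2p.New(
		libp2p.ResourceManager(network.NullResourceManager),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()
	log.Println("host started with the resource manager disabled:", h.ID())
}
```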

@vyzo

vyzo commented Mar 18, 2022

Yeah, the default inbound connection limit is very conservative.

@mudler mudler closed this as completed in e0ccd8c Mar 18, 2022