tokio-runtime segfault #1993

Closed
lightclient opened this issue Nov 27, 2020 · 15 comments

Comments

@lightclient
Contributor

lightclient commented Nov 27, 2020

Description

I encountered some db corruption, and I believe the root cause is that lighthouse crashes regularly due to an issue with the tokio runtime. I'm continuing to debug. One concern is that the VPS I am running on doesn't have stellar stability reviews, and I'm wondering if that is causing the runtime to go awry. However, I have managed to maintain 100% uptime on my go-ethereum nodes.

Version

I've run into this issue on 5828ff1, v1.0, and v1.0.1.

Present Behaviour

The tokio-runtime segfaults after a period of time:

tokio-runtime-w[3460]: segfault at 470ba01feea ip 00005641f1424f1b sp 00007f59e8ee21f8 error 6 in lighthouse[5641ef75e000+1d9e000]
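
For readers triaging a similar crash, the kernel line above can be unpacked a little further. The sketch below is not part of the original report. It assumes the usual x86 Linux log layout (hexadecimal fields, with the error value being the page-fault error code whose low three bits mean protection violation, write, and user mode), and `decode_segfault` is a hypothetical helper name. The computed offset is relative to the mapping the kernel reported, so turning it into a symbol still depends on how the binary was linked and on having debug symbols.

```python
import re

# Pattern for the x86 Linux kernel "segfault at ..." log line.
# Every numeric field except the pid is printed in hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+) "
    r"in (?P<object>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a kernel segfault line into its fields (hypothetical helper)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    error = int(m["error"], 16)
    return {
        # Thread name as logged; the kernel truncates it to 15 characters.
        "thread": m["comm"],
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),
        "object": m["object"],
        # Offset of the faulting instruction inside the reported mapping.
        "ip_offset": ip - base,
        # Page-fault error code bits (x86).
        "protection_violation": bool(error & 0x1),  # unset: page not present
        "write": bool(error & 0x2),                 # unset: read access
        "user_mode": bool(error & 0x4),             # unset: kernel mode
    }


LINE = (
    "tokio-runtime-w[3460]: segfault at 470ba01feea ip 00005641f1424f1b "
    "sp 00007f59e8ee21f8 error 6 in lighthouse[5641ef75e000+1d9e000]"
)
info = decode_segfault(LINE)
print(hex(info["ip_offset"]), info["write"], info["user_mode"])
```

Read this way, error 6 is a user-mode write to an address with no page mapped. The thread name only shows that the fault happened on a tokio worker thread; on its own it does not show that the faulty code lives in tokio.
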
@lightclient
Contributor Author

Debugging this further, it seems that anytime I get close to maxing out the RAM it crashes.
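
One way to check the correlation with memory pressure is to log available memory at intervals and compare the timestamps against the segfault lines in the kernel log. The following is a minimal sketch, not something used in this thread. It assumes a Linux host exposing `/proc/meminfo` with the `MemTotal` and `MemAvailable` fields, and the helper names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            # Most fields are reported in kB; the unit suffix is dropped here.
            fields[name.strip()] = int(parts[0])
    return fields


def available_fraction(fields):
    """Fraction of total memory the kernel estimates is still available."""
    return fields["MemAvailable"] / fields["MemTotal"]


def read_available_fraction(path="/proc/meminfo"):
    """Read the live value on a Linux host (path is the standard location)."""
    with open(path) as f:
        return available_fraction(parse_meminfo(f.read()))
```

Calling `read_available_fraction()` from a small loop that also prints a timestamp would give a record to line up against the crash times.
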

@michaelsproul
Member

We haven't seen this issue on any other hardware, so it would be super interesting to know what's unique about your setup that's causing this.

It might be a bug in Tokio related to the specific (janky?) hypervisor your VPS provider uses. You could run something like slabbed-or-not to try and work out which hypervisor that might be https://github.com/kaniini/slabbed-or-not

Some of the mysterious database errors we've seen have also been from people running under a hypervisor, which might just be a coincidence, but I'm not sure.

The bug may happen to be fixed by the Tokio 0.3 change (which is almost ready for release in v1.0.2).

@lightclient
Contributor Author

I'm running on Contabo VPS.

$ ./slabbed-or-not
Not running under any known container type
Hypervisor: KVM

I'll keep an eye out for the Tokio 0.3 release :)

@AgeManning
Member

The tokio 0.3 release has been merged. Let us know if the issue persists.

I believe one cause of this issue was running with a high peer count (likely more than the computer can handle).

@AgeManning
Member

I'm going to close this issue, assuming it has been resolved. Please re-open if the issue persists.

@lightclient
Contributor Author

It does appear a bit more stable, but I'm still seeing the same issue on 2383bfe with 50 peers. I will try to ramp down to 30 to see if it improves.

@lightclient
Contributor Author

@AgeManning FYI, it doesn't look like I have the ability to reopen.

@michaelsproul reopened this Nov 29, 2020

@AgeManning
Member

hmm are you running any strange hardware? What OS?

@AgeManning
Member

I only saw one reported issue of segfaults in tokio recently, and it looks like it's been fixed in 0.3: tokio-rs/tokio#3019

@lightclient
Contributor Author

@AgeManning running Ubuntu 20.04 on a VPS with 8 GB RAM and 4 Xeon vCores. The crash is relatively consistent; I've reprovisioned a few times only to find the same error. Maybe this week I can figure out how to get a proper core dump to share.

@pawanjay176
Member

pawanjay176 commented Nov 30, 2020

@lightclient which lighthouse version are you running?

@lightclient
Contributor Author

@pawanjay176 I've run all of the following: 5828ff1, v1.0 (c6baa0e), v1.0.1 (5a3b94c), cut-v1.0.2 (2383bfe), and v1.0.2 (f718309).

@AgeManning
Member

AgeManning commented Nov 30, 2020

Are you building the binary locally on the box?

Can you try running the portable version and see if it also happens there?

make build-x86_64-portable

and use the binary at target/x86_64-unknown-linux-gnu/release/lighthouse

@lightclient
Contributor Author

@AgeManning okay, I'll give that a shot. I've been building locally and have been using the optimized version.

@paulhauner
Member

Closing this since I assume it's been fixed by upgrading tokio. Please reopen if that's not the case.
