Vulcan-node instability? #976
Yup. My strategy moving forward on this in particular is a kernel upgrade. This has been frustrating, but I'm noticing it too.
Are all the nodes the same, or, since vulcan-node is one of the older ones, does it have an older kernel? I noticed the Docker engine version is only slightly behind...
I upgraded all the kernels, so even vulcan-node is up to date on that
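For reference, a quick sketch of how to double-check kernel and engine parity across the nodes (the hostnames below are placeholders, not our actual inventory):

```bash
# Compare kernel, OS release, and Docker engine versions across nodes
# (hostnames are illustrative; substitute the real node names)
for host in vulcan-node node-2 node-3; do
  echo "== $host =="
  ssh "$host" 'uname -r; lsb_release -ds; docker version --format "{{.Server.Version}}"'
done
```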
I'm still pondering what's happening; I believe it is networking-related and potentially an interaction with VMware.
Sorry, my 'strategy' on this is not a kernel upgrade (we already did that) but potentially an Ubuntu upgrade. I'm wondering if changes to the Ubuntu networking stack could help.
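If we do go that route, the standard path would be Ubuntu's release upgrader; a minimal sketch (one node at a time, with a snapshot beforehand since these are VMware guests):

```bash
# Bring the current release fully up to date first
sudo apt update && sudo apt full-upgrade

# Then step to the next Ubuntu release (interactive)
sudo do-release-upgrade
```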
Hm, is an older Ubuntu stack something that you would expect to be affecting the other nodes, too? I haven't seen the other ones get stuck, only `vulcan-node`.
OK, so the `journalctl -xef` output is very noisy and it has LOTS of warnings, but I wanted to be clear: these are actually false-positive issues. Messages like these are safe. The peer operations are simply failing on the vulcan default network because of the archimedes containers, which come up and go down every 10 seconds. They are there mostly to keep local copies of the archimedes image (since archimedes isn't being run as a true service atm). Can confirm that stopping vulcan causes these messages to go away and has no impact on overall service.

Things related to udevd, or things related to the veth interfaces (such as the persistent MAC address warnings), are also just noisy warnings, caused by modern Linux kernels not being super compatible with VXLAN (Virtual Extensible LAN) devices under default Ubuntu settings. In reality, the network behavior of these bridges works fine. So I'm narrowing down on what's really happening bit by bit.
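To cut through that noise while I keep digging, one simple filter (standard journalctl options, nothing exotic) is to follow only error-and-above messages from the Docker unit:

```bash
# Follow the Docker daemon journal, showing only priority "err" and above,
# so the vxlan/veth/udevd warning spam is filtered out
journalctl -u docker.service -p err -f
```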
I'm also turning off our backup jobs. I am curious if somehow those jobs are putting undue pressure on the system, but I would be surprised if so.
I see, that's quite a lot of warning messages. Good to know they don't impact the service, though! I did see some sporadic Error messages -- do you think those are related?
I'm still digging through various errors and trying things; is there one in particular that caught your eye?
Just documenting notes here, not sure anything is very conclusive or helpful:
I've had to restart the `docker` process on `vulcan-node` last Friday and this morning, when things seem stuck. "Stuck" being the following symptoms:

- When I check medusa, it looks like `vulcan-node` is unreachable? First output in the following screenshot.
- Try stopping the portainer agent on `vulcan-node`, and it kind of hangs. Can restart the `portainer-agent` container on other nodes, no problem.
- Wind up restarting docker on `vulcan-node`, with `systemctl restart docker`. Node is then reachable (second output in the above screenshot), Portainer works, and Metis seems accessible.

Happened today just before standup, around 8:45am? Notably, around that time you can see some error logs on `vulcan-node` for docker, but I'm not really sure if they are relevant or the timing is coincidence ... the logs are very chatty.
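For my own notes, the checks above boil down to roughly this (a sketch; I'm assuming this is a Swarm cluster with medusa acting as a manager, which is how I've been reading the node status):

```bash
# On medusa (assuming it's the Swarm manager): check node availability
docker node ls

# On vulcan-node: pull only the Docker daemon logs from the incident window
journalctl -u docker.service --since "08:30" --until "09:00" --no-pager

# Last resort when the node is wedged: restart the engine
sudo systemctl restart docker
```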
Grafana shows no noticeable uptick in CPU or memory usage at that time for this node; if anything, there's actually a CPU drop (probably from when I restarted docker).
Seems suspicious that it's been vulcan-node both times, and restarting seems to make things better.
Going to keep digging and collect notes here. Not really sure if upgrading to a newer Docker engine version might help, à la this similar issue, or perhaps something else...
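If I do try the engine upgrade, it would look roughly like this; a sketch, assuming the nodes install Docker CE from Docker's apt repository:

```bash
# Check the current engine version first
docker version --format '{{.Server.Version}}'

# Upgrade the engine packages from Docker's apt repo
sudo apt-get update
sudo apt-get install --only-upgrade docker-ce docker-ce-cli containerd.io

# Restart the engine to pick up the new version
sudo systemctl restart docker
```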