Vulcan-node instability? #976

Closed
coleshaw opened this issue Aug 1, 2022 · 9 comments

Comments

@coleshaw
Collaborator

coleshaw commented Aug 1, 2022

Just documenting notes here, not sure anything is very conclusive or helpful:

I've had to restart the Docker process on vulcan-node twice now (last Friday and again this morning) when things seemed stuck, "stuck" being the following symptoms:

  • Metis UI behaves sporadically: folder contents aren't returned, but then show up on refresh, etc.
  • Metis Puma throws 500s from the 5.0s timeout on the Postgres connection.
  • Portainer reports "agent unavailable" or "agent not found", and thus cannot manage a stack or even list the stacks.

When I check on medusa, it looks like vulcan-node is unreachable (first output in the following screenshot).

Screenshot from 2022-08-01 12-21-05

Trying to stop the portainer-agent on vulcan-node kind of hangs, while restarting the portainer-agent container on other nodes is no problem.

I wound up restarting Docker on vulcan-node with systemctl restart docker. The node is then reachable (second output in the above screenshot), Portainer works, and Metis seems accessible.
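For reference, this was the rough sequence (a sketch assuming the usual swarm setup with medusa as a manager; exact output differs):

# From the swarm manager (medusa), check whether the node is reachable
docker node ls

# On vulcan-node itself, restart the engine
sudo systemctl restart docker

# Back on the manager, confirm the node reports Ready/Active again
docker node ls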

It happened today just before standup, around 8:45am. Notably, around that time there are some Docker error logs on vulcan-node, but I'm not really sure whether they're relevant or the timing is a coincidence ... the logs are very chatty.

Screenshot from 2022-08-01 12-45-01
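Since the logs are so chatty, one way to narrow them to the window around the hang (the times here are an assumption based on the ~8:45am incident):

# Docker engine logs only, error level and worse, around the time things got stuck
journalctl -u docker.service -p err --since "2022-08-01 08:30" --until "2022-08-01 09:00"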

Grafana shows no noticeable uptick in CPU or memory usage at that time for this node; if anything there's a CPU drop (probably from when I restarted Docker).

Seems suspicious that it's been vulcan-node both times, and restarting seems to make things better.

Going to keep digging and collect notes here. Not really sure if upgrading to a newer docker engine version might help, a la this similar issue, or perhaps something else...
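If it's useful for comparing nodes, a quick sketch for dumping kernel and Docker engine versions across them (the node list is a placeholder, substitute the real hostnames):

for node in vulcan-node medusa; do
  echo "== $node =="
  ssh "$node" 'uname -sr; docker version --format "{{.Server.Version}}"'
done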

@corps
Contributor

corps commented Aug 1, 2022

Yup. My strategy moving forward on this in particular is a kernel upgrade. This has been frustrating, but I'm noticing it too.

@coleshaw
Collaborator Author

coleshaw commented Aug 1, 2022

Are all the nodes the same, or does vulcan-node, being one of the older ones, have an older kernel? I noticed its Docker engine version is only slightly behind...

@corps
Contributor

corps commented Aug 1, 2022

I upgraded all the kernels, so even vulcan-node is up to date on that

root@vulcan-node:~# uname -sr
Linux 4.15.0-176-generic

I'm still pondering what's happening; I believe it is networking-related and potentially an interaction with VMware.
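A couple of quick things to look at on the networking side, as a sketch (the overlay network name below is an assumption based on the stack name):

# Swarm overlay data traffic rides UDP 4789 (vxlan) and gossip uses 7946;
# a short capture shows whether that traffic is flowing between nodes
sudo tcpdump -ni any 'udp port 4789 or port 7946' -c 20

# Inspect the overlay network state as seen from this node
docker network inspect vulcan_default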

@corps
Contributor

corps commented Aug 1, 2022

Sorry, my 'strategy' on this is not a kernel upgrade (we already did that) but potentially an Ubuntu upgrade. I'm wondering if changes to the Ubuntu network software stack could help.

@coleshaw
Collaborator Author

coleshaw commented Aug 1, 2022

Hm, is an older Ubuntu stack something that you would expect to affect the other nodes, too? I haven't seen the other ones get stuck, only vulcan-node, but maybe I just haven't been paying attention to that yet.

@corps
Contributor

corps commented Aug 2, 2022

OK, so the journalctl -xef output is very noisy and has LOTS of warnings, but I want to be clear: these are actually... false positives.

Messages like this

Aug 01 17:10:48 vulcan-node dockerd[11915]: time="2022-08-01T17:10:48.755269757-07:00" level=warning msg="Entry was not in db: nid:i9aqii3hz3at8i2elnukrqjei eid:644f8df8dfd39e32c51fe70ff4f0c008f2df0196894c92e5127896d573ce89b1 peerIP:10.0.13.79 peerMac:02:42:0a:00:0d:4f isLocal:false vtep:192.168.2.159"
Aug 01 17:10:48 vulcan-node dockerd[11915]: time="2022-08-01T17:10:48.755366908-07:00" level=warning msg="Peer operation failed:could not delete fdb entry for nid:i9aqii3hz3at8i2elnukrqjei eid:644f8df8dfd39e32c51fe70ff4f0c008f2df0196894c92e5127896d573ce89b1 into the sandbox:Search neighbor failed for IP 192.168.2.159, mac 02:42:0a:00:0d:4f, present in db:false op:&{2 i9aqii3hz3at8i2elnukrqjei 644f8df8dfd39e32c51fe70ff4f0c008f2df0196894c92e5127896d573ce89b1 [0 0 0 0 0 0 0 0 0 0 255 255 10 0 13 79] [255 255 255 0] [2 66 10 0 13 79] [0 0 0 0 0 0 0 0 0 0 255 255 192 168 2 159] false false false EventNotify}"

are safe. The peer operations are simply failing on the vulcan default network because of the archimedes containers, which come up and go down every 10 seconds; they are there mostly to keep local copies of the archimedes image (since archimedes isn't being run as a true service atm). I can confirm that stopping vulcan causes these messages to go away and has no impact on the overall service.
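For what it's worth, that churn is easy to watch from the engine's event stream; something like this shows the archimedes containers starting and dying roughly every 10 seconds:

# Stream container start/die events (filters on the same key OR together)
docker events --filter type=container --filter event=start --filter event=die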

Things related to udevd, like

Aug 01 17:17:07 vulcan-node networkd-dispatcher[962]: WARNING:Unknown index 3825617 seen, reloading interface list
Aug 01 17:18:26 vulcan-node systemd-udevd[17056]: link_config: could not get ethtool features for veth9a2e5af
Aug 01 17:18:26 vulcan-node systemd-udevd[17056]: Could not set offload features of veth9a2e5af: No such device

or things related to the veth devices (such as the persistent MAC address warnings) are also... just noisy warnings, stemming from changes in the modern Linux kernel not being super compatible with vxlan objects (virtual extensible LANs) under default Ubuntu settings. In reality, the network behavior of these bridges works fine.
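If we ever want to quiet the persistent-MAC/udev noise on the veth devices, one commonly suggested workaround (just a sketch, not something we've applied; the file name is hypothetical) is a systemd .link override so the default MACAddressPolicy leaves veth interfaces alone:

sudo tee /etc/systemd/network/90-veth-keep-mac.link <<'EOF'
[Match]
Driver=veth

[Link]
MACAddressPolicy=none
EOF

Mostly noting it so it's written down; since the warnings are harmless, it may not be worth touching.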

So I'm narrowing down on what's really happening bit by bit.

@corps
Contributor

corps commented Aug 2, 2022

I'm also turning off our backup jobs. I am curious if somehow those jobs are putting undue pressure on the system, but I would be surprised if so.
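If we want to see whether the backup window lines up with I/O or memory pressure, a low-tech sketch (iostat/sar need the sysstat package; the 02:00-04:00 window is an assumption):

# Live view of per-device I/O utilization and memory/swap activity
iostat -xz 5
vmstat 5

# Or, if sysstat history is being collected, look back at the backup window
sar -u -d -s 02:00:00 -e 04:00:00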

@coleshaw
Collaborator Author

coleshaw commented Aug 2, 2022

I see, those warning messages really are a lot. Good to know they don't impact the service, though!

I did see some sporadic Error messages -- do you think those are related?

@corps
Contributor

corps commented Aug 2, 2022

I'm still definitely digging through various errors and trying things. Is there one in particular that caught your eye?
