Vulcan-node instability? #976

Closed
coleshaw opened this issue Aug 1, 2022 · 9 comments

Comments

@coleshaw
Collaborator

coleshaw commented Aug 1, 2022

Just documenting notes here, not sure anything is very conclusive or helpful:

I've had to restart the Docker process on vulcan-node twice now (last Friday and again this morning) when things seemed stuck, "stuck" being the following symptoms:

  • Metis UI behaves sporadically: folder contents aren't returned, but then show up on refresh, etc.
  • Metis Puma throws 500s from the 5.0s timeout on the Postgres connection.
  • Portainer reports "agent unavailable" or "agent not found", and thus cannot manage a stack or even list the stacks.

When I check on medusa, it looks like vulcan-node is unreachable (first output in the following screenshot).

Screenshot from 2022-08-01 12-21-05

Trying to stop the portainer-agent on vulcan-node kind of hangs, while restarting the portainer-agent container on other nodes is no problem.

I wound up restarting Docker on vulcan-node with systemctl restart docker. The node is then reachable (second output in the above screenshot), Portainer works, and Metis seems accessible.
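For reference, this was the rough sequence (a sketch assuming the usual swarm setup with medusa as a manager; exact output differs):

# From the swarm manager (medusa), check whether the node is reachable
docker node ls

# On vulcan-node itself, restart the engine
sudo systemctl restart docker

# Back on the manager, confirm the node reports Ready/Active again
docker node ls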

It happened today just before standup, around 8:45am. Notably, around that time there are some Docker error logs on vulcan-node, but I'm not really sure whether they're relevant or the timing is a coincidence ... the logs are very chatty.

Screenshot from 2022-08-01 12-45-01
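Since the logs are so chatty, one way to narrow them to the window around the hang (the times here are an assumption based on the ~8:45am incident):

# Docker engine logs only, error level and worse, around the time things got stuck
journalctl -u docker.service -p err --since "2022-08-01 08:30" --until "2022-08-01 09:00"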

Grafana shows no noticeable uptick in CPU or memory usage at that time for this node; if anything there's a CPU drop (probably from when I restarted Docker).

Seems suspicious that it's been vulcan-node both times, and restarting seems to make things better.

Going to keep digging and collect notes here. Not really sure if upgrading to a newer docker engine version might help, a la this similar issue, or perhaps something else...
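If it's useful for comparing nodes, a quick sketch for dumping kernel and Docker engine versions across them (the node list is a placeholder, substitute the real hostnames):

for node in vulcan-node medusa; do
  echo "== $node =="
  ssh "$node" 'uname -sr; docker version --format "{{.Server.Version}}"'
done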

@corps
Contributor

corps commented Aug 1, 2022

Yup. My strategy moving forward on this in particular is a kernel upgrade. This has been frustrating, but I'm noticing it too.

@coleshaw
Collaborator Author

coleshaw commented Aug 1, 2022

Are all the nodes the same, or does vulcan-node, being one of the older ones, have an older kernel? I noticed its Docker engine version is only slightly behind...

@corps
Contributor

corps commented Aug 1, 2022

I upgraded all the kernels, so even vulcan-node is up to date on that

root@vulcan-node:~# uname -sr
Linux 4.15.0-176-generic

I'm still pondering what's happening; I believe it is networking-related and potentially an interaction with VMware.
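A couple of quick things to look at on the networking side, as a sketch (the overlay network name below is an assumption based on the stack name):

# Swarm overlay data traffic rides UDP 4789 (vxlan) and gossip uses 7946;
# a short capture shows whether that traffic is flowing between nodes
sudo tcpdump -ni any 'udp port 4789 or port 7946' -c 20

# Inspect the overlay network state as seen from this node
docker network inspect vulcan_default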

@corps
Contributor

corps commented Aug 1, 2022

Sorry, my 'strategy' on this is not a kernel upgrade (we already did that) but potentially an Ubuntu upgrade. I'm wondering if changes to the Ubuntu network software stack could help.

@coleshaw
Collaborator Author

coleshaw commented Aug 1, 2022

Hm, is an older Ubuntu stack something that you would expect to affect the other nodes, too? I haven't seen the other ones get stuck, only vulcan-node, but maybe I just haven't been paying attention to that yet.

@corps
Contributor

corps commented Aug 2, 2022

OK, so the journalctl -xef output is very noisy and has LOTS of warnings, but I want to be clear: these are actually... false positives.

Messages like this

Aug 01 17:10:48 vulcan-node dockerd[11915]: time="2022-08-01T17:10:48.755269757-07:00" level=warning msg="Entry was not in db: nid:i9aqii3hz3at8i2elnukrqjei eid:644f8df8dfd39e32c51fe70ff4f0c008f2df0196894c92e5127896d573ce89b1 peerIP:10.0.13.79 peerMac:02:42:0a:00:0d:4f isLocal:false vtep:192.168.2.159"
Aug 01 17:10:48 vulcan-node dockerd[11915]: time="2022-08-01T17:10:48.755366908-07:00" level=warning msg="Peer operation failed:could not delete fdb entry for nid:i9aqii3hz3at8i2elnukrqjei eid:644f8df8dfd39e32c51fe70ff4f0c008f2df0196894c92e5127896d573ce89b1 into the sandbox:Search neighbor failed for IP 192.168.2.159, mac 02:42:0a:00:0d:4f, present in db:false op:&{2 i9aqii3hz3at8i2elnukrqjei 644f8df8dfd39e32c51fe70ff4f0c008f2df0196894c92e5127896d573ce89b1 [0 0 0 0 0 0 0 0 0 0 255 255 10 0 13 79] [255 255 255 0] [2 66 10 0 13 79] [0 0 0 0 0 0 0 0 0 0 255 255 192 168 2 159] false false false EventNotify}"

are safe. The peer operations are simply failing on the vulcan default network because of the archimedes containers, which come up and go down every 10 seconds; they are there mostly to keep local copies of the archimedes image (since archimedes isn't being run as a true service atm). I can confirm that stopping vulcan causes these messages to go away and has no impact on the overall service.
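For what it's worth, that churn is easy to watch from the engine's event stream; something like this shows the archimedes containers starting and dying roughly every 10 seconds:

# Stream container start/die events (filters on the same key OR together)
docker events --filter type=container --filter event=start --filter event=die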

Things related to udevd, like

Aug 01 17:17:07 vulcan-node networkd-dispatcher[962]: WARNING:Unknown index 3825617 seen, reloading interface list
Aug 01 17:18:26 vulcan-node systemd-udevd[17056]: link_config: could not get ethtool features for veth9a2e5af
Aug 01 17:18:26 vulcan-node systemd-udevd[17056]: Could not set offload features of veth9a2e5af: No such device

or things related to the veth devices (such as the persistent MAC address warnings) are also... just noisy warnings, stemming from changes in the modern Linux kernel not being super compatible with vxlan objects (virtual extensible LANs) under default Ubuntu settings. In reality, the network behavior of these bridges works fine.
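If we ever want to quiet the persistent-MAC/udev noise on the veth devices, one commonly suggested workaround (just a sketch, not something we've applied; the file name is hypothetical) is a systemd .link override so the default MACAddressPolicy leaves veth interfaces alone:

sudo tee /etc/systemd/network/90-veth-keep-mac.link <<'EOF'
[Match]
Driver=veth

[Link]
MACAddressPolicy=none
EOF

Mostly noting it so it's written down; since the warnings are harmless, it may not be worth touching.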

So I'm narrowing down on what's really happening bit by bit.

@corps
Contributor

corps commented Aug 2, 2022

I'm also turning off our backup jobs. I am curious if somehow those jobs are putting undue pressure on the system, but I would be surprised if so.
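If we want to see whether the backup window lines up with I/O or memory pressure, a low-tech sketch (iostat/sar need the sysstat package; the 02:00-04:00 window is an assumption):

# Live view of per-device I/O utilization and memory/swap activity
iostat -xz 5
vmstat 5

# Or, if sysstat history is being collected, look back at the backup window
sar -u -d -s 02:00:00 -e 04:00:00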

@coleshaw
Collaborator Author

coleshaw commented Aug 2, 2022

I see, those warning messages really are a lot. Good to know they don't impact the service, though!

I did see some sporadic Error messages -- do you think those are related?

@corps
Contributor

corps commented Aug 2, 2022

I'm still definitely digging through various errors and trying things. Is there one in particular that caught your eye?
