
Add resynching mechanism to endpoint_table of networkDB #47728

Open · s4ke opened this issue Apr 17, 2024 · 11 comments
Labels
area/networking/d/overlay area/networking exp/expert kind/bug

Comments


s4ke commented Apr 17, 2024

Description

In our setups, we keep running into issues with Docker DNS resolution around times when we either:

  1. restart Docker nodes in quick succession
  2. update Docker nodes (and therefore restart them in quick succession)
  3. have network issues

For a moment we thought that the MTU settings for the networking control plane might be the issue, but it seems to have happened even with an MTU that fits our setup (we used 1350 instead of the default 1500).

It seems that during these times the gossip network of networkdb does not get synced up properly. We debugged this with the built-in debugging tooling of libnetwork (which is really helpful, btw) and found that there is no mechanism in dockerd that concerns itself with resyncing the endpoint_table of networkdb (or overlay_peer_table for that matter, but that does not seem to be as much of a problem). We double-checked this by going through the code of libnetwork/agent.go:

The only places that update anything in networkdb are calls to addServiceInfoToCluster (CreateEntry), addDriverInfoToCluster (CreateEntry), deleteDriverInfoFromCluster (DeleteEntry), deleteServiceInfoFromCluster (DeleteEntry), and disableServiceInNetworkDB (UpdateEntry).

While we might have missed things, here is our proposal: there should be an (opt-in) Docker daemon config option that enables a background job in the daemon that resyncs all the DNS entries to networkdb on a schedule. Design-wise I have not thought about it a lot, but I imagine this should be fine to run every 1-5 minutes in most clusters. This way, whenever a DNS entry is out of sync, things should fix themselves within a somewhat acceptable timeframe.
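A minimal sketch in Go of what such an opt-in background job could look like; `runPeriodicResync` and `resyncFunc` are hypothetical names for illustration, not existing dockerd code:

```go
package resync

import (
	"context"
	"time"
)

// resyncFunc stands in for whatever re-publishes this node's own entries to
// networkdb's endpoint_table (hypothetically, re-running the equivalent of
// addServiceInfoToCluster for each locally running container/task).
type resyncFunc func(ctx context.Context) error

// runPeriodicResync invokes resync on the given interval (the proposal
// suggests 1-5 minutes) until ctx is cancelled. Errors are deliberately
// ignored: the next tick simply retries.
func runPeriodicResync(ctx context.Context, interval time.Duration, resync resyncFunc) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = resync(ctx)
		}
	}
}
```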

s4ke added the kind/feature and status/0-triage labels Apr 17, 2024

s4ke commented Apr 17, 2024

I also found bulkSync in libnetwork/networkdb/cluster.go, but this seems to only synchronize the "already known state". This means that if something went bad at some point, there is no way for the system to fix itself, if I read things correctly?


s4ke commented Apr 17, 2024

@robmry I have seen a lot of changes by you in libnetwork in recent weeks. I'd be interested in your gut feeling here. Would this be a bad idea?

If this is something that is worth exploring, I'd be happy to help with the implementation. This has been bugging us for a long time, and if it helps the way I envision, it would make our clusters so much more reliable.

/cc @neersighted


robmry commented Apr 18, 2024

Hi @s4ke - thank you for the debugging, description and proposal. As I've not looked at Swarm in any detail yet, I think @corhere is much better placed to take a look at this.


corhere commented Apr 18, 2024

Swarm is only tangentially involved: it informs libnetwork of the addresses of the remote peers for bootstrapping/joining the NetworkDB gossip cluster. NetworkDB and the memberlist gossip cluster are entirely within the domain of libnetwork, unfortunately.


s4ke commented Apr 18, 2024

Yeah, that is my understanding as well. But this problem only materializes in Docker Swarm environments, because networkdb does not seem to be involved in single-node setups. Or well... if it is, no data is lost because it's all on the same node anyway.

In the end it all boils down to this: how bad of an idea would it be to build some automatic resyncing mechanism for networkdb?

Also, from what I found when looking into memberlist (the underlying library for networkdb), broken networkdb state is quite likely under non-optimal network conditions without a resync mechanism.

corhere added the kind/bug, area/networking and area/networking/d/overlay labels and removed the kind/feature label Apr 18, 2024

corhere commented Apr 18, 2024

NetworkDB not converging in a reasonable amount of time is a bug. It does look like bulkSync is intended to be the resyncing mechanism you suggest, though apparently it is not working properly.

To a first approximation, NetworkDB is a distributed KV store with multiple "keyspaces" (called networks) where each node in the cluster gets to pick and choose which keyspaces to participate in. A node only holds the state for the keyspaces it participates in, and therefore can only receive state updates from other nodes also participating. I suspect (from a cursory reading of a small slice of the networkdb package source) that there is a design flaw in how this is implemented: the mapping of nodes to networks is only updated with broadcasts, with no mechanism for recovering from a missed broadcast. That mapping probably has to be reliably distributed—like the KV pairs themselves are—to all nodes in the cluster for bulk sync to reliably converge in imperfect conditions.
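To make the suspected flaw concrete, here is a toy model in Go (hypothetical names, not networkdb's actual code) of a membership map that is updated only by broadcast events and consulted, but never repaired, by bulk sync:

```go
package model

// clusterView models one node's view of which nodes participate in which
// network keyspace.
type clusterView struct {
	// networkNodes maps network ID -> set of node names believed to
	// participate in that network.
	networkNodes map[string]map[string]struct{}
}

// handleNetworkJoinBroadcast is the only place the mapping is updated in
// this model. If the gossip packet carrying the event is lost and its
// retransmit budget exhausted, nothing ever repairs the omission.
func (c *clusterView) handleNetworkJoinBroadcast(network, node string) {
	nodes, ok := c.networkNodes[network]
	if !ok {
		nodes = make(map[string]struct{})
		c.networkNodes[network] = nodes
	}
	nodes[node] = struct{}{}
}

// pickBulkSyncPeer only considers nodes already present in the (possibly
// stale) map, so state held by a peer this node never learned about is
// never pulled in, no matter how often bulk sync runs.
func (c *clusterView) pickBulkSyncPeer(network, self string) (string, bool) {
	for node := range c.networkNodes[network] {
		if node != self {
			return node, true
		}
	}
	return "", false
}
```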


s4ke commented Apr 18, 2024

Thanks for looking into it. For what it's worth, I can somewhat confirm this impression from my own reading of the code for about 2-3 hours.

Happy to help here if it's needed in any way.


s4ke commented Apr 23, 2024

Just to clarify - is this something that will likely be picked up by a subject-matter expert? Happy to help anyway, but I think implementing the full thing is out of reach for me at the moment.


corhere commented Apr 23, 2024

I will likely be the one to pick this up unless you or someone else in the community steps up before this reaches the top of my backlog. (I tagged this issue exp/expert to warn away community members looking for an easy first contribution.)

My current idea is to implement a "system" keyspace that all nodes in the cluster unconditionally participate in, using the same reliable synchronization machinery as the per-network tables, and to distribute the mapping of nodes to networks through a system table.
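For illustration only, an entry in such a system table might look roughly like this (a hypothetical sketch, not code from any existing patch):

```go
package model

// nodeNetworksEntry sketches one entry in the proposed system table. Each
// node would own an entry keyed by its name, replicated with the same
// reliable machinery as ordinary table entries, so every node eventually
// learns the full node-to-network mapping even after missed broadcasts.
type nodeNetworksEntry struct {
	Node     string   // cluster-wide node name (the entry key)
	Networks []string // IDs of the networks this node participates in
}
```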


corhere commented May 1, 2024

@s4ke are Swarm tasks affected, basic containers connected to an attachable overlay, or both? I wonder if there are any other root causes, such as race conditions, to be on the lookout for.


s4ke commented May 2, 2024

This issue mostly came up in Docker Swarm environments with services/tasks due to the nature of containers joining and leaving networks frequently.

But: I am 99% sure that we encountered this issue with standalone Docker containers attached to overlay networks as well (older setups where we didn't use anything besides overlay networks from the Docker Swarm mode feature set).
