
Add resynching mechanism to endpoint_table of networkDB #47728

Open · s4ke opened this issue Apr 17, 2024 · 11 comments
Labels
area/networking/d/overlay area/networking exp/expert kind/bug

Comments


s4ke commented Apr 17, 2024

Description

In our setups, we keep running into issues with Docker DNS resolution around times when we either:

  1. restart Docker nodes in quick succession
  2. update Docker nodes (and therefore restart them in quick succession)
  3. have network issues

For a moment we thought that the MTU settings for the networking control plane might be the issue, but it seems to have happened even with an MTU that fits our setup (we used 1350 instead of the default 1500).

It seems that during these times the gossip network of networkdb does not get synced up properly. We debugged this with the built-in debugging tooling of libnetwork (which is really helpful, btw) and found that there is no mechanism in dockerd that concerns itself with resyncing the endpoint_table of networkdb (or overlay_peer_table for that matter, but that does not seem to be as much of a problem). We double-checked this by going through the code of libnetwork/agent.go:

The only places that update anything in networkdb are calls to addServiceInfoToCluster (CreateEntry), addDriverInfoToCluster (CreateEntry), deleteDriverInfoFromCluster (DeleteEntry), deleteServiceInfoFromCluster (DeleteEntry), and disableServiceInNetworkDB (UpdateEntry).

While we might have missed things, here is our proposal: there should be an (opt-in) Docker daemon config option that enables a background job in the daemon that resyncs all the DNS entries to networkdb on a schedule. Design-wise I have not thought about it a lot, but I imagine this should be fine to run every 1-5 minutes in most clusters. This way, whenever a DNS entry is out of sync, things should fix themselves within a somewhat acceptable timeframe.
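A minimal sketch in Go of what such an opt-in background job could look like; `runPeriodicResync` and `resyncFunc` are hypothetical names for illustration, not existing dockerd code:

```go
package resync

import (
	"context"
	"time"
)

// resyncFunc stands in for whatever re-publishes this node's own entries to
// networkdb's endpoint_table (hypothetically, re-running the equivalent of
// addServiceInfoToCluster for each locally running container/task).
type resyncFunc func(ctx context.Context) error

// runPeriodicResync invokes resync on the given interval (the proposal
// suggests 1-5 minutes) until ctx is cancelled. Errors are deliberately
// ignored: the next tick simply retries.
func runPeriodicResync(ctx context.Context, interval time.Duration, resync resyncFunc) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = resync(ctx)
		}
	}
}
```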

s4ke added the kind/feature and status/0-triage labels Apr 17, 2024

s4ke commented Apr 17, 2024

I also found bulkSync in libnetwork/networkdb/cluster.go, but this seems to only synchronize the "already known state". This means that if something went bad at some point, there is no way for the system to fix itself, if I read things correctly?


s4ke commented Apr 17, 2024

@robmry I have seen a lot of changes by you in libnetwork in recent weeks. I'd be interested in your gut feeling here. Would this be a bad idea?

If this is something that is worth exploring, I'd be happy to help with the implementation. This has been bugging us for a long time, and if it helps the way I envision, it would make our clusters so much more reliable.

/cc @neersighted


robmry commented Apr 18, 2024

Hi @s4ke - thank you for the debugging, description and proposal. As I've not looked at Swarm in any detail yet, I think @corhere is much better placed to take a look at this.


corhere commented Apr 18, 2024

Swarm is only tangentially involved: it informs libnetwork of the addresses of the remote peers for bootstrapping/joining the NetworkDB gossip cluster. NetworkDB and the memberlist gossip cluster are entirely within the domain of libnetwork, unfortunately.


s4ke commented Apr 18, 2024

Yeah, that is my understanding as well. But this problem only materializes in Docker Swarm environments, because networkdb does not seem to be involved in single-node setups. Or well... if it is, no data is lost because it's all on the same node anyway.

In the end it all boils down to this: how bad of an idea would it be to build some automatic resyncing mechanism for networkdb?

Also, from what I found when looking into memberlist (the underlying library for networkdb), broken networkdb state is quite likely under non-optimal network conditions without a resync mechanism.

corhere added the kind/bug, area/networking and area/networking/d/overlay labels and removed the kind/feature label Apr 18, 2024

corhere commented Apr 18, 2024

NetworkDB not converging in a reasonable amount of time is a bug. It does look like bulkSync is intended to be the resyncing mechanism you suggest, though apparently it is not working properly.

To a first approximation, NetworkDB is a distributed KV store with multiple "keyspaces" (called networks) where each node in the cluster gets to pick and choose which keyspaces to participate in. A node only holds the state for the keyspaces it participates in, and therefore can only receive state updates from other nodes also participating. I suspect (from a cursory reading of a small slice of the networkdb package source) that there is a design flaw in how this is implemented: the mapping of nodes to networks is only updated with broadcasts, with no mechanism for recovering from a missed broadcast. That mapping probably has to be reliably distributed—like the KV pairs themselves are—to all nodes in the cluster for bulk sync to reliably converge in imperfect conditions.
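To make the suspected flaw concrete, here is a toy model in Go (hypothetical names, not networkdb's actual code) of a membership map that is updated only by broadcast events and consulted, but never repaired, by bulk sync:

```go
package model

// clusterView models one node's view of which nodes participate in which
// network keyspace.
type clusterView struct {
	// networkNodes maps network ID -> set of node names believed to
	// participate in that network.
	networkNodes map[string]map[string]struct{}
}

// handleNetworkJoinBroadcast is the only place the mapping is updated in
// this model. If the gossip packet carrying the event is lost and its
// retransmit budget exhausted, nothing ever repairs the omission.
func (c *clusterView) handleNetworkJoinBroadcast(network, node string) {
	nodes, ok := c.networkNodes[network]
	if !ok {
		nodes = make(map[string]struct{})
		c.networkNodes[network] = nodes
	}
	nodes[node] = struct{}{}
}

// pickBulkSyncPeer only considers nodes already present in the (possibly
// stale) map, so state held by a peer this node never learned about is
// never pulled in, no matter how often bulk sync runs.
func (c *clusterView) pickBulkSyncPeer(network, self string) (string, bool) {
	for node := range c.networkNodes[network] {
		if node != self {
			return node, true
		}
	}
	return "", false
}
```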


s4ke commented Apr 18, 2024

Thanks for looking into it. For what it's worth, I can somewhat confirm this impression from my own reading of the code for about 2-3 hours.

Happy to help here if it's needed in any way.


s4ke commented Apr 23, 2024

Just to clarify - is this something that will likely be picked up by a subject-matter expert? Happy to help anyway, but I think implementing the full thing is out of reach for me at the moment.


corhere commented Apr 23, 2024

I will likely be the one to pick this up unless you or someone else in the community steps up before this reaches the top of my backlog. (I tagged this issue exp/expert to warn away community members looking for an easy first contribution.)

My current idea is to implement a "system" keyspace that all nodes in the cluster unconditionally participate in, using the same reliable synchronization machinery as the per-network tables, and to distribute the mapping of nodes to networks through a system table.
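For illustration only, an entry in such a system table might look roughly like this (a hypothetical sketch, not code from any existing patch):

```go
package model

// nodeNetworksEntry sketches one entry in the proposed system table. Each
// node would own an entry keyed by its name, replicated with the same
// reliable machinery as ordinary table entries, so every node eventually
// learns the full node-to-network mapping even after missed broadcasts.
type nodeNetworksEntry struct {
	Node     string   // cluster-wide node name (the entry key)
	Networks []string // IDs of the networks this node participates in
}
```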


corhere commented May 1, 2024

@s4ke are Swarm tasks affected, basic containers connected to an attachable overlay, or both? I wonder if there are any other root causes, such as race conditions, to be on the lookout for.


s4ke commented May 2, 2024

This issue mostly came up in Docker Swarm environments with services/tasks due to the nature of containers joining and leaving networks frequently.

But: I am 99% sure that we encountered this issue with standalone Docker containers attached to overlay networks as well (older setups where we didn't use anything besides overlay networks from the Docker Swarm mode feature set).
