dockerd fails to start if there are routes for all private networks although a custom bip is configured #33925
Comments
Hello,

Today a colleague asked me whether I had changed something on the network, because his Docker setup was suddenly giving him a lot of problems. At first I did not know what he was talking about, but after a few questions it became clear that he could not start his Docker environment while his VPN connection to the office was up. I looked at the error message he received and saw the following: "Error initializing network controller: list bridge addresses failed: no available network". This was very strange, because the network he had configured in his daemon.json looked like this:

In our corporate network we have a lot of RFC1918 networks: a few in the 10.0.0.0/8 range and a lot in the 172.16.0.0/12 and 192.168.0.0/16 ranges. But nothing collides with the ranges above, and even if something did collide, it was all local to his workstation, where he was developing and testing some monitoring systems. He is completely free to use whatever network he wants locally, as long as he doesn't interfere with the corporate network.

On the VPN router I have a default set of routes for the RFC1918 networks pointing towards the corporate routers, so everyone can reach the internal corporate networks without having to worry about anything. The firewalls take care of the rest.

I started debugging the error message and did some Google searches, and I found a lot of people complaining about exactly this problem. Some example tickets:

At first the error didn't make any sense to me because:
But then I thought of something. What if the Docker code that searches for a free network takes the local routing table and checks the configured network against EVERY route in it, and gives this error as soon as the network matches or overlaps any route? At first I thought this couldn't be true, because the check would then always fail: a default route of 0.0.0.0/0 always matches. But what if the default route is filtered out in the code for exactly this reason? Then the hypothesis could be true. I started testing locally on my own system. First I reproduced the error:
The resulting routing table:

Then I started my VPN. The result was 3 extra routes:

I then stopped my Docker daemon and tried to start it again, and indeed I received the same error. So I could reproduce the problem. Now for my hypothesis: "Does the code check EVERY route in the routing table, filtering out the default route?" To test this I did the following:

My routing table then looked like this:

The only difference between this state and the clean state of my system is that there is no default route, but two routes that together cover the same address space as the default route. Now I tried to start the Docker daemon again. If the daemon started fine, my hypothesis would be wrong and I would have to continue my search. If the daemon failed, my hypothesis had to be correct, because the default route is the only difference in my local configuration. And indeed, I received the same error again. Now I'm sure there is absolutely no reason to give this error because:
This also proves my hypothesis that every route in the routing table is checked against the configured network, with the default route filtered out. If any route overlaps the configured network, the configuration is rejected. This is a bug in the Docker code. The code should be changed to match only routes with "scope link", because these routes are directly connected and would be a real problem when you start a Docker daemon with an overlapping network configuration. Any route that is not "scope link" should be ignored, because those routes could be:
There is one corner case where you could give a warning, or maybe an error: when there is an equal or more specific route that is not "scope link", because this could result in routing issues towards other systems. But even then I would make it configurable, because it could very well be intentional.

I'm not a developer but a network and systems engineer, so I am not able at the moment to provide a patch for this problem, but one of my colleagues thought that he had already found some parts of the code. So maybe ........

The version I have tested this with is: Docker version 19.03.13, build 4484c46d9d

Cheers,
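The hypothesis tested above can be sketched as follows. This is a minimal illustration in Python, not Docker's actual Go code. The function name `conflicts`, the bridge address `192.168.200.1/24`, and all three routing tables are hypothetical stand-ins, since the original configuration and command output are not shown in this thread. The split default route is assumed to be the pair `0.0.0.0/1` and `128.0.0.0/1`.

```python
import ipaddress

def conflicts(bip, routes):
    """Hypothesised behaviour: reject `bip` if it overlaps ANY route
    destination, with the default route skipped."""
    net = ipaddress.ip_network(bip, strict=False)  # 192.168.200.1/24 -> 192.168.200.0/24
    for dest in routes:
        if dest == "default":
            continue  # assumed to be ignored, otherwise every network would conflict
        if net.overlaps(ipaddress.ip_network(dest)):
            return True
    return False

BIP = "192.168.200.1/24"  # hypothetical custom bridge address

# Hypothetical routing tables (destinations only).
CLEAN = ["default", "192.168.1.0/24"]
WITH_VPN = CLEAN + ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]
SPLIT_DEFAULT = ["0.0.0.0/1", "128.0.0.0/1", "192.168.1.0/24"]

print(conflicts(BIP, CLEAN))          # False
print(conflicts(BIP, WITH_VPN))       # True
print(conflicts(BIP, SPLIT_DEFAULT))  # True
```

Under these assumptions the sketch behaves like the experiment: the clean table is accepted, the RFC1918 routes from the VPN cause a rejection, and so do the two /1 routes, even though together they route exactly the same traffic as the default route.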
One of my colleagues has done some more digging. He thinks the reason the code works at all at the moment is that the code going through the routing table cannot interpret the default route entry, because it is not an IP address/netmask. The result is a nil entry in the routing table object, or whatever it is called in Go, and this in turn means that the default route does not interfere with the network selection. If the default route had been returned as 0.0.0.0/0, for example, the code would have failed from the beginning. He also found that it is probably very simple to put a filter in the code that iterates through the routing table, to make sure that it only takes routes with link scope. If you do this, a lot of people will be a lot happier.
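The filter proposed in this thread, together with the corner case of an equal or more specific non-link route, could look roughly like this. It is a sketch in Python, not a patch against Docker's Go code. The `Route` type, both function names, and the example routes are hypothetical. The scope values follow the naming used by `ip route`.

```python
import ipaddress
from typing import NamedTuple

class Route(NamedTuple):
    dest: str   # CIDR destination, or "default"
    scope: str  # "link" for directly connected routes, otherwise e.g. "global"

def conflicts_link_only(bip, routes):
    """Proposed behaviour: only directly connected (scope link) routes count."""
    net = ipaddress.ip_network(bip, strict=False)
    for r in routes:
        if r.dest == "default" or r.scope != "link":
            continue
        if net.overlaps(ipaddress.ip_network(r.dest)):
            return True
    return False

def shadowing_routes(bip, routes):
    """Corner case: non-link routes equal to or more specific than `bip`,
    which might deserve a (configurable) warning."""
    net = ipaddress.ip_network(bip, strict=False)
    return [
        r for r in routes
        if r.dest != "default"
        and r.scope != "link"
        and ipaddress.ip_network(r.dest).subnet_of(net)
    ]

ROUTES = [
    Route("default", "global"),
    Route("192.168.0.0/16", "global"),  # hypothetical VPN route
    Route("192.168.1.0/24", "link"),    # hypothetical local LAN
]

print(conflicts_link_only("192.168.200.1/24", ROUTES))  # False
print(conflicts_link_only("192.168.1.1/24", ROUTES))    # True
print(shadowing_routes("192.168.200.1/24",
                       ROUTES + [Route("192.168.200.128/25", "global")]))
```

With such a filter the broad VPN route no longer blocks the bridge network, a directly connected network that overlaps is still rejected, and a more specific static route is reported separately rather than treated as a hard failure.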
@GordonTheTurtle still an issue with more recent versions. Please add tag(s) to match.

$ docker --version
Docker version 19.03.13, build 4484c46

also with

$ docker --version
Docker version 20.10.0-rc1, build 5cc2396
The check seems to have been introduced in b0d046a. That commit appears to be a refactoring, and there is no mention of such a change. Maybe the check should not be done at all. It is not robust, as a "conflicting" network can appear after Docker has started, which makes debugging all the more difficult (sometimes it works, sometimes not). It seems difficult to be smart about that. I would suggest that Docker stays dumb and just uses the expected network. Less code, fewer problems.

The PR from @deepy would also fix my use case, but it relies on the coincidence that this check would solve most issues, while whether a route is link-scoped has no direct relation to the importance of not shadowing it. A static route, like
It appears Docker is currently unable to work with a full-tunnel OpenVPN setup because of the routes, as mentioned here several times. Shouldn't this be seriously worked on? The workaround is to get your admin to switch to split tunneling.
Description
All our hosts have configured routes for the following networks:
So the default networks that `dockerd` tries to use for the `docker0` bridge are already in use. Therefore I specify a custom `bip` which docker should use instead.

Steps to reproduce the issue:
`dockerd` with a `bip` which isn't part of one of the networks

Describe the results you received:
Relevant part of the logs:
If I configure the `docker0` bridge manually, `dockerd` starts successfully.

Describe the results you expected:
`dockerd` shouldn't check for available networks if a custom `bip` is specified.

Additional information you deem important (e.g. issue happens only occasionally):
Output of `docker version`:

Output of `docker info`: