
Proposal: Native Docker Multi-Host Networking #8951

Closed
nerdalert opened this issue Nov 4, 2014 · 145 comments

Labels: area/networking, kind/feature

Comments

@nerdalert
Contributor

Native Docker Multi-Host Networking

TL;DR Practical SDN for Docker

Authors: @dave-tucker, @mavenugo and @nerdalert.

Background

Application virtualization will have a significant impact on the future of data center networks. Compute virtualization has driven the edge of the network into the server, and more specifically into the virtual switch. The compute workload efficiencies derived from Docker containers will dramatically increase the density of network requirements in the server. Scaling this density will require reliable network fundamentals, while also ensuring the developer has as much or as little interaction with the network as desired.

A tightly coupled, native integration with Docker will ensure there is base functionality capable of integrating into the vast majority of data center network architectures today, and will help reduce the barriers to Docker adoption. Just as important for the diverse user base is making Docker networking dead simple for the user to integrate, provision and troubleshoot.

The first step is a native Docker networking solution that can handle multi-host environments, scales to production requirements, and works well with existing network deployments and operations.

Problem Statement

Though there are a few existing multi-host networking solutions, they are currently designed as over-the-top solutions layered on Docker that:

  1. Address a specific use case
  2. Address a specific orchestration system deployment
  3. Do not scale to production requirements
  4. Do not work well with existing production networks and operations.

The core of this proposal is to make multi-host networking a native part of Docker that handles most use cases, scales, and works well with existing production networks and operations. With this provided as a native Docker solution, every orchestration system can enjoy the same benefits.

There are three ways to approach multi-host networking in Docker:

  1. NAT-based: hide the containers behind the Docker host IP address. Job done.
  2. IP-based: each container has its own unique IP address.
  3. Hybrid: a mix of the above.

NAT-based

The first option (NAT-based) works by hiding the containers behind a Docker host IP address. The TCP port exposed by a given Docker container is mapped to a unique port on the host machine.

Since the mapped host port has to be unique, containers using well-known port numbers are therefore forced to use ephemeral ports. This adds complexity in network operations, network visibility, troubleshooting and deployment.

For example, consider the configuration of a front-end load-balancer for a DNS service hosted in a Docker cluster:

Service Address:

  • 1.2.3.4:53

Servers:

  • 10.1.10.1:65321
  • 10.36.45.2:64123
  • 10.44.3.1:54219

If you have firewalls or IDS/IPS devices behind the load-balancer, these also need to know that the DNS service is hosted on these addresses and port numbers.
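
As a rough sketch of how this plays out with the current port-mapping model (the image name here is just a placeholder):

  # Publish the container's well-known port on an ephemeral host port
  docker run -d -P --name dns1 example/dnsd

  # Ask Docker which host port was chosen
  docker port dns1 53
  # e.g. 0.0.0.0:49153 - a different ephemeral port on every host and restart

  # Or pin the mapping explicitly, at the cost of managing uniqueness yourself
  docker run -d -p 10.1.10.1:65321:53/udp example/dnsd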

IP-based

The second option (IP-based) works by assigning unique IP addresses to the containers, thus avoiding the need for port mapping and solving issues with downstream load-balancers and firewalls by using well-known ports in pre-determined subnets.
However, this exposes a different set of issues.

  • Reachability: which containers are on which host?
    • GCE uses a /24 per host for this reason, but solutions outside of GCE will require an overlay network like Flannel (see the per-host-subnet sketch after this list)
    • Even a GCE-style architecture will make firewall management difficult
  • Flexible Addressing / IP Address Management (IPAM)
    • Who assigns IP addresses to containers?
      • Static? A flag in docker run?
      • DHCP/IPAM? A proper DHCP server or IPAM solution?
      • Docker? A local DHCP solution using Docker?
      • Orchestration System? Via docker run or another API?
  • Deployability and migration concerns
    • Some clouds do not play well with routers (like EC2)
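
As referenced in the reachability bullet above, a minimal sketch of the per-host-subnet approach, assuming two hosts that reach each other over 192.168.0.0/24 (addresses and flag values are illustrative only):

  # Host A: give docker0 a dedicated container subnet
  docker -d --bip=10.1.1.1/24

  # Host B: a different, non-overlapping subnet
  docker -d --bip=10.1.2.1/24

  # Each host then needs a route to the other host's container subnet,
  # e.g. on Host A, where 192.168.0.12 is Host B:
  ip route add 10.1.2.0/24 via 192.168.0.12

This answers "which containers are on which host" by encoding the host in the subnet, at the cost of burning an address block per host.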

Proposal

We are proposing a native multi-host networking solution for Docker that handles various production-grade deployment scenarios and use cases.

The power of Docker is its simplicity, yet it scales to the demands of hyper-scale deployments. The same cannot be said today for Docker's native networking. This proposal aims to bridge that gap. The intent is to implement a production-ready, reliable multi-host networking solution that is native to Docker while remaining laser-focused on the user-friendly needs of the developer environment that is at the heart of the Docker transformation.

The new edge of the network is the vSwitch. The virtual port density that application virtualization will drive is an even larger multiplier than the explosion of virtual ports created by OS virtualization. This will create port density far beyond anything to date. In order to scale, the network cannot be seen as merely the existing 2-tier spine/leaf physical architecture; it must also incorporate the virtual edge. Having Docker natively incorporate clear, scalable architectures will avoid the all too common problem of the network blocking innovation.

Solution Components

1. Programmable vSwitch

To implement this solution we require a programmable vSwitch.
This will allow us to configure the necessary bridges, ports and tunnels to support a wide range of networking use cases.

Our initial focus will be to develop an API to implement the primitives required of the vSwitch for multi-host networking with a focus on delivering an implementation for Open vSwitch first.

This link, WHY-OVS, covers the rationale for choosing OVS and why it is important to the Docker ecosystem and virtual networking as a whole. Open vSwitch has a mature kernel data plane (upstream since 3.7) with a rich set of features that addresses the requirements of multi-host networking. In addition to the data-plane performance and functionality, Open vSwitch also has an integrated management plane called OVSDB that abstracts the switch as a database for applications to make use of.
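
For concreteness, the primitives such an API needs map onto operations like the following ovs-vsctl calls (bridge, port and peer names are illustrative only):

  # Create an integration bridge for container ports
  ovs-vsctl add-br br-int

  # Attach a container-facing port (an OVS internal port that is later
  # moved into the container's network namespace)
  ovs-vsctl add-port br-int c1-eth0 -- set interface c1-eth0 type=internal

  # Add a VXLAN tunnel port towards a peer Docker host
  ovs-vsctl add-port br-int vxlan0 -- \
    set interface vxlan0 type=vxlan options:remote_ip=192.168.0.12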

With this proposal the native implementation in Docker will:

  • Provide an API for implementing Multi-Host Networking
  • Provide an implementation for an Open vSwitch datapath
  • Implement a native control plane to address the scenarios mentioned in this proposal.

2. Network Integration

The scenarios that we will deal with in this proposal range from the existing port-mapping solution, to VXLAN-based overlays, to native underlay network integration. There are real deployment scenarios for each of these use cases.

We will facilitate the common application HA scenario of a service needing a 1:1 NAT mapping between the container's back-end IP address and a front-end IP address from a routable address pool. Alternatively, the containers can also be reachable globally, depending on the user's IP addressing strategy.
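
A minimal sketch of that 1:1 NAT case with plain iptables, assuming 203.0.113.10 is the routable front-end address and 10.1.1.5 the container's back-end address (whichever backend implements it, the mapping reduces to this):

  # Inbound: map the front-end address to the container
  iptables -t nat -A PREROUTING -d 203.0.113.10 -j DNAT --to-destination 10.1.1.5

  # Outbound: present the container as the front-end address
  iptables -t nat -A POSTROUTING -s 10.1.1.5 -j SNAT --to-source 203.0.113.10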

3. Flexible Addressing / IP Address Management (IPAM)

In a multi-host environment, IP addressing strategy becomes crucial. Some of the use cases, as we will see, will also require a reasonable IPAM solution in place. This discussion will also lead to the production-grade scale requirements of Layer 2 vs. Layer 3 networks.

4. Host Discovery
Though it is obvious, it is important to mention the host discovery requirement that is inherent in any multi-host solution. We believe that such a host/service discovery mechanism is a generic requirement that is not specific to multi-host networking needs, and as such we are backing the Docker Clustering proposal for this purpose.

5. Multi-Tenancy
Another important consideration is to provide the architectural white space for multi-tenancy solutions, which may be introduced either natively in Docker or by external orchestration systems.

Single Host Network Deployment Scenarios

  • Parity with existing Docker Single-Host solution

This is the native single-host Docker networking model as it exists today. It is the most basic scenario that the solution we are proposing must address seamlessly. This scenario brings the basic Open vSwitch integration into Docker, which we can build on for the multi-host scenarios that follow.

Figure - 1

  • Addition of Flexible Addressing

This scenario adds a flexible addressing scheme to the basic single-host use case, where we can provide IP addressing from one of many different sources.

Figure - 2

Multi Host Network Deployment Scenarios

The following scenarios enable backend Docker containers to communicate with one another across multiple hosts. This fulfills the need for highly available applications to survive the failure of a single node.

  • Overlay Tunnels (VXLAN, GRE, Geneve, etc.)

For environments which need to abstract the physical network, overlay networks create a virtual datapath using supported tunneling encapsulations (VXLAN, GRE, etc.). It is just as important for these networks to be as reliable and consistent as the underlying network. Our experience leads us towards using a proven consistency protocol, such as tenant-aware BGP, in order to achieve the worry-free environment developers and operators desire. This also presents an evolvable architecture if a tighter coupling to the native network is of value in the future.

The overlay datapath is provisioned between tunnel endpoints residing in the Docker hosts, which gives the appearance of all hosts within a given provider segment being directly connected to one another, as depicted in Figure 3.

Figure - 3

As a new container comes online, its prefix is announced in the routing protocol, advertising its location via a tunnel endpoint. As the other Docker hosts receive the updates, a forwarding entry pointing to that tunnel endpoint is installed into OVS. When the container is deprovisioned, a similar process occurs and the other Docker hosts remove the forwarding entry for the deprovisioned container.
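
A hand-built approximation of that datapath, assuming an OVS bridge br-int on each host and peers at 192.168.0.11 and 192.168.0.12; in the proposal this plumbing is driven by the control plane rather than typed by hand:

  # On host A (192.168.0.11): tunnel endpoint towards host B
  ovs-vsctl add-port br-int vxlan-hostB -- \
    set interface vxlan-hostB type=vxlan options:remote_ip=192.168.0.12 options:key=flow

  # On host B (192.168.0.12): the reverse endpoint
  ovs-vsctl add-port br-int vxlan-hostA -- \
    set interface vxlan-hostA type=vxlan options:remote_ip=192.168.0.11 options:key=flow

  # When a routing update announces container 10.1.2.5 behind host B, host A
  # installs a forwarding entry steering that prefix into the tunnel
  ovs-ofctl add-flow br-int \
    "ip,nw_dst=10.1.2.5,actions=set_tunnel:42,output:$(ovs-vsctl get interface vxlan-hostB ofport)"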

  • Underlay Network integration

The backend can also simply be bridged into a network's broadcast domain, relying on the upstream network to provide reachability. Traditional L2 bridging has significant scaling issues, but it is still very common in data centers with flat VLAN architectures that facilitate live workload migration of VMs.

This model is fairly critical for DC architectures that require a tight coupling of network and compute, as opposed to the ships-in-the-night design of overlays abstracting the physical network.

The underlay network integration can be designed with a specific network architecture in mind; hence we see models like Google Compute Engine, where every host is assigned a dedicated subnet and each pod gets an IP address from that subnet.

Figure - 4 - One dedicated static subnet per host

The entire backend container space can be advertised into the underlying network for IP reachability. IPv6 is becoming attractive for many in this scenario due to v4 constraints.

Extending L3 to the true edge of the network in the vSwitch enables proven network scale while still retaining the ability to perform disaggregated network services at the edge. Extending gateway protocols to the host will play a significant role in scaling a tight coupling to the network architecture.

Alternatively, underlay integration can also provide flexible addressing combined with /32 host updates to the network in order to provide subnet flexibility.
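
A minimal sketch of the /32 host-update idea, assuming container 10.1.1.5 is attached locally via port c1-eth0 and a routing daemon on the host (BIRD, Quagga, etc.) redistributes kernel routes upstream:

  # Host route for the locally attached container
  ip route add 10.1.1.5/32 dev c1-eth0

  # A BGP/OSPF speaker configured to redistribute kernel or connected routes
  # then announces this /32 to the underlay, so the container stays reachable
  # regardless of which subnet its host happens to sit in.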

Figure - 5

Summary

Implementing the above provides flexible, scalable multi-host networking as a native part of Docker. This implementation adds a strong networking foundation intended to provide an evolvable network architecture for the future.

@thockin
Contributor

thockin commented Nov 4, 2014

This sounds good. What I am not seeing is the API and performance. How does one go about setting this up? How much does it hurt performance?

One of the things we are trying to do in GCE is drive container network perf -> native. veth is awful from a perf perspective. We're working on networking (what you call underlay) without veth and a vbridge at all.

@shykes
Contributor

shykes commented Nov 4, 2014

I like the idea of underlay networking in Docker. The first question is: how much can be bundled by default? Does an ovs+vxlan solution make sense as a default, in replacement of veth + regular bridge? Or should they be reserved for opt-in plugins?

@thockin do you have opinions on the best system mechanism to use?

@thockin
Contributor

thockin commented Nov 4, 2014

What exactly do you mean by "system mechanism" ?

@shykes
Contributor

shykes commented Nov 4, 2014

vxlan vs pcap/userland encapsulation vs nat with netfilter vs veth/bridge vs macvlan... use ovs by default vs. keep it out of the core.. Things like that.

@thockin
Contributor

thockin commented Nov 4, 2014

Ah. My experience is somewhat limited.

Google has made good use of OVS internally.

veth pair performance is awful and unlikely to get better.

I have not played with macvlan, but I understand it is ~wire speed, though a bit awkward to use.

We have a patch cooking that fills the need for macvlan-like perf without actually being VLAN (more like old-skool eth0:0 aliases).

If we're going to pick a default, I don't think OVS is the worst choice - it can't be worse perf than veth. But it's maybe more dependency heavy? Not sure.

@mavenugo
Contributor

mavenugo commented Nov 5, 2014

@thockin @shykes Thanks for the comments.
Agreed on the veth performance issues. Our proposal is to use OVS ports.
The companion proposal : #8952 covers details on how we are planning to use OVS.
(Please refer to the Open vSwitch Backend section of #8952 which covers performance details of veth vs OVS port).

OVS provides the flexibility of using VXLAN for overlay deployments or native network integration for underlay deployments without sacrificing performance or scale.

I haven't done much work with macvlan to give an answer on how it stacks up to an overall solution that includes functionality, manageability, performance, scale and network operations.

We believe that Native Docker networking solution should be flexible enough to accommodate L2, L3 and Overlay network architectures.

@jainvipin

Hi Madhu, Dave and Team:

Definitely a wholesome view of the problem. Thanks for putting it out there. Few questions and comments (on both proposals [0] and [1], as they tie into each other quite a bit):

Comments and Questions on proposal on Native-Docker Multi-Host Networking:

[a] OVS Integration: The proposal to natively instantiate OVS from Docker is good.

  • Versioning and dependency between the networking component and the compute part of Docker: assuming that the driver APIs (proposed in [1]) will change and be refined as we go, an obvious implication of such an implementation inside Docker is that the Docker version that implements those APIs would be tied to the user of the APIs (aka the orchestrator), and all must be compatible and upgraded together.
  • Providing native data-path integration: if native integration of OVSDB API calls is done via Docker, wouldn't it be inefficient (an extra hop) to make these API calls via Docker?
  • Datapath OF integration: OVS also provides a complete OF datapath using a controller (ODL, for example). Are you proposing that for a use case that requires OF API calls, the API calls are also made through docker (native integration)? Assuming not, if the datapath programming to the switch is done from outside docker, then why keep part of the OVS bridge manipulation inside docker (via the driver) and a part outside? It would seem that doing the network operations completely outside in an orchestration entity would be a good choice, provided a simple basic mechanism like [2] exists to allow the outside systems to attach network namespaces during container creation.
  • Provide API for implementation for Multi-Host Networking:
    Question: Can you please clarify if the APIs proposed here are eventually consumed by the driver calls defined in [1]? Assuming yes, to keep docker-interface transparent to plugin-specific content of these APIs, what is the proposed method? Say, a plugin-specific parsable-network-configuration for each of the proposed API calls in [1].
  • Provide native control plane:
    Question: Can you please elaborate the intention of this integration. Is this to allow inserting a control plane entity (aka router or routing layer, as illustrated in Figure 4 forming routing adjacency)? If so, does the entity sit inside or outside docker? The confusion comes from the bullet in section 1 “o Implement native control plane to address the scenarios mentioned in this proposal.”

[b]
+1 on the flexibility being talked about is good (single host, vs. overlays to native underlay integration). I am wondering if there is anything specific being proposed here or something that naturally comes from the OVS integration?

[c]
+1 on the flexibility on IPAM (use of perhaps DHCP for certain containers vs. auto-configured for the rest, mostly useful in multi-tenant scenarios). I am wondering if there is anything specific being proposed here or something that naturally comes from the OVS integration?

[e]
Multi-tenancy is an important consideration indeed; associating a profile as in [1], which specifies arbitrary parsed network configuration, seems sufficient to provide a tenant context.

[f]
Regarding DNS/DDNS updates (exposing) for the host: assuming this is done outside (by the orchestrator), then part of the networking is done outside Docker and part inside (the rest of the native Docker integration proposed here).

Comments and Questions on proposal on ‘Network Drivers’:

[g] Multiple vNICs inside a container: do the APIs proposed here (CreatePort) handle creation of multiple vNICs inside a container?

[h] Update to network configuration: say a bridge is added with a VXLAN VNID or a VLAN; would your suggestion be to call 'InitBridge', or would this be done during PortCreate() if the VLAN/tunnel/other parameters needed for port creation do not exist?

[j] Driver API performance/scale requirements: It would be good to state an upfront design target for scale/performance.

As always, will be happy to collaborate on this with you and other developers.

Cheers,
--Vipin

[0] #8951
[1] #8952
[2] #8216

@dave-tucker
Contributor

@thockin on the macvlan performance, are there any published figures?
@shykes @mavenugo i've done a very rough and ready comparison and so far OVS seems to be leading the pack in my scenario, which is iperf between two netns on the same host.
See code and environment here
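
For anyone wanting to reproduce this kind of comparison, a rough sketch of the netns-to-netns iperf setup (veth variant shown; the OVS variant swaps the veth pair for two OVS internal ports on the same bridge):

  # Two namespaces joined by a veth pair
  ip netns add ns1; ip netns add ns2
  ip link add v1 type veth peer name v2
  ip link set v1 netns ns1; ip link set v2 netns ns2
  ip netns exec ns1 ip addr add 10.200.0.1/24 dev v1
  ip netns exec ns2 ip addr add 10.200.0.2/24 dev v2
  ip netns exec ns1 ip link set v1 up
  ip netns exec ns2 ip link set v2 up

  # Measure throughput between the namespaces
  ip netns exec ns1 iperf -s &
  ip netns exec ns2 iperf -c 10.200.0.1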

from an underlay integration standpoint, I'd imagine that having a bridge would be much easier to manage as you could trunk all vlans to the vswitch and place the container port in the appropriate vlan.... otherwise with a load of mac addresses loose on your underlay you'd need to configure your underlay edge switches to apply a vlan based on a mac address (which won't be known in advance).

I feel like i'm missing something though so please feel free to correct me if i haven't quite grokked the macvlan use case

@dave-tucker
Contributor

@jainvipin thanks for the mega feedback. I think the answer to a lot of your questions lies in these simple statements. I firmly believe that all network configuration should be done natively, as a part of Docker. I also believe that docker run shouldn't be polluted with operational semantics, especially if this impacts the ability of docker run to be used with libswarm (e.g making assumptions on the environment) or adds complexity for devs using docker.

Orchestration systems populating netns and/or bridge details on the host, then asking Docker to plumb this in to the container doesn't seem right to me. I'd much rather see orchestration systems converge on, or create a driver in this framework (or one like it) that does the necessary configuration in Docker itself.

For multi-host, the Network Driver API will be extended to support the required primitives for programming the dataplane. This could take the form of OF datapath programming in the case of OVS, but it could also be adding plain old ip routes in the kernel. This is really up to the driver.

To that end, all of the improvements we're suggesting here for multi-host are designed to be agnostic to the backend used to deliver them.

@thockin
Contributor

thockin commented Nov 5, 2014

The caveat here is that Docker can not be everything to everyone, and the more we try to make it do everything, the more likely it is to blow up in our faces.

Having networking be externalized with a clean plugin interface (i.e. exec) is powerful. Network setup isn't exactly fast-path, so popping out to an external tool would probably be fine.

@jainvipin

@dave-tucker There are trade-offs of pulling everything (management, data-plane, and control-plane) in docker. While you highlighted the advantages (and I agree with some as indicated in my comment), I was noting a few disadvantages (versioning/compatibility, inefficiency, docker performance, etc.) so we can weigh it better. This is based on my understanding of things reading the proposal (no experimentation yet).

In contrast, if we can incorporate a small change (#8216) in docker, it can perhaps give scheduler/orchestrator/controller a good way to spawn the containers while allowing them to do networking related things themselves, and not have to move all networking natively inside docker – IMHO a good balance for what the pain point is and yet not make docker very heavy.

'docker run' has about 20-25 options now, some of them further provides more options (e.g. ‘-a’, or ‘—security-opt’). I don’t think it will remain 25 in near/short term, and likely grow rapidly to make it a flat unstructured set. The growth would come from valid use-cases (networking or non-networking), but must we consider solving that problem here in this proposal?

I think libswarm can work with either of the two models, where an orchestrator has to play a role of spawning ‘swarmd’ with appropriate network glue points.

@nkratzke

nkratzke commented Nov 5, 2014

What about weave (https://github.com/zettio/weave)? Weave provides a very convenient SDN solution for Docker from my point of view. And it provides encryption out of the box, which is a true plus; it is the only open source solution with out-of-the-box encryption we have found so far.

Nevertheless, weave's impact on network performance for HTTP-based and REST-like protocols is substantial: about 30% performance loss for small message sizes (< 1,000 bytes) and up to 70% performance loss for big message sizes (> 200,000 bytes). Performance losses were measured for the indicators time per request, transfer rate and requests per second, using apachebench against a simple ping-pong system exchanging data over an HTTP-based, REST-like protocol.

We are writing a paper for the next CLOSER conference to present our performance results. There are some options to optimize weave performance (e.g. not containerizing the weave router should bring 10% to 15% performance plus according to our data).

@shykes
Contributor

shykes commented Nov 5, 2014

@thockin absolutely we will need to couple this with a plugin architecture. See #8968 for first steps in that direction :)

At the same time, Docker will always have a default. Ideally that default should be enough for 80% of use cases, with plugins as a solution for the rest. When I ask about ovs as a viable default, it's in the context of this "batteries included but removable" model.

@shykes
Contributor

shykes commented Nov 5, 2014

Ping @erikh

@Lukasa

Lukasa commented Nov 5, 2014

@dave-tucker, @mavenugo and @nerdalert (and indeed @ everyone else):

It's really exciting to see this proposal for Docker! The lack of multi-host networking has been a glaring gap in Docker's solution for a while now.

I just want to quickly propose an alternative, lighter-weight model that my colleagues and I have been working on. The OVS approach proposed here is great if it's necessary to put containers in layer 2 broadcast domains, but it's not immediately clear to me that this will be necessary for the majority of containerized workloads.

An alternative approach is to pursue network virtualization at Layer 3. A good reference example is Project Calico. This approach uses BGP and ACLs to route traffic between endpoints (in this case containers). It is a much lighter-weight approach, so long as you can accept certain limitations: IP only, and no IP address overlap. Both of these feel like extremely reasonable limitations for a default Docker case.

We've prototyped Calico's approach with Docker, and it works perfectly, so the approach is simple to implement for Docker.

Docker is in a unique position to take advantage of lighter-weight approaches to virtual networking because it doesn't have the legacy weight of hypervisor approaches. It would be a shame to simply follow the path laid by hypervisors without evaluating alternative approaches.

(NB: I spotted #8952 and will comment there as well, I'd like the Calico approach to be viable for integration with Docker regardless of whether it's the default.)

@erikh
Contributor

erikh commented Nov 5, 2014

I have some simple opinions here but they may be misguided, so please feel free to correct my assumptions. Sorry if this seems overly simplistic but plenty of this is very new to me, so I’ll focus on how I think this should fit into docker instead. I’m not entirely sure what you wanted me to weigh in on @shykes, so I’m trying to cover everything from a design angle.

I’ll weigh in on the nitty-gritty of the architecture after some more experimentation with openvswitch (you know, when I have a clue :).

After some consideration, I think weave, or something like it, should be the default networking system in docker. While this may ruffle some feathers, we absolutely have to support the simple use case. I think it’s safe to say developers don’t care about openvswitch, they care that they can start postgres and rails and they just work together. Weave brings this capability without a lot of dependencies at the cost of performance, and it’s very possible to embed directly into docker, with some collaborative work between us and the zettio team.

That said, openvswitch should definitely be available and first-class for production use (weave does not appear at a glance to be made for especially demanding workloads), and ops professionals will appreciate the necessary complexity with the bonus flexibility. The socketplane guys seem extremely skilled and knowledgeable with openvswitch and we should fully leverage that, standing on the shoulders of giants.

In general, I am all for anything that gets rid of this iptables/veth mess we have now. The code is very brittle and racy, with tons of problems, and basically makes life for ops a lot harder than it needs to be even in trivial deployments. At the end of the day, if ops teams can’t scale docker because of a poor network implementation it simply won’t get adopted in a lot of institutions.

The downside to all of this is if we execute on the above, that we have two first-class network solutions, both of which have to be meticulously maintained regularly, and devs and ops may have an impedance mismatch between dev and prod. I think that’s an acceptable trade for “it just works” on the dev side, as painful as it might end up being for docker maintainers. Ops can always create a staging environment (As they should) if they need to test network capabilities between alternatives, or help devs configure openvswitch if that’s absolutely necessary.

I would like to take plugin discussion to the relevant pull requests instead of here; I think it's distracting from the discussion. Additionally, the people behind the work on the plugin system are not specifically focused on networking, but on a wider goal, so the best place to have that discussion is there.

I hope this was useful. :)

-Erik

@mavenugo
Contributor

mavenugo commented Nov 5, 2014

@thockin @jainvipin @shykes I just want to bring to your attention that this proposal tries to bring in a solid foundation for network plumbing and in no way precludes higher-order orchestrators from adding more value on top. I think adding more details on the API and integration will help clarify some of these concerns.

In the past, we have picked up some deep scars from approaches that let non-native solutions dictate the basic plumbing model, leading to crippled default behavior and a fractured community.
This proposal is to make sure we have considered all the defaults that must be native to Docker and are not dependent on external orchestrators to define the basic network plumbing. Docker being the common platform, everyone should be able to contribute to the default feature set and benefit from it.

@mavenugo
Contributor

mavenugo commented Nov 5, 2014

@Lukasa Please refer to a couple of important points in this proposal that address exactly the points you raise:

"Our experience leads us towards using similar consistency protocol such as a tenant aware BGP in order to achieve the worry free environment developers and operators desire. This also presents an evolvable architecture if a tighter coupling into the native network is of value in the future."

"By extending L3 to the true edge of the network in the vSwitch it enables a proven network scale while still retaining the ability to perform disaggregated network services on the edge. Extending gateway protocols to the host will play a significant role in scaling a tight coupling to the network architecture."

Please refer to #8952, which provides the details on how a driver / plugin can help in choosing an appropriate networking backend. I believe that is the right place to discuss including an alternative backend that fits best in certain scenarios.

This proposal is to explore all the multi-host networking options and the native Docker integration of those features.

@mavenugo
Contributor

mavenugo commented Nov 5, 2014

@erikh Thanks for weighing in. Is there anything specific in the proposal that leads you to believe it will make the life of the application developer more complex? We wanted to provide a wholesome view of the network operations and choices in a multi-host production deployment, and hence the proposal description became network-operations heavy. I just want to assure you that it will in no way expose any complexity to the application developers.

One of the primary goals of Docker is to provide seamless and consistent mechanism from dev to production. Any impedance mismatch between dev and production should be discouraged.

+1 to "I think it’s safe to say developers don’t care about openvswitch, they care that they can start postgres and rails and they just work together."
The discussion on OVS vs. Linux bridge + iptables is purely an infra-level discussion and shouldn't impact the application developers in any way. Also, that discussion should be kept under #8952.

This proposal is to bring multi-host networking Native to Docker, Transparent to Developers and Friendly to Operations.

@rade

rade commented Nov 5, 2014

@shykes

absolutely we will need to couple this with a plugin architecture

+1

I reckon that architecturally there are three layers here...

  1. generic docker plug-in system
  2. networking plug-in API, sitting on top of 1)
  3. specific implementation of 2), e.g. based on OVS, user-space, docker's existing bridge approach, our own (weave), etc.

Crucially, 2) must make as few assumptions as possible about what docker networking looks like, such as to not artificially constrain/exclude different approaches.

As a strawman for 2), how about wiring a ConfigureContainerNetworking(<container>) plug-in invocation into docker's container startup workflow just after the docker container process (and hence network namespace) has been created?

@dave-tucker Is this broadly compatible with your thinking on #8952?
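
To make the strawman concrete, a hypothetical exec-style hook (the name, arguments and calling convention here are invented purely for illustration and are not part of Docker): Docker would run an executable after creating the container's network namespace, passing the container ID and netns path, and the plugin would do the wiring for whatever backend it implements:

  #!/bin/bash
  # configure-container-networking <container-id> <netns-path>  (hypothetical contract)
  CID="$1"
  NETNS="$2"

  # Make the namespace visible to `ip netns`
  mkdir -p /var/run/netns
  ln -sf "$NETNS" "/var/run/netns/$CID"

  # Example backend: attach the container to an OVS bridge via a veth pair
  ip link add "veth${CID:0:8}" type veth peer name eth0 netns "$CID"
  ovs-vsctl add-port br-int "veth${CID:0:8}"
  ip link set "veth${CID:0:8}" up
  ip netns exec "$CID" ip addr add 10.1.1.5/24 dev eth0
  ip netns exec "$CID" ip link set eth0 up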

@MalteJ
Contributor

MalteJ commented Nov 5, 2014

I would like to see a simple but secure standard network solution (e.g. preventing ARP spoofing; the current default config is vulnerable to this). It should be easy to replace with something more comprehensive. And there should be an API that you can connect to your network management solution.
I don't want to put everything into docker - sounds like a big monolithic monstrosity.
I am OK with a simple default OpenVSwitch setup.
With OVS the user will find lots of documentation and has lots of configuration possibilities - if he likes to dig in.

@titanous
Contributor

titanous commented Nov 5, 2014

I'd like to see this as a composable external tool that works well when wrapped up as a Docker plugin, but doesn't assume anything about the containers it is working with. There's no reason why this needs to be specific to Docker. This also will require service discovery and cluster communication to work effectively, which should be a pluggable layer.

@dave-tucker
Contributor

@erikh "developers don't care about openvswitch" - I agree.

Our solution is designed to be totally transparent to developers such that they can deploy their rails or postgres containers safe in the knowledge that the plumbing will be taken care of.

The other point of note here is that the backend doesn't have to be Open vSwitch - it could be anything, so long as it honours the API. You could theoretically have multi-host networking using this control plane, but with Linux bridge, iptables and whatever else in the backend.

We prefer OVS, the only downside being that we require "openvswitch" to be installed on the host, but we've wrapped up all the userland elements in a docker container - the kernel module is available in 3.7+

@dave-tucker
Contributor

@rade yep - philosophy is exactly the same. lets head on over to #8952 to discuss

@nerdalert
Contributor Author

Hi @MalteJ, Thanks for the feedback.
"And there should be an API that you can connect to your network management solution."

  • A loosely coupled management plane is definitely something that probably shouldn't affect the potential race conditions, performance or scale of deployments, other than some policy float.
  • The basic building blocks proposed are to ensure a container can have networking provisioned with as little latency as possible which is ultimately local to the node. Once provisioned, the instance is eventually consistent with updates to its peers.
  • The potential network density in a host is a virtual port density multiplier beyond anything to date in a server, and is typically solved in networking today with purpose-built network ASICs for packet forwarding. This is why we are very passionate about Docker having the fundamental capabilities of an L3 switch, complete with a fastpath in kernel or OVS actuated in hardware (e.g. Intel), along with L4 flow services in OVS for performance and manageability, to reduce as much risk as possible. The reasonable simplicity of a well-known network consistency model feels very right to those of us who have ever been measured on service uptime. Implementing natively in Docker captures a handful of the dominant network architectures out of the box, which reflects a Docker community core value of being easy to deploy, develop against and operate.

@maceip

maceip commented Nov 5, 2014

Wanted to drop in and mention an alternative to VxLAN: GUE -> an in-kernel, L3 encap solution recently (soon to be?) merged into Linux: torvalds/linux@6106253
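
For reference, a rough sketch of what GUE looks like from userspace with a recent kernel and iproute2 (exact option names depend on the iproute2 version, so treat this as an approximation rather than a recipe):

  # Open a local UDP port for GUE decapsulation
  ip fou add port 5555 gue

  # An IPIP tunnel whose traffic is GUE-encapsulated over UDP port 5555
  ip link add name gue0 type ipip \
      remote 192.168.0.12 local 192.168.0.11 \
      encap gue encap-sport auto encap-dport 5555
  ip link set gue0 up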

@c4milo

c4milo commented Nov 5, 2014

@maceip agreed with you. It seems to me that an efficient and minimal approach to networking in Docker would be using VXLAN + DOVE extensions or, even better, GUE. I'm inclined to think that OVS is too much for containers but I might be just biased.

@maceip

maceip commented Nov 5, 2014

Given my limited experience, I don't see a compelling reason to do anything in L2 (ovs/vxlan). Is there an argument explaining why people want this? Generic UDP Encapsulation (GUE) seems to provide a simple, performant solution to this network overlay problem, and scales across various environments/providers.

@shykes
Contributor

shykes commented Nov 5, 2014

@maceip @c4milo isn't GUE super new and poorly supported in the wild? Regarding vxlan+dove, I believe OVS can be used to manage it. Do you think we would be better off hitting the kernel directly? I can see the benefits of not carrying the entire footprint of OVS if we only use a small part of it - but that should be weighed against the difficulty of writing and maintaining new code. We faced a similar tradeoff between continuing to wrap lxc, or carrying our own implementation with libcontainer. Definitely not a no-brainer either way.

@erikh
Contributor

erikh commented Jan 9, 2015

We're reopening this after some discussion with @mavenugo pointing out that our proposal is not a solution for everything in here -- and it should be much closer.

We want this in docker and we don't want to communicate otherwise. So, until we can at least mostly incorporate this proposal into our new extension architecture, we will leave it open and solicit comments.

@erikh erikh reopened this Jan 9, 2015
@c4milo

c4milo commented Jan 9, 2015

@erikh would you mind giving us the main takeaways after your discussion with @mavenugo?

@mavenugo
Contributor

mavenugo commented Jan 9, 2015

@c4milo following is the docker-network IRC log between us regarding reopening the proposal.

madhu: erikh: backjlack thanks for all the great work
[06:12am] madhu: on closing the proposals
[06:13am] madhu: 9983 replaces 8952 and hence closing is accurate
[06:13am] madhu: but imho 8951 should be still open because it is beyond just drivers
[06:13am] madhu: but a generic architecture for all the considerations for a multi-host scenario
[06:14am] madhu: we can close it once all the scenarios are addressed. through other proposals or through 8951
[06:14am] backjlack: madhu: Personally, I'd rather see 9983 implemented and then revisit 8951 to request an update.
[06:15am] madhu: backjlack: okay. if that is the preferred approach sure
[06:15am] erikh: gh#8951
[06:15am] erikh: hmm.
[06:15am] erikh: need to fix that.
[06:15am] confounds joined the chat room.
[06:15am] madhu: keeping it open is actually better imho
[06:15am] erikh: hmm
[06:16am] erikh: backjlack: do you have any objections to keeping it open? madhu does have a pretty good point here.
[06:16am] erikh: we can incorporate it and close it if we feel necessary later
[06:16am] madhu: exactly. that way we can easily answer the questions that are raised
[06:17am] backjlack: erikh: My main concern is that it's more of a discussion around adding OVS support.
[06:17am] erikh: hmm
[06:17am] erikh: ok. let me review and get back to you guys.
[06:17am] madhu: thanks erikh backjlack
[06:17am] madhu: backjlack: just curious. is there any trouble in keeping it open vs closed ?
[06:18am] erikh: hmm
[06:19am] erikh: the only concern I have is that with several networking proposals that we're accidentally misleading our users
[06:19am] backjlack: madhu: If it's open, people leave comments like this one: #8952 (comment)
[06:19am] backjlack: They're under the impression nobody cares about implementing that and it's very confusing.
[06:20am] erikh: hmm
[06:20am] erikh: backjlack: let's leave it open for now
[06:20am] madhu: backjlack: okay good point
[06:20am] madhu: but we were waiting on the extensions to be available
[06:20am] erikh: if we incorporate everything into the new proposal, we will close it.
[06:20am] erikh: (And we can work together to fit that goal)
[06:20am] madhu: now that we are having the momentum, there will be code backing this all up
[06:20am] jodok joined the chat room.
[06:20am] madhu: thanks erikh that would be my suggestion too
[06:21am] erikh: backjlack: WDYT? I think it's reasonable to let people know (by example) we're trying to solve the problem, even if our answers don't necessarily line up with that proposal
[06:22am] backjlack: erikh: Sure, we can reopen the issue and update the top level text to let people know this is going to be addressed after #9983 gets implemented.
[06:22am] erikh: yeah, that's a good idea.
[06:22am] erikh: madhu: can you drive updating the proposal and referencing our new one as well?
[06:23am] erikh: I'll reopen it.
[06:23am] madhu: yes sir.
[06:23am] madhu: thanks guys. appreciate it

@c4milo

c4milo commented Jan 9, 2015

@mavenugo nice, thank you, it makes more sense now :)

@bmullan

bmullan commented Feb 5, 2015

Related to VxLAN and the network "overlay": the stumbling block to implementation/deployment was always the requirement for multicast to be enabled in the network... which is rare.

Last year Cumulus Networks and MetaCloud open sourced VXFLD to implement VxLAN with unicast and UDP.

They also submitted it for consideration as a standard.

MetaCloud has since been acquired by Cisco Systems.

VXFLD consists of 2 components that work together to solve the BUM (Broadcast, Unknown unicast & Multicast) problem with VxLAN by using unicast instead of the traditional multicast.

The 2 components are called VXSND and VXRD.

VXSND provides:

  • unicast BUM packet flooding via the Service Node Daemon (the SND in VXSND)
  • VTEP (Virtual Tunnel End-Point) "learning"

VXRD provides:

  • a simple Registration Daemon (the RD in VXRD) designed to register local VTEPs with a remote vxsnd daemon

the source for VXFLD is on Github: https://github.com/CumulusNetworks/vxfld

Be sure to read the two github VXFLD directory .RST files, as they describe in more detail the two daemons for VXFLD, VXRD and VXSND.

I thought I'd mention VXFLD as it could potentially solve part of your proposal and... the code already exists.
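
Tangentially, the same avoid-multicast idea is available in the plain Linux VXLAN driver by pre-populating the FDB with unicast peers; this is not VXFLD, just a sketch of unicast head-end replication with illustrative addresses:

  # VXLAN device with no multicast group configured
  ip link add vxlan42 type vxlan id 42 dev eth0 dstport 4789

  # Flood BUM traffic to explicit unicast peers instead of a multicast group
  bridge fdb append 00:00:00:00:00:00 dev vxlan42 dst 192.168.0.12
  bridge fdb append 00:00:00:00:00:00 dev vxlan42 dst 192.168.0.13

  ip link set vxlan42 up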

If you use Debian or Ubuntu, Cumulus has also pre-packaged 3 .deb files for VXFLD:

http://repo.cumulusnetworks.com/pool/CumulusLinux-2.2/main/vxfld-common_1.0-cl2.2~1_all.deb

http://repo.cumulusnetworks.com/pool/CumulusLinux-2.2/main/vxfld-vxrd_1.0-cl2.2~1_all.deb

and
http://repo.cumulusnetworks.com/pool/CumulusLinux-2.2/main/vxfld-vxsnd_1.0-cl2.2~1_all.deb

@jessfraz added the Proposal and kind/feature labels and removed the kind/feature label Feb 25, 2015
@rcarmo

rcarmo commented Feb 28, 2015

I'd like to chime in on this. I've been trying to put together a few arguments for and against doing this transparently to the user, and coming from a telco/"purist SDN" background it's hard to strike a middle ground between ease of use for small deployments and the kind of infrastructure we need to have it scale up into (and integrate with) datacenter solutions.

(I'm rather partial to the OpenVSwitch approach, really, but I understand how weave and pipework can be appealing to a lot of people)

So here are my notes:


This is just a high-level overview of how software-defined networking might work in a Docker/Swarm/Compose environment, written largely from a devops/IaaS perspective but with a fair degree of background on datacenter/telco networking infrastructure, which is fast converging towards full SDN.

There are two sides to the SDN story:

  • Sysadmins running Docker in a typical IaaS environment, where a lot of the networking is already provided for (and largely abstracted away) but where there's a clear need for communicating between Docker containers in different hosts.
  • On-premises telco/datacenter solutions where architects need deeper insight/control into application traffic or where hardware-based routing/load balancing/traffic shaping/QoS is already being enforced.

This document will focus largely on the first scenario and a set of user stories, with hints towards the second one at the bottom.

Offhand, there are two possible approaches from an end-user perspective:

  • Extending the CLI linking syntax and have the system build the extra bridge interfaces and tunnels "magically" (preserves the existing environment variable semantics inside containers)
  • Exposing networks as separate entities and make users aware of the underlying complexity (requires extra work for simple linking, may need extra environment variables to facilitate discovery, etc.).

This is largely described in http://www.slideshare.net/adrienblind/docker-networking-basics-using-software-defined-networks already, and is what pipework was designed to do.

Arguments for Keeping Things Simple (Sticking to Port Mapping)

Docker's primary networking abstraction is essentially port mapping/linking, with links exposed as environment variables to the containers involved - that makes application configuration very easy, as well as lessening CLI complexity.

Steering substantially away from that will shift the balance towards "full" networking, which is not necessarily the best way to go when you're focused on applications/processes rather than VMs.

Some IaaS providers (like Azure) provide a single network interface by default (which is then NATed to a public IP or tied to a load balancer, etc.), so the underlying transport shouldn't require extra network interfaces to work.

Arguments for Increasing Complexity (Creating Networks)

Docker does not exist in a vacuum. Docker containers invariably have to talk to services hosted in more conventional infrastructure, and Docker is increasingly being used (or at least proposed) by network/datacenter vendors as a way to package and deploy fairly low-level functionality (like traffic inspection, shaping, even routing) using solutions like OpenVSwitch and custom bridges.

Furthermore, containers can already see each other within a host - each is provided with a 172.17.0.0/16 IP address, which is accessible from other containers. Allowing users to define networks and bind containers to networks, rather than solely to ports, may greatly simplify establishing connectivity between sets of containers.

Middle Ground

However, using Linux kernel plumbing (or OpenVSwitch) to provide Docker containers with what amount to fully-functional network interfaces implies a number of additional considerations (like messing with brctl) that may have unforeseen (and dangerous) consequences in terms of security, not to mention the need to eventually deal with routing and ACLs (which are currently largely the host's concern).

On the other hand, there is an obvious need to restrict container (outbound) traffic to some extent, and a number of additional benefits that stem from providing limited visibility onto a network segment, internal or otherwise.

Minimal Requirements:

There are a few requirements that seem fairly obvious:

  • Docker containers should be able to talk to each other inside a swarm (i.e., a pre-defined set of hosts managed by Swarm) regardless of in which host they run.
  • That communication should have the least possible overhead (but, ideally, use a common enough form of encapsulation - GRE, IPoIP - that allows network teams to inspect and debug on the LAN using common, low-complexity tools)
  • One should be able to completely restrict outbound communications (there is a strong case to do that by default, in fact, since a compromised container may be used to generate potentially damaging traffic and affect the remainder of the infrastructure).

Improvements (Step 1):

  • Encrypted links when linking between Swarm hosts on open networks (which require extra setup effort)
  • Limiting outbound traffic from containers to specific networks or hosts (rather than outright on/off) is also desirable (but, again, requires extra setup)

Further Improvements (Step 2):

  • Custom addressing and bridging for allowing interop with existing DC solutions
  • APIs for orchestrating and managing bridges, vendor interop.

Likely Approaches (none favored at this point):

  • Wrap OpenVSwitch (or abstract it away) into a Docker tool
  • Have two tiers of network support, i.e., beef up pipework (or weave) until it's easier to use and allow for custom OpenVSwitch-like solutions

@mk-qi

mk-qi commented Mar 20, 2015

hello everyone;


I set docker0 on hosta and hostb to the same network via VXLAN, and they can ping each other, but Docker always allocates the same IPs on hosta and hostb. Is there any way, plugin, or hack to help me check whether an IP already exists?

@thockin
Contributor

thockin commented Mar 20, 2015

You need to pre-provision each docker0 with a different subnet range. Even then you probably will not be able to ping across them unless you also add your eth0 as a slave on docker0.

read this: http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/

@fzansari

@mk-qi: You can use "arping", which is essentially a utility to discover whether an IP is already in use within a network. That's how you can make sure Docker does not use the same set of IPs when it spans multiple hosts.
Or another way is to statically assign the IPs yourself.
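
A sketch of that check using the iputils arping duplicate-address-detection mode, run from the Docker host before handing out an address (interface and address are examples):

  # Exit status 0 means no reply was seen, i.e. the address appears free;
  # non-zero means some other host or container already answers for it.
  if arping -D -q -c 2 -I docker0 172.17.0.5; then
      echo "172.17.0.5 appears free"
  else
      echo "172.17.0.5 is already in use"
  fi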

@mk-qi

mk-qi commented Mar 20, 2015

@thockin sorry, I did not draw the picture clearly. In fact eth0 is the slave of docker0, and as I said before, I can ping them from each other...

@shykes I saw your fork https://github.com/shykes/docker/tree/extensions/extensions/simplebridge - it looks like it pings an IP before actually assigning it, but I am not sure; could you give more information?

@mk-qi

mk-qi commented Mar 20, 2015

@fzansari thanks for the reply. Static IP allocation is OK; in fact we had been using pipework + macvlan (+ DHCP) for some small running clusters, but when running many containers it is very painful to manage IPs. Of course we can write tools, but I think hacking Docker to directly solve the IP conflict problem would make things much simpler, if that is possible.

@SamSaffron

Having just implemented keepalived internally I think there would be an enormous benefit from simply implementing an interoperable vrrp protocol. It would allow docker to "play nice" without forcing it on every machine in the network.

For example:

Host 1 (ip address 10.0.0.1):

docker run --vrrp eth0 -p 10.0.0.100:80:80 --priority 100 --network-id 10 web

Host 2 (ip address 10.0.0.2; backup service):

docker run --vrrp eth0 -p 10.0.0.100:80:80 --priority  50 --network-id 10 web

Supporting vrrp gives a very clean failover story and allows you to simply assign an IP to a service. It would take a lot to flesh out the details but I do think it would be an amazing change.
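
For comparison, the same failover expressed with keepalived running on each Docker host next to the published container; a minimal sketch, not a proposed Docker feature (Host 2 would use 'state BACKUP' and 'priority 50'):

  # On Host 1: write a minimal keepalived config and start the daemon
  printf '%s\n' \
    'vrrp_instance web {' \
    '    state MASTER' \
    '    interface eth0' \
    '    virtual_router_id 10' \
    '    priority 100' \
    '    advert_int 1' \
    '    virtual_ipaddress {' \
    '        10.0.0.100/24' \
    '    }' \
    '}' > /etc/keepalived/keepalived.conf
  service keepalived restart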

@cpuguy83
Member

Closing, since multi-host networking, plugins, etc. have all been in since Docker 1.9.
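
For readers landing here later, the shipped form of this looks roughly like the following (Docker 1.9+ overlay driver; the key/value store address is an example):

  # Daemons on each host pointed at a shared key/value store (e.g. Consul)
  docker daemon --cluster-store=consul://192.168.0.10:8500 --cluster-advertise=eth0:2376

  # Create a multi-host overlay network and attach containers to it by name
  docker network create -d overlay mynet
  docker run -d --net=mynet --name web nginx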
