
Decide whether/how to extend the networking model #188

Closed
bgrant0607 opened this issue Jun 20, 2014 · 20 comments
Labels
sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@bgrant0607
Member

Yesterday on IRC, @smarterclayton raised an issue with @jbeda and @thockin about whether other cloud providers/platforms (e.g., OpenStack, AWS) could handle the IP-per-pod networking model. I'm opening an issue to capture discussion about how to address this.

DESIGN.md doesn't really explain the motivation for the model in depth. We should capture our decision there.

IP-per-pod creates a clean, backward-compatible model where pods can be treated much like VMs from the perspectives of port allocation, networking, naming, service discovery, load balancing, application configuration, and migration.

OTOH, dynamic port allocation:

  • requires supporting both static ports (e.g., for externally accessible services) and dynamically allocated ports
  • requires partitioning centrally allocated and locally acquired dynamic ports
  • complicates scheduling
  • is inconvenient for users and complicates application configuration
  • is plagued by port conflicts, reuse, and exhaustion
  • requires non-standard approaches to naming (e.g., etcd rather than DNS)
  • requires proxies and/or redirection for programs using standard naming/addressing mechanisms (e.g., web browsers)
  • requires watching and cache invalidation for address/port changes for instances, in addition to watching group membership changes
  • obstructs container/pod migration (e.g., using CRIU)

One possibility is that we could use libswarm to spin up a new VM for each pod on systems that don't have good enough routing support for allocation of an IP per pod.

@smarterclayton
Contributor

The VM per pod model would negate the container efficiency gains - I'm not sure how many people would be interested in deploying like that vs. simply using a VM.

Practically speaking, if a routable IP per pod is beyond the technical capabilities of existing clouds at reasonable densities (is the 60-IP limit on an Amazon m1.xlarge too low for practical use cases?) or beyond the administrative/organizational capabilities of non-cloud shops (which deserves further investigation), and if IPv6 is still 2-3 years out from reasonable deployment, then the Kubernetes model is only deployable on GCE in practice. It would be good to list out the practical limits in other clouds (OpenStack Neutron, AWS, SoftLayer) as well as a recommended IP-per-pod configuration that would work without a ton of admin headaches on metal.

It's possible that dynamic port allocation could be an alternate mode, supported with a subset of features and known limitations. What would that abstraction have to look like for Kubernetes to continue to work on the ideal path? A few things I can think of: the scheduler has to be aware of port exhaustion and record allocated ports, OR the exposed ports have to be reported back to the master via some backchannel. If there is a global record of allocated ports, a mechanism is required to efficiently distribute that port information to the appropriate proxies. You must implement at least one level of abstraction between container communication (either a local or shared proxy, or an iptables NAT translation a la geard). You also must implement a more complex migration path for things like CRIU, with more steps before and after to ensure that the network abstraction is ready to accept the moved container.

@jbeda jbeda added the question label Jun 20, 2014
@thockin
Member

thockin commented Jun 24, 2014

I've started a doc on this topic, but will be out of office about half of this week.

@smarterclayton
Contributor

Still gathering more feedback from customers and ops folks, but there's a lot of concern about being able to deploy the IP-per-container model outside of the big cloud providers. Recording comments I'm hearing:

  • In most non-cloud deployments this involves setting up the necessary network configuration to make IP allocation scalable
  • This is expensive for operations teams
  • In some organizations this may be a non-starter (maybe not many, but it's tough to get configured)
  • If it's possible to hack this together slowly, it's less of a concern. I.e., if you start with 1-2 minions and there are some simple scripts you can run to hack together a VPC for the containers, you might be OK up to 4-5 minions. Then you need to switch to something more maintainable.
  • A lot of ops shops expect to be in control of things like DHCP, and are leery of deploying multiple DHCP servers just to configure a special VPC. So they'd have to do the config to integrate the container use case into their existing DHCP, which can be frustrating.

@lexlapax

Some feedback here on the IP per pod/container model:
  • IPv6 is further along in larger service providers and enterprise datacenters than we think.
  • Larger enterprises already use datacenter-wide private IP spaces with datacenter (or row/rack/floor) wide NAT.
I think that rather than changing/removing the model of network-addressable containers/pods, there should be other options added to allow operating in OpenStack or other similar clouds.
Agree with Clayton, putting pods inside VMs is probably not the way to think about this.

@MalteJ

MalteJ commented Jul 11, 2014

I would like to get an IPv6 per pod.
Yes, it is true: lots of IaaS providers do not support IPv6 (DigitalOcean does). But if you rent a dedicated Linux server, most of the time you get an IPv6 subnet for free (e.g., a /64). So it would be nice to get an IPv6 address, or even a subnet (/112 or something), per pod and one /128 IPv6 address per container.
The next feature would be a nice firewalling/security-group solution to restrict unwanted internet access to the pods, or even to restrict inter-pod access -- e.g., for different deployment stages (dev, test, prod, etc.).

@bgrant0607
Member Author

Thoughts on networking and naming, including more background, partly from Tim's aforementioned doc.

Kubernetes's current networking model is described here:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/DESIGN.md#network-model
and is described in some detail in issue #15 and below:

We start Docker with:
DOCKER_OPTS="--bridge cbr0 --iptables=false"

We set up this bridge on each node with SaltStack:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/cluster/saltbase/salt/_states/container_bridge.py

cbr0:
  container_bridge.ensure:
    - cidr: {{ grains['cbr-cidr'] }}
...
grains:
  roles:
    - kubernetes-pool
  cbr-cidr: $MINION_IP_RANGE

We make these addresses routable in GCE:
gcutil addroute ${MINION_NAMES[$i]} ${MINION_IP_RANGES[$i]} \
  --norespect_terminal_width \
  --project ${PROJECT} \
  --network ${NETWORK} \
  --next_hop_instance ${ZONE}/instances/${MINION_NAMES[$i]} &

The minion IP ranges are /24s in the 10-dot space.

GCE itself does not know anything about these IPs, though.

These are not externally routable, though, so containers that need to communicate with the outside world must use host networking. If we set up an external IP that forwards to the VM, it will only forward to the VM's primary IP (which is assigned to no pod). So we use Docker's -p flag to map published ports to the main interface. This has the side effect of disallowing two pods from exposing the same port. (More discussion on this in #390.)

We create a container to use for the pod network namespace -- a single loopback device and a single veth device. All the user's containers get their network namespaces from this pod networking container.

Docker allocates IP addresses from the bridge we create on each node, and we use its “container” networking mode as follows:

  1. Create a normal (in the networking sense) container which uses a minimal image and runs a command that blocks forever. This is not a user-defined container, and it gets a special well-known name.
  • creates a new network namespace (netns) and loopback device
  • creates a new pair of veth devices and binds them to the netns
  • auto-assigns an IP from Docker's IP range
  2. Create the user containers and specify their “net” argument as “container:<name of the network container created in step 1>”. Docker finds the PID of the command running in the network container and attaches to the netns of that PID.

The net result is that all user containers within a pod behave as if they are on the same host with regard to networking. They can all reach each other’s ports on localhost. Ports which are published to the host interface are done so in the normal Docker way. All containers in all pods can talk to all other containers in all other pods by their 10-dot addresses.
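
As a rough sketch of the mechanics (the container names and images here are placeholders, not the actual well-known names the Kubelet uses):

# Names and images below are illustrative placeholders.
# 1. Start the pod's network container; it just blocks forever and holds the netns.
docker run -d --name pod-net-example busybox sleep 1000000

# 2. Start the user containers, joining the network container's namespace.
docker run -d --name app --net container:pod-net-example myapp
docker run -d --name sidecar --net container:pod-net-example mysidecar

# app and sidecar now share one IP and one port space, and reach each other on localhost.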

In addition to avoiding the aforementioned problems with dynamic port allocation, this approach reduces friction for applications moving from the world of uncontainerized apps on physical or virtual hosts to containers within pods. People running application stacks together on the same host have already figured out how to make ports not conflict (e.g., by configuring them through environment variables) and have arranged for clients to find them.

It reduces isolation between containers within a pod -- ports could conflict, and containers could not have ports that are private from their peers in the pod. But applications requiring their own port spaces can just run as separate pods, and processes requiring private communication can run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control which containers belong to the same pod whereas, in general, they don't control which pods land together on a host.

When any container calls SIOCGIFADDR, it sees the IP that any peer container would see it coming from -- each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. (We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to communicate through volumes (e.g., tmpfs) or IPC.
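
As a minimal sketch of why this matters for self-registration (the etcd address, port, and key layout below are assumptions for illustration, not a prescribed scheme):

# Inside a container: the pod IP is the same address peers will see.
POD_IP=$(ip addr show eth0 | awk '/inet /{sub(/\/.*/, "", $2); print $2}')

# Register it with etcd's v2 key API; endpoint and key path are illustrative only.
curl -s -X PUT "http://10.0.0.2:4001/v2/keys/registry/myservice/$(hostname)" \
  -d value="${POD_IP}:8080"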

This is different from the standard Docker model. In that model, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container, the peer would see the connection coming from a different IP than the container itself knows. In short: you can never self-register anything from a container, because a container cannot be reached on its private IP.

An alternative we considered was an additional layer of addressing: pod-centric IP per container. Each container would have its own local IP address, visible only within that pod. This would perhaps make it easier for containerized applications to move from physical/virtual hosts to pods, but would be more complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) and to reason about, due to the additional layer of addressing, and would break self-registration and IP distribution mechanisms.

We want to be able to assign IP addresses externally from Docker (moby/moby#6743) so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across network container restarts (moby/moby#2801), and to facilitate pod migration. Right now, if the network container dies, all the user containers must be stopped and restarted because the netns of the network container will change on restart, and any subsequent user container restart will join that new netns, thereby not being able to see its peers. Additionally, a change in IP address would encounter DNS caching/TTL problems. External IP assignment would also simplify DNS support (see below). And, we could potentially eliminate the bridge and use network interface aliases instead.

IPv6 would be a nice option, also, but we can't depend on it yet. Docker support is in progress: moby/moby#2974, moby/moby#6923, moby/moby#6975. Additionally, direct IPv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-)

We'd also like to set up DNS automatically (#146). hostname, $HOSTNAME, etc. should return a name for the pod (#298), and gethostbyname should be able to resolve names of other pods. Probably we need to set up a DNS resolver to do the latter (moby/moby#2267), so that we don't need to keep /etc/hosts files up to date dynamically.

If we want Docker links and/or docker inspect to work, we may have work to do there. Right now, docker inspect doesn't show the networking configuration of the containers, since they derive it from another container. That information should be exposed somehow. I haven't looked to see whether link variables would be set correctly, but I think there's a possibility they aren't.

We need to think more about what to do with the service proxy. Using a flat service namespace doesn't scale and environment variables don't permit dynamic updates.

We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), non-load-balanced services (#260), and other types of groups (worker pools, etc.). Providing the ability to Watch a label selector applied to pod addresses would enable efficient monitoring of group membership, which could be directly consumed or synced with a discovery mechanism. Event hooks (#140) for join/leave events would probably make this even easier.
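
Roughly, a consumer of such a watch might look like the following; the endpoint and query parameter here are hypothetical, not the current API:

# Hypothetical endpoint and parameters, for illustration only.
curl -s "http://${APISERVER}/watch/pods?labels=name%3Dfrontend" | while read -r event; do
  # Each event describes a pod joining or leaving the label selection; a discovery
  # mechanism (DNS, etcd, a load balancer config) could be synced from these events.
  echo "membership change: ${event}"
done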

We'd even like to make pods directly routable from the external internet, though we can't do that yet. One approach could be to create a new host interface for each pod, if we had a way to route an external IP to it.

We're also working on making it possible to specify a different bridge for each container. We may or may not still need this, but it could be useful for certain scenarios:
https://botbot.me/freenode/docker-dev/2014-06-05/?msg=15716610&page=4
moby/moby#6155
moby/moby#6704

@bgrant0607
Member Author

For completeness, other network-related issues:

host networking: #175
multiple bridges: #222

@smarterclayton
Contributor

@ironcladlou is also looking at OpenVSwitch and OpenDaylight integrations - making it easier to deploy these sorts of topologies on non-cloud infrastructure (or clouds with limited networking).

@lexlapax

We did some work on Docker with Open vSwitch as a proof of concept, and would be interested in those integrations -- I would even venture to say that they will be of interest to cloud providers who are looking to provide higher-abstraction services, a la container/pod deployment, in lieu of or in addition to plain IaaS abstractions.

@bgrant0607 bgrant0607 modified the milestone: P0 Jul 16, 2014
@Lennie

Lennie commented Jul 18, 2014

@bgrant0607 there is one thing that seems to be in conflict in your document:

It is the use a range of IP-addresses per Docker host:
"We want traffic between containers to use the pod IP addresses across nodes. Say we have Node A with a container IP space of 10.244.1.0/24 and Node B with a container IP space of 10.244.2.0/24. And we have Container A1 at 10.244.1.1 and Container B1 at 10.244.2.1. We want Container A1 to talk to Container B1 directly with no NAT."

which seems to conflict with facilitating pod migration and stable IP addresses:

"We want to be able to assign IP addresses externally from Docker so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across network container restarts, and to facilitate pod migration."

Sounds to me like you have a choice between:

  • IPs really are static: you end up routing /32s after you migrate a pod/container
  • IPs are almost always static, unless the pod/container gets migrated
  • IPs are pretty much dynamic anyway and you should just stop caring
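
To make the trade-off concrete, a sketch of the routing in each case, reusing the example subnets quoted above (commands are illustrative):

# Per-node /24 routing (current model): one route per node.
ip route add 10.244.1.0/24 via ${NODE_A_IP}
ip route add 10.244.2.0/24 via ${NODE_B_IP}

# If pod IPs must stay stable across migration, a migrated pod's address no longer
# matches its new node's /24, so you end up injecting host routes per migrated pod:
ip route add 10.244.1.1/32 via ${NODE_B_IP}   # Container A1 after moving to Node B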

@bgrant0607
Member Author

@Lennie Yes, our current implementation doesn't support migration. Another possibility I'm considering is to allocate an IP per service, and use services as the stable addressing mechanism. That also would work in the case that a pod was replaced by a replicationController.

@Lennie

Lennie commented Jul 26, 2014

I've been racking my brain trying to find a solution to this problem for a while now.

Have you seen how consulate and ambassadord use service discovery to connect containers? https://github.com/progrium/consulate https://github.com/progrium/ambassadord https://github.com/progrium/registrator (previously named docksul)

It uses 4 ideas:

  • the linking and ambassador pattern
  • environment variables can be used as metadata
  • it uses a system similar to SkyDock's: it watches the Docker socket so it can inspect containers as they are started and register them with a service discovery service
  • it uses an iptables redirect to send traffic to the ambassadord proxy

What it does is: you first deploy an ambassadord container with a proxy server that has access to the Docker socket.

When you deploy a new container, you add metadata about which backend you want the container to link to and on which port, and you link it to the ambassadord container.

When the process in the new container tries to connect to its ambassador, it connects to the IP address of the proxy and the port of the backend it wants to reach. The iptables redirect then sends that connection to the proxy.

The proxy can see the source IP address and the original destination port. It can then use that information to look up the metadata of the source container, or consult service discovery data stored in etcd or Consul, and connect it to an available backend, which could be on another host.

If the service has to be found through service discovery, something has to register it there. That is what the other project, registrator, is for: it watches containers being started and stopped.
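
Not ambassadord's actual rules, but a sketch of the iptables-redirect idea under assumed addresses and ports:

# Assumptions: the ambassador proxy listens on 10.0.3.1:10000, and the app
# connects to port 6379 expecting to reach its backend.
iptables -t nat -A OUTPUT -p tcp --dport 6379 -j DNAT --to-destination 10.0.3.1:10000

# The proxy recovers the original destination port (e.g., via SO_ORIGINAL_DST),
# looks up the backend in etcd/Consul, and forwards the connection there.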

@smarterclayton
Contributor

Hi Lennie - the pattern you describe is familiar to me (see #494 for a similar discussion of interconnection). I do tend to think that the ambassador should be something the infrastructure provides, rather than something modeled directly as a Docker container, but you always need a pattern that works with only Docker.

  1. Container sees a stable remote connection address (IP and environment vars)
    • Use either a stable remote IP or a stable local iptables or network mapping inside each container to achieve it
  2. Connections are defined by a higher level decision (this container should see that container)
    • have a good way to define 1-N relationships globally (services) or between sets of pods (other types of services)
  3. use local loadbalancing if possible for resilience
    • must propagate logical changes (service endpoint IPs) to the host
  4. avoid excessive proxying and use real networking where possible
    • 2 hops when loadbalancing or if doing automatic TLS wrapping on direct connections
    • avoid creating an abstraction on top of IP or DNS - instead use those abstractions where possible

I think most of the pieces are in place in Kubernetes, but I'd like to see something like the consulate pattern.

@Lennie

Lennie commented Jul 26, 2014

I believe the pattern I described is actually meant to be a Docker container that is part of the infrastructure.

For one, it gets access to the Docker socket; you probably don't want to do that if it is not a trusted part of the infrastructure. And the step that sets up the iptables rules applied to the container needs --privileged too.

Have a look at the interview with the author:
http://progrium.com/blog/2014/07/25/building-the-future-with-docker/

The reason it is a Docker container is so it can be linked to, and if everything is a container, it is easier to deploy.

That ambassador is a local proxy which handles the load balancing and watches for service changes.

On 1 and 4, I totally agree on using real networking.

The question is:

do you want the infrastructure to provide the service discovery, reconfiguration, and load balancing, or do you want the developer of the container being deployed to handle it? Probably the first, and in that case you'll need a local proxy.

@smarterclayton
Contributor

Sorry, I meant as a single ambassador container per remote link (should have clarified). Running the infrastructure proxy as a container certainly makes sense. And links v2 in Docker is moving toward the idea of a discovery hub on the host that can be externally configured, potentially as another container listening on libchan.

The service proxy on each minion implements much of this pattern today, although I'd additionally like to offer the ability for the IPs of pods that correspond to a label query to be late-bound as environment variables at container start time (to allow direct connection), as well as to create local virtual network links per container for singletons (for ease of development: 127.0.0.1:3306 in your container points to either a local service proxy or a pod IP) that can be late-bound dynamically.
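
For the late-bound environment variable case, the idea would be roughly the following (variable names and addresses are invented for illustration):

# At container start time the infrastructure resolves the label query and injects
# the matching pod IP as environment variables (names are invented):
docker run -d \
  -e DB_HOST=10.244.2.9 \
  -e DB_PORT=3306 \
  myapp
# The application connects directly to $DB_HOST:$DB_PORT, bypassing the proxy
# for that particular link.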

@Lennie

Lennie commented Jul 27, 2014

That is why I mentioned consul, because it has at least three of these properties:

  • it uses one proxy to handle all communication from the host to other hosts (and local backends of course). Obviously, you can still use more of these proxies if you have many containers.
  • it already uses environment variables as meta-data.
  • it also uses container linking, so it supports the model for local development of linking containers on one machine without using service discovery and a proxy.

@Lennie

Lennie commented Jul 27, 2014

So far the only thing it doesn't do -- and I haven't seen anyone implement or even mention it -- is multi-host inter-container communication firewalling with iptables (--icc=false). It is almost the complete opposite of direct connections.

That doesn't mean I haven't thought about it.

And my idea right now is, maybe it can be done.

If all containers use the proxy to talk to other containers, then the containers on the source host shouldn't be able to talk to anything else but the proxy. This is almost what icc=false does right now when you use linking.

If you have that, then all you need to do is set up a firewall rule on the destination host with a set of IP-addresses of source hosts. And you can actually use a pretty efficient ipset for that. Maybe even just 1 ipset.

Obviously, in the current model, where every pod has an IP address, that list might grow pretty large.

Docker has one mode for publishing a port on its public IP address: publish. Expose only has an effect locally. But you could have a third mode, with one iptables rule using that ipset. You could even have something like Consul in a Docker container with --net=host automatically manage that ipset based on something like service discovery, only it would just hold a list of all the hosts of this deployment.
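
A sketch of that ipset idea, assuming the set holds the host IPs of this deployment (set name, addresses, and port are illustrative):

# On each destination host: allow published container ports only from hosts
# that belong to this deployment.
ipset create deployment_hosts hash:ip
ipset add deployment_hosts 10.240.0.11
ipset add deployment_hosts 10.240.0.12

iptables -A FORWARD -p tcp --dport 8080 -m set --match-set deployment_hosts src -j ACCEPT
iptables -A FORWARD -p tcp --dport 8080 -j DROP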

One of the reasons I would like to see something like that is that I want some kind of multitenancy. Not for different customers, but for different deployments of different (or the same) applications from the same customer/developer/user.

Yeah, I know maybe that is just crazy talk. :-)

@smarterclayton
Contributor

That matches with at least some of the plans we have to enable MT in Kubernetes - it's probably just a matter of time before someone takes a stab at it.

@bgrant0607 bgrant0607 added this to the v1.0 milestone Aug 27, 2014
@MalteJ

MalteJ commented Sep 24, 2014

I am currently working on a Docker IPv6 implementation. The first tests look good. Every Docker host has a subnet from which it delegates one IPv6 address to each container.
The next step is to write some documentation and break the work down into smaller chunks, to create nice pull requests and get it into the Docker master branch.
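
Roughly, the per-host delegation could look like this (the prefixes are documentation examples, not what the implementation mandates):

# Suppose the cluster owns 2001:db8:1::/48 and each Docker host gets a /64:
#   host A: 2001:db8:1:a::/64    host B: 2001:db8:1:b::/64
# Each container receives one address from its host's /64, e.g. 2001:db8:1:a::5.
# Other hosts (or the upstream router) route each /64 toward the owning host:
ip -6 route add 2001:db8:1:a::/64 via 2001:db8:1::a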

@bgrant0607
Member Author

I think this issue is pretty well covered by other issues, such as #494, #146, #1107, #1261, #1307, #1437.

rphillips added a commit to rphillips/kubernetes that referenced this issue Jan 9, 2018
b3atlesfan pushed a commit to b3atlesfan/kubernetes that referenced this issue Feb 5, 2021
Make sure the cursor is a string in JSON
pjh pushed a commit to pjh/kubernetes that referenced this issue Jan 31, 2022
linxiulei pushed a commit to linxiulei/kubernetes that referenced this issue Jan 18, 2024