
Decide whether/how to extend the networking model #188

Closed
bgrant0607 opened this issue Jun 20, 2014 · 20 comments
Labels
sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@bgrant0607
Member

Yesterday on IRC, @smarterclayton raised an issue with @jbeda and @thockin about whether other cloud providers/platforms (e.g., OpenStack, AWS) could handle the IP-per-pod networking model. I'm opening an issue to capture discussion about how to address this.

DESIGN.md doesn't really explain the motivation for the model in depth. We should capture our decision there.

IP-per-pod creates a clean, backward-compatible model where pods can be treated much like VMs from the perspectives of port allocation, networking, naming, service discovery, load balancing, application configuration, and migration.

OTOH, dynamic port allocation:

  • requires supporting both static ports (e.g., for externally accessible services) and dynamically allocated ports
  • requires partitioning centrally allocated and locally acquired dynamic ports
  • complicates scheduling
  • is inconvenient for users and complicates application configuration
  • is plagued by port conflicts, reuse, and exhaustion
  • requires non-standard approaches to naming (e.g., etcd rather than DNS)
  • requires proxies and/or redirection for programs using standard naming/addressing mechanisms (e.g., web browsers)
  • requires watching and cache invalidation for address/port changes for instances, in addition to watching group membership changes
  • obstructs container/pod migration (e.g., using CRIU)

One possibility is that we could use libswarm to spin up a new VM for each pod on systems that don't have good enough routing support for allocation of an IP per pod.

@smarterclayton
Contributor

The VM per pod model would negate the container efficiency gains - I'm not sure how many people would be interested in deploying like that vs. simply using a VM.

Practically speaking, if a routable IP per pod is beyond the technical capabilities of existing clouds at reasonable densities (is the 60-IP limit on an Amazon m1.xlarge too low for practical use cases?) or beyond the administrative/organizational capabilities of non-cloud shops (which deserves further investigation), and if IPv6 is still 2-3 years out from reasonable deployment, then the Kubernetes model is only deployable on GCE in practice. It would be good to list out the practical limits in other clouds (OpenStack Neutron, AWS, SoftLayer) as well as a recommended IP-per-pod configuration that would work without a ton of admin headaches on metal.

It's possible that dynamic port allocation could be an alternate mode, supported with a subset of features and known limitations. What would that abstraction have to look like for Kubernetes to continue to work on the ideal path? A few things I can think of: the scheduler has to be aware of port exhaustion and record allocated ports, OR the exposed ports have to be reported back to the master via some backchannel. If there is a global record of allocated ports, a mechanism is required to efficiently distribute that port information to the appropriate proxies. You must implement at least one level of abstraction between container communication (either a local or shared proxy, or an iptables NAT translation a la geard). You also must implement a more complex migration path for things like CRIU, with more steps before and after to ensure that the network abstraction is ready to accept the moved container.

@jbeda jbeda added the question label Jun 20, 2014
@thockin
Member

thockin commented Jun 24, 2014

I've started a doc on this topic, but will be out of office about half of this week.

@smarterclayton
Contributor

Still gathering more feedback from customers and ops folks, but there's a lot of concern about being able to deploy the IP-per-container model outside of the big cloud providers. Recording comments I'm hearing:

  • In most non-cloud deployments this involves setting up the necessary network configuration to make IP allocation scalable
  • This is expensive for operations teams
  • In some organizations this may be a non-starter (maybe not many, but it's tough to get configured)
  • If it's possible to hack this together slowly, it's less of a concern. I.e., if you start with 1-2 minions and there are some simple scripts you can run to hack together a VPC for the containers, you might be OK up to 4-5 minions. Then you need to switch to something more maintainable.
  • A lot of ops shops expect to be in control of things like DHCP, and are leery of deploying multiple DHCP servers just to configure a special VPC. So they'd have to do the config to integrate the container use case into their existing DHCP, which can be frustrating.

@lexlapax

Some feedback here on the IP per pod/container model:
  • IPv6 is further along in larger service providers and enterprise datacenters than we think.
  • Larger enterprises already use datacenter-wide private IP spaces with datacenter (or row/rack/floor) wide NAT.
I think that rather than changing/removing the model of network-addressable containers/pods, there should be other options added to allow operating in OpenStack or other similar clouds.
Agree with Clayton, putting pods inside VMs is probably not the way to think about this.

@MalteJ

MalteJ commented Jul 11, 2014

I would like to get an IPv6 per pod.
Yes, it is true: lots of IaaS providers do not support IPv6 (DigitalOcean does). But if you rent a dedicated Linux server, most of the time you get an IPv6 subnet for free (e.g., a /64). So it would be nice to get an IPv6 address, or even a subnet (/112 or something), per pod and one /128 IPv6 address per container.
The next feature would be a nice firewalling/security-group solution to restrict unwanted internet access to the pods, or even to restrict inter-pod access -- e.g., for different deployment stages (dev, test, prod, etc.).

@bgrant0607
Member Author

Thoughts on networking and naming, including more background, partly from Tim's aforementioned doc.

Kubernetes's current networking model is described here:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/DESIGN.md#network-model
and is described in some detail in issue #15 and below:

We start Docker with:
DOCKER_OPTS="--bridge cbr0 --iptables=false"

We set up this bridge on each node with SaltStack:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/cluster/saltbase/salt/_states/container_bridge.py

cbr0:
  container_bridge.ensure:
    - cidr: {{ grains['cbr-cidr'] }}
...
grains:
  roles:
    - kubernetes-pool
  cbr-cidr: $MINION_IP_RANGE

We make these addresses routable in GCE:
gcutil addroute ${MINION_NAMES[$i]} ${MINION_IP_RANGES[$i]} \
  --norespect_terminal_width \
  --project ${PROJECT} \
  --network ${NETWORK} \
  --next_hop_instance ${ZONE}/instances/${MINION_NAMES[$i]} &

The minion IP ranges are /24s in the 10-dot space.

GCE itself does not know anything about these IPs, though.

These are not externally routable, though, so containers that need to communicate with the outside world must use host networking. If we set up an external IP that forwards to the VM, it will only forward to the VM's primary IP (which is assigned to no pod). So we use Docker's -p flag to map published ports to the main interface. This has the side effect of disallowing two pods from exposing the same port. (More discussion on this in #390.)

We create a container to use for the pod network namespace -- a single loopback device and a single veth device. All the user's containers get their network namespaces from this pod networking container.

Docker allocates IP addresses from the bridge we create on each node, and we use its “container” networking mode as follows:

  1. Create a normal (in the networking sense) container which uses a minimal image and runs a command that blocks forever. This is not a user-defined container, and it gets a special well-known name.
  • creates a new network namespace (netns) and loopback device
  • creates a new pair of veth devices and binds them to the netns
  • auto-assigns an IP from Docker's IP range
  2. Create the user containers and specify their “net” argument as “container:<name of the network container created in step 1>”. Docker finds the PID of the command running in the network container and attaches to the netns of that PID.

The net result is that all user containers within a pod behave as if they are on the same host with regard to networking. They can all reach each other’s ports on localhost. Ports which are published to the host interface are done so in the normal Docker way. All containers in all pods can talk to all other containers in all other pods by their 10-dot addresses.
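
As a rough sketch of the mechanics (the container names and images here are placeholders, not the actual well-known names the Kubelet uses):

# Names and images below are illustrative placeholders.
# 1. Start the pod's network container; it just blocks forever and holds the netns.
docker run -d --name pod-net-example busybox sleep 1000000

# 2. Start the user containers, joining the network container's namespace.
docker run -d --name app --net container:pod-net-example myapp
docker run -d --name sidecar --net container:pod-net-example mysidecar

# app and sidecar now share one IP and one port space, and reach each other on localhost.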

In addition to avoiding the aforementioned problems with dynamic port allocation, this approach reduces friction for applications moving from the world of uncontainerized apps on physical or virtual hosts to containers within pods. People running application stacks together on the same host have already figured out how to make ports not conflict (e.g., by configuring them through environment variables) and have arranged for clients to find them.

It reduces isolation between containers within a pod -- ports could conflict, and containers could not have ports that are private from their peers in the pod. But applications requiring their own port spaces can just run as separate pods, and processes requiring private communication can run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control which containers belong to the same pod whereas, in general, they don't control which pods land together on a host.

When any container calls SIOCGIFADDR, it sees the IP that any peer container would see it coming from -- each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. (We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to communicate through volumes (e.g., tmpfs) or IPC.
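
As a minimal sketch of why this matters for self-registration (the etcd address, port, and key layout below are assumptions for illustration, not a prescribed scheme):

# Inside a container: the pod IP is the same address peers will see.
POD_IP=$(ip addr show eth0 | awk '/inet /{sub(/\/.*/, "", $2); print $2}')

# Register it with etcd's v2 key API; endpoint and key path are illustrative only.
curl -s -X PUT "http://10.0.0.2:4001/v2/keys/registry/myservice/$(hostname)" \
  -d value="${POD_IP}:8080"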

This is different from the standard Docker model. In that model, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container, the peer would see the connection coming from a different IP than the container itself knows. In short: you can never self-register anything from a container, because a container cannot be reached on its private IP.

An alternative we considered was an additional layer of addressing: pod-centric IP per container. Each container would have its own local IP address, visible only within that pod. This would perhaps make it easier for containerized applications to move from physical/virtual hosts to pods, but would be more complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) and to reason about, due to the additional layer of addressing, and would break self-registration and IP distribution mechanisms.

We want to be able to assign IP addresses externally from Docker (moby/moby#6743) so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across network container restarts (moby/moby#2801), and to facilitate pod migration. Right now, if the network container dies, all the user containers must be stopped and restarted because the netns of the network container will change on restart, and any subsequent user container restart will join that new netns, thereby not being able to see its peers. Additionally, a change in IP address would encounter DNS caching/TTL problems. External IP assignment would also simplify DNS support (see below). And, we could potentially eliminate the bridge and use network interface aliases instead.

IPv6 would be a nice option, also, but we can't depend on it yet. Docker support is in progress: moby/moby#2974, moby/moby#6923, moby/moby#6975. Additionally, direct IPv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-)

We'd also like to set up DNS automatically (#146). hostname, $HOSTNAME, etc. should return a name for the pod (#298), and gethostbyname should be able to resolve names of other pods. Probably we need to set up a DNS resolver to do the latter (moby/moby#2267), so that we don't need to keep /etc/hosts files up to date dynamically.

If we want Docker links and/or docker inspect to work, we may have work to do there. Right now, docker inspect doesn't show the networking configuration of the containers, since they derive it from another container. That information should be exposed somehow. I haven't looked to see whether link variables would be set correctly, but I think there's a possibility they aren't.

We need to think more about what to do with the service proxy. Using a flat service namespace doesn't scale and environment variables don't permit dynamic updates.

We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), non-load-balanced services (#260), and other types of groups (worker pools, etc.). Providing the ability to Watch a label selector applied to pod addresses would enable efficient monitoring of group membership, which could be directly consumed or synced with a discovery mechanism. Event hooks (#140) for join/leave events would probably make this even easier.
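
Roughly, a consumer of such a watch might look like the following; the endpoint and query parameter here are hypothetical, not the current API:

# Hypothetical endpoint and parameters, for illustration only.
curl -s "http://${APISERVER}/watch/pods?labels=name%3Dfrontend" | while read -r event; do
  # Each event describes a pod joining or leaving the label selection; a discovery
  # mechanism (DNS, etcd, a load balancer config) could be synced from these events.
  echo "membership change: ${event}"
done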

We'd even like to make pods directly routable from the external internet, though we can't do that yet. One approach could be to create a new host interface for each pod, if we had a way to route an external IP to it.

We're also working on making it possible to specify a different bridge for each container. We may or may not still need this, but it could be useful for certain scenarios:
https://botbot.me/freenode/docker-dev/2014-06-05/?msg=15716610&page=4
moby/moby#6155
moby/moby#6704

@bgrant0607
Member Author

For completeness, other network-related issues:

host networking: #175
multiple bridges: #222

@smarterclayton
Contributor

@ironcladlou is also looking at OpenVSwitch and OpenDaylight integrations - making it easier to deploy these sorts of topologies on non-cloud infrastructure (or clouds with limited networking).

@lexlapax

We did some work on Docker with Open vSwitch as a proof of concept, and would be interested in those integrations -- I would even venture to say that they will be of interest to cloud providers who are looking to provide higher-abstraction services, a la container/pod deployment, in lieu of or in addition to plain IaaS abstractions.

@bgrant0607 bgrant0607 modified the milestone: P0 Jul 16, 2014
@Lennie

Lennie commented Jul 18, 2014

@bgrant0607 there is one thing that seems to be in conflict in your document:

It is the use a range of IP-addresses per Docker host:
"We want traffic between containers to use the pod IP addresses across nodes. Say we have Node A with a container IP space of 10.244.1.0/24 and Node B with a container IP space of 10.244.2.0/24. And we have Container A1 at 10.244.1.1 and Container B1 at 10.244.2.1. We want Container A1 to talk to Container B1 directly with no NAT."

which seems to conflict with facilitating pod migration and stable IP addresses:

"We want to be able to assign IP addresses externally from Docker so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across network container restarts, and to facilitate pod migration."

Sounds to me like you have a choice between:

  • IPs really are static: you end up routing /32s after you migrate a pod/container
  • IPs are almost always static, unless the pod/container gets migrated
  • IPs are pretty much dynamic anyway and you should just stop caring
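
To make the trade-off concrete, a sketch of the routing in each case, reusing the example subnets quoted above (commands are illustrative):

# Per-node /24 routing (current model): one route per node.
ip route add 10.244.1.0/24 via ${NODE_A_IP}
ip route add 10.244.2.0/24 via ${NODE_B_IP}

# If pod IPs must stay stable across migration, a migrated pod's address no longer
# matches its new node's /24, so you end up injecting host routes per migrated pod:
ip route add 10.244.1.1/32 via ${NODE_B_IP}   # Container A1 after moving to Node B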

@bgrant0607
Member Author

@Lennie Yes, our current implementation doesn't support migration. Another possibility I'm considering is to allocate an IP per service, and use services as the stable addressing mechanism. That also would work in the case that a pod was replaced by a replicationController.

@Lennie

Lennie commented Jul 26, 2014

I've been racking my brain trying to find a solution to this problem for a while now.

Have you seen how consulate and ambassadord use service discovery to connect containers? https://github.com/progrium/consulate https://github.com/progrium/ambassadord https://github.com/progrium/registrator (previously named docksul)

It uses 4 ideas:

  • the linking and ambassador pattern
  • environment variables can be used as metadata
  • it uses a system similar to SkyDock's: it watches the Docker socket so it can inspect containers as they are started and register them with a service discovery service
  • it uses an iptables redirect to send traffic to the ambassadord proxy

What it does is: you first deploy an ambassadord container with a proxy server that has access to the Docker socket.

When you deploy a new container, you add metadata about which backend you want the container to link to and on which port, and you link it to the ambassadord container.

When the process in the new container tries to connect to its ambassador, it connects to the IP address of the proxy and the port of the backend it wants to reach. The iptables redirect then sends that connection to the proxy.

The proxy can see the source IP address and the original destination port. It can then use that information to look up the metadata of the source container, or consult service discovery data stored in etcd or Consul, and connect it to an available backend, which could be on another host.

If the service has to be found through service discovery, something has to register it there. That is what the other project, registrator, is for: it watches containers being started and stopped.
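
Not ambassadord's actual rules, but a sketch of the iptables-redirect idea under assumed addresses and ports:

# Assumptions: the ambassador proxy listens on 10.0.3.1:10000, and the app
# connects to port 6379 expecting to reach its backend.
iptables -t nat -A OUTPUT -p tcp --dport 6379 -j DNAT --to-destination 10.0.3.1:10000

# The proxy recovers the original destination port (e.g., via SO_ORIGINAL_DST),
# looks up the backend in etcd/Consul, and forwards the connection there.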

@smarterclayton
Contributor

Hi Lennie - the pattern you describe is familiar to me (see #494 for a similar discussion of interconnection). I do tend to think that the ambassador should be something the infrastructure provides, rather than something modeled directly as a Docker container, but you always need a pattern that works with only Docker.

  1. Container sees a stable remote connection address (IP and environment vars)
    • Use either a stable remote IP or a stable local iptables or network mapping inside each container to achieve it
  2. Connections are defined by a higher level decision (this container should see that container)
    • have a good way to define 1-N relationships globally (services) or between sets of pods (other types of services)
  3. use local loadbalancing if possible for resilience
    • must propagate logical changes (service endpoint IPs) to the host
  4. avoid excessive proxying and use real networking where possible
    • 2 hops when loadbalancing or if doing automatic TLS wrapping on direct connections
    • avoid creating an abstraction on top of IP or DNS - instead use those abstractions where possible

I think most of the pieces are in place in Kubernetes, but I'd like to see something like the consulate pattern.

@Lennie

Lennie commented Jul 26, 2014

I believe the pattern I described is actually meant to be a Docker container that is part of the infrastructure.

For one, it gets access to the Docker socket; you probably don't want to do that if it is not a trusted part of the infrastructure. And the step that sets up the iptables rules applied to the container needs --privileged too.

Have a look at the interview with the author:
http://progrium.com/blog/2014/07/25/building-the-future-with-docker/

The reason it is a Docker container is so it can be linked to, and if everything is a container, it is easier to deploy.

That ambassador is a local proxy which handles the load balancing and watches for service changes.

On 1 and 4, I totally agree on using real networking.

The question is:

do you want the infrastructure to provide the service discovery, reconfiguration, and load balancing, or do you want the developer of the container being deployed to handle it? Probably the first, and in that case you'll need a local proxy.

@smarterclayton
Contributor

Sorry, I meant as a single ambassador container per remote link (should have clarified). Running the infrastructure proxy as a container certainly makes sense. And links v2 in Docker is moving toward the idea of a discovery hub on the host that can be externally configured, potentially as another container listening on libchan.

The service proxy on each minion implements much of this pattern today, although I'd additionally like to offer the ability for the IPs of pods that correspond to a label query to be late-bound as environment variables at container start time (to allow direct connection), as well as to create local virtual network links per container for singletons (for ease of development: 127.0.0.1:3306 in your container points to either a local service proxy or a pod IP) that can be late-bound dynamically.
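
For the late-bound environment variable case, the idea would be roughly the following (variable names and addresses are invented for illustration):

# At container start time the infrastructure resolves the label query and injects
# the matching pod IP as environment variables (names are invented):
docker run -d \
  -e DB_HOST=10.244.2.9 \
  -e DB_PORT=3306 \
  myapp
# The application connects directly to $DB_HOST:$DB_PORT, bypassing the proxy
# for that particular link.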

@Lennie

Lennie commented Jul 27, 2014

That is why I mentioned consul, because it has at least three of these properties:

  • it uses one proxy to handle all communication from the host to other hosts (and local backends of course). Obviously, you can still use more of these proxies if you have many containers.
  • it already uses environment variables as meta-data.
  • it also uses container linking, so it supports the model for local development of linking containers on one machine without using service discovery and a proxy.

@Lennie

Lennie commented Jul 27, 2014

So far the only thing it doesn't do -- and I haven't seen anyone implement or even mention it -- is multi-host inter-container communication firewalling with iptables (--icc=false). It is almost the complete opposite of direct connections.

That doesn't mean I haven't thought about it.

And my idea right now is, maybe it can be done.

If all containers use the proxy to talk to other containers, then the containers on the source host shouldn't be able to talk to anything else but the proxy. This is almost what icc=false does right now when you use linking.

If you have that, then all you need to do is set up a firewall rule on the destination host with a set of IP-addresses of source hosts. And you can actually use a pretty efficient ipset for that. Maybe even just 1 ipset.

Obviously, in the current model, where every pod has an IP address, that list might grow pretty large.

Docker has one mode for publishing a port on its public IP address: publish. Expose only has an effect locally. But you could have a third mode, with one iptables rule using that ipset. You could even have something like Consul in a Docker container with --net=host automatically manage that ipset based on something like service discovery, only it would just hold a list of all the hosts of this deployment.
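
A sketch of that ipset idea, assuming the set holds the host IPs of this deployment (set name, addresses, and port are illustrative):

# On each destination host: allow published container ports only from hosts
# that belong to this deployment.
ipset create deployment_hosts hash:ip
ipset add deployment_hosts 10.240.0.11
ipset add deployment_hosts 10.240.0.12

iptables -A FORWARD -p tcp --dport 8080 -m set --match-set deployment_hosts src -j ACCEPT
iptables -A FORWARD -p tcp --dport 8080 -j DROP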

One of the reasons I would like to see something like that is that I want some kind of multitenancy. Not for different customers, but for different deployments of different (or the same) applications from the same customer/developer/user.

Yeah, I know maybe that is just crazy talk. :-)

@smarterclayton
Contributor

That matches with at least some of the plans we have to enable MT in Kubernetes - it's probably just a matter of time before someone takes a stab at it.

@bgrant0607 bgrant0607 added this to the v1.0 milestone Aug 27, 2014
@MalteJ

MalteJ commented Sep 24, 2014

I am currently working on a Docker IPv6 implementation. The first tests look good. Every Docker host has a subnet from which it delegates one IPv6 address to each container.
The next step is to write some documentation and break the work down into smaller chunks, to create nice pull requests and get it into the Docker master branch.
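
Roughly, the per-host delegation could look like this (the prefixes are documentation examples, not what the implementation mandates):

# Suppose the cluster owns 2001:db8:1::/48 and each Docker host gets a /64:
#   host A: 2001:db8:1:a::/64    host B: 2001:db8:1:b::/64
# Each container receives one address from its host's /64, e.g. 2001:db8:1:a::5.
# Other hosts (or the upstream router) route each /64 toward the owning host:
ip -6 route add 2001:db8:1:a::/64 via 2001:db8:1::a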

@bgrant0607
Member Author

I think this issue is pretty well covered by other issues, such as #494, #146, #1107, #1261, #1307, #1437.

rphillips added a commit to rphillips/kubernetes that referenced this issue Jan 9, 2018
b3atlesfan pushed a commit to b3atlesfan/kubernetes that referenced this issue Feb 5, 2021
Make sure the cursor is a string in JSON
pjh pushed a commit to pjh/kubernetes that referenced this issue Jan 31, 2022
linxiulei pushed a commit to linxiulei/kubernetes that referenced this issue Jan 18, 2024