Proposal: decouple networking for segmentation and other use cases #3350
Comments
This makes a lot of sense to me, and matches many use cases we believe exist for real-world deployments at scale. One fundamental question is how disjoint networks with overlapping IP space require changes to core Kube assumptions. I believe at a minimum within a namespace, or related sets of namespaces, we should continue to require IPs to be unique, and look at how across namespaces we can define separate regions of behavior. |
@smarterclayton - agree; keeping IPs unique within a namespace makes sense, and defining separate regions of behavior per namespace. I don't think the code changes are trivial; you along with other major contributors can comment more. Even if refactoring the code is a bigger change, IMO the long-term benefits of an API-based interface for network functionality are worth it for future-proofing Kubernetes. In fact, it is in line with the design principles of Kubernetes, which clearly define discrete functions that plug in to make a solution. |
If k8s-allocated service IPs are not by default unique across namespaces then we would want to have a way to allocate ones that are. Something along the lines of |
It's possible service IPs can be the exception to the rule because they are "virtual" - if you wanted to use a real service IP (via some special integration mechanism), we'd want to change how service IPs are allocated from "blocking, inside POST" to "controller driven via watch, extensible". Other code would then need to handle "service doesn't have an IP yet", which doesn't seem onerous. With that in mind you'd be able to build integrations that do IP allocation to real systems for performance, or for more advanced virtual IPs (not iptables) via SDN controllers that take into account what flows you need to be part of.
|
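As an illustration, a minimal sketch of the "controller driven via watch" pattern for service IP allocation; the Service, event, and allocator types here are stand-ins invented for the example, not the actual Kubernetes client API:

```go
package main

import "fmt"

// Service is a pared-down stand-in for the Kubernetes Service object.
type Service struct {
	Name      string
	Namespace string
	ClusterIP string // empty until an allocator fills it in
}

// ServiceEvent is a stand-in for a watch event on Services.
type ServiceEvent struct {
	Type    string // "ADDED", "MODIFIED", "DELETED"
	Service Service
}

// IPAllocator abstracts whatever backend hands out service IPs
// (a flat range, an SDN controller, a real IPAM system, ...).
type IPAllocator interface {
	Allocate(namespace, name string) (string, error)
}

// allocateLoop is the pattern described above: services are created without
// an IP, and a separate controller observes them and fills the IP in
// asynchronously. Other components must tolerate an empty ClusterIP.
func allocateLoop(events <-chan ServiceEvent, alloc IPAllocator, update func(Service) error) {
	for ev := range events {
		if ev.Type == "DELETED" || ev.Service.ClusterIP != "" {
			continue // nothing to do: gone, or already has an IP
		}
		ip, err := alloc.Allocate(ev.Service.Namespace, ev.Service.Name)
		if err != nil {
			fmt.Printf("allocation failed for %s/%s: %v\n", ev.Service.Namespace, ev.Service.Name, err)
			continue
		}
		ev.Service.ClusterIP = ip
		if err := update(ev.Service); err != nil {
			fmt.Printf("update failed for %s/%s: %v\n", ev.Service.Namespace, ev.Service.Name, err)
		}
	}
}

// fakeAllocator stands in for an external allocation backend.
type fakeAllocator struct{}

func (fakeAllocator) Allocate(ns, name string) (string, error) { return "10.0.0.42", nil }

func main() {
	events := make(chan ServiceEvent, 1)
	events <- ServiceEvent{Type: "ADDED", Service: Service{Namespace: "default", Name: "web"}}
	close(events)
	allocateLoop(events, fakeAllocator{}, func(s Service) error {
		fmt.Printf("would PUT %s/%s with ClusterIP %s\n", s.Namespace, s.Name, s.ClusterIP)
		return nil
	})
}
```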
Agree with all the use cases listed in this proposal.
Proposed hooks
Proposing a basic call-out API (southbound) from kubernetes, where 'sdn' is a packaged plugin or an interface that is free to call out to a binary if it chooses to.
Integration examples
A table on how some solutions could use the above APIs
The OVS simple/complex solutions have been borrowed from here. |
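For illustration only, one possible shape for such a southbound call-out API, assuming hypothetical hook names (the actual proposed hooks and the integration-examples table are not reproduced here); it also sketches the "free to call out a binary" option:

```go
package main

import "fmt"

// NetworkPlugin is a hypothetical shape for the southbound call-out API
// described above; the hook names are illustrative, not the proposed ones.
type NetworkPlugin interface {
	// Init is called once when the hosting component starts.
	Init() error
	// SetUpPod is called after the pod's network namespace exists but before
	// containers start; the plugin wires the namespace into its fabric.
	SetUpPod(namespace, podName, netnsPath string) error
	// TearDownPod is called when the pod is deleted.
	TearDownPod(namespace, podName, netnsPath string) error
	// OnNamespaceCreated lets a provider map a Kubernetes namespace to a
	// network segment (VXLAN VNI, VLAN tag, OVS tunnel id, ...).
	OnNamespaceCreated(namespace string) error
	OnNamespaceDeleted(namespace string) error
}

// execPlugin shows the "call out a binary" option: each hook shells out to an
// external executable registered by the network provider.
type execPlugin struct{ binary string }

func (p execPlugin) Init() error { return run(p.binary, "init") }
func (p execPlugin) SetUpPod(ns, pod, netns string) error {
	return run(p.binary, "setup", ns, pod, netns)
}
func (p execPlugin) TearDownPod(ns, pod, netns string) error {
	return run(p.binary, "teardown", ns, pod, netns)
}
func (p execPlugin) OnNamespaceCreated(ns string) error { return run(p.binary, "ns-add", ns) }
func (p execPlugin) OnNamespaceDeleted(ns string) error { return run(p.binary, "ns-del", ns) }

func run(bin string, args ...string) error {
	fmt.Println("exec:", bin, args) // a real implementation would use os/exec here
	return nil
}

func main() {
	var plug NetworkPlugin = execPlugin{binary: "/usr/libexec/kubernetes/net-plugin"}
	_ = plug.SetUpPod("default", "web-1", "/proc/1234/ns/net")
}
```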
Why are these not simply watching Kubernetes and making the appropriate changes to other resources? And why not just pass this info into docker at network creation time?
|
More concretely, invert this so you're observing Kube and putting the right information in the right place. We don't use "send a notification that this happened" for the core code paths, so we need to change the pattern for getting the info you need to do the right thing at the network layer. E.g.
A better bet may be to plug in to the kubelet or Docker and block startup until that information becomes available. It's certainly possible for you to annotate the pods themselves and read that data, but you may need a side channel. It'd be best to demonstrate the concrete steps we need for each piece (assignments, mappings, and flow of info) so that we can find the solution that preserves the core flows in place today.
|
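A rough sketch of the "block startup until that information becomes available" idea: the kubelet (or a plugin in it) waits for an annotation that an external network controller writes once it has assigned network information. The annotation key and the pod-lookup function are assumptions made up for this example:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const networkReadyAnnotation = "example.io/network-assignment" // hypothetical key

type Pod struct {
	Name        string
	Annotations map[string]string
}

// waitForNetworkInfo polls until the annotation appears or the deadline passes.
func waitForNetworkInfo(getPod func(name string) (Pod, error), name string, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pod, err := getPod(name)
		if err == nil {
			if v, ok := pod.Annotations[networkReadyAnnotation]; ok {
				return v, nil
			}
		}
		time.Sleep(500 * time.Millisecond)
	}
	return "", errors.New("timed out waiting for network assignment")
}

func main() {
	// Fake pod source that "receives" the annotation after a second.
	start := time.Now()
	getPod := func(name string) (Pod, error) {
		p := Pod{Name: name, Annotations: map[string]string{}}
		if time.Since(start) > time.Second {
			p.Annotations[networkReadyAnnotation] = `{"segment":"dev","ip":"10.1.2.3"}`
		}
		return p, nil
	}
	info, err := waitForNetworkInfo(getPod, "web-1", 5*time.Second)
	fmt.Println(info, err)
}
```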
@rajatchopra, @smarterclayton Instead of jumping to the actual API, may I suggest that we discuss two things:
[1] Design goals of such an API:
[2] Trigger point requirements for such an API (to keep things simple and consistent, we can assume they are triggered on both master/kubelet)
Of course, there is the reality of the code:
Like others, I have many more things to discuss. However, IMO it is most important to get consensus about the design goals and requirements first. And of course there is my original question from the beginning - do the Kubernetes designers also think this is a good direction? @thockin, @brendanburns, @lavalamp, @jbeda |
Can we change the name of this to "Proposal: support network segmentation"? Fixing up scheduling would be a relatively minor requirement.

If we want to support network segmentation, I think it's logical that namespaces are the units that you place into a network segment (so multiple namespaces could be on the same segment, but a single namespace can't be split across segments). I also think that each kubelet should be in only one network segment, which makes everything way easier.

I see some discussion about k8s components making API calls to network fabric implementations every time a pod starts. That doesn't sound scalable to me. Some evolution of our current system, where each kubelet gets a range of IP addresses, is probably more feasible. I could maybe see doing a call out to a network fabric to request an IP range for a node when it joins the cluster.

The other part of this is defining some way of specifying how bridging happens, so you can use a service in another namespace. I think it'll be clear how to do that once we get the service object hashed out; it may be as simple as just making "external" service localNS/foo that points to remoteNS/serviceFoo's public IP(s).

Scheduling should be as simple as putting a segment label (or something similar) on nodes. |
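To make that last point concrete, a small sketch of a scheduler predicate keyed off a segment label on nodes; the label key and the namespace-to-segment mapping are hypothetical:

```go
package main

import "fmt"

const segmentLabel = "example.io/network-segment" // hypothetical label key

type Node struct {
	Name   string
	Labels map[string]string
}

// fitsSegment returns true if the node is in the segment the namespace needs.
// namespaceSegment would come from whatever maps namespaces to segments.
func fitsSegment(namespaceSegment string, node Node) bool {
	if namespaceSegment == "" {
		return true // namespace not segmented; any node will do
	}
	return node.Labels[segmentLabel] == namespaceSegment
}

func main() {
	nodes := []Node{
		{Name: "node-a", Labels: map[string]string{segmentLabel: "tenant-1"}},
		{Name: "node-b", Labels: map[string]string{segmentLabel: "tenant-2"}},
	}
	for _, n := range nodes {
		fmt.Printf("pod in segment tenant-1 fits %s: %v\n", n.Name, fitsSegment("tenant-1", n))
	}
}
```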
We have previously talked about partitioning kubelets into groups for reasons not specifically related to networking (such as QoS). We should think about how network segments would interact with partitions. |
I'm not sure how well it will work to require kubelets to be in only one segment.
|
I guess it depends on how much stuff will be running in a single namespace. At any rate, it is probably possible to make kubelet multi-segment-aware in the future. But it's probably easier to get it working/designed initially without that requirement. |
I think we have a concrete use case today in OpenShift (for which we're going to do extra work, obviously) that would have many segments per kubelet. For dev-style environments you want aggressive encapsulation, and you'd run out of kubelets before you ran out of people who just want to run a 128MB web server that uses 0.1% of the CPU.
|
agree with @smarterclayton. My requirement also fits multiple segments per kubelet much better - segments are very disposable and scale to high counts. To avoid the scale challenge it is best to support multiple segments per kubelet. In fact, some people rightfully use the term 'micro-segmentation' to describe the use cases. @lavalamp: How about if we remove scheduling from the proposal title, and keep it 'Proposal: decouple networking for a wider set of use cases' |
@rajatchopra
Just food for thought: How safe is it to program an Open vSwitch in a VM where containers run? A privileged container can easily change the tunnel-id or flows and access traffic from other tenants. IMO, container security is not yet established enough to trust containers not to break out and change the OVS in a VM. I think a better approach to consider is to tag traffic in the VM and create tunnels in the hypervisor. Not sure Kubernetes can handle that (so the worst case is that containers from the same subnet will get affected, but not containers from other tenants). |
@shettyg I think that security should be a separate concern. If one is talking about multi-tenancy then they have already bought into the security of containers (however secure or not secure they may be). @jainvipin Agree that the title of the proposal should better reflect the discussion. Also, to note here is that we have a working prototype (read: hacked together) for multi-tenancy in kubernetes, and the proposals from @rajatchopra are from that POC. It will be good to get feedback and agreement on the overall design to accommodate different networking solutions. @smarterclayton Here are some concrete examples for the items --
|
@mrunalp
Let me clarify. What I am saying is that you can achieve multi-tenancy securely for containers, if you do the following.
If a container does break out in the VM, the worst case is that it will change the VLAN tag. So the hypervisor can believe that traffic coming out of a container is from a different container in the same VM. But it will only affect a single tenant. My question for everyone is whether the Kubernetes architecture allows it to make changes in the hypervisor that hosts a VM. (I really have no idea how Kubernetes internals work.) Is this model even workable with Kubernetes? |
@mrunalp directed me here. I would like to run an application under Kubernetes that by design needs to be attached to the same L2 broadcast domain as some devices that it will manage (because it uses a broadcast discovery mechanism). If this were the only service I were running I would simply add the appropriate physical NIC to the default docker bridge and it would Just Work...

...but it's not the only contained service I want to run, and other services don't need to be/shouldn't be attached to the same physical network. I would like a way to define "networks" to Kubernetes such that pods can request attachment to specific segments. I imagine that this would work by moving network configuration out of Docker and into Kubernetes, such that the network namespace container would be created with

The hook proposal from @rajatchopra is sort of what I was thinking about:
|
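A speculative sketch of what "define networks to Kubernetes" could look like as API objects; none of these types exist in Kubernetes, they only illustrate the requested shape:

```go
package main

import "fmt"

// Network describes an L2 segment that pods can ask to be attached to.
type Network struct {
	Name    string
	Kind    string // "bridge", "vlan", "vxlan", ...
	Uplink  string // physical NIC or bridge to attach, e.g. "eth1"
	VLANTag int    // optional; 0 means untagged
}

// PodSpecNetworking is what a pod would carry instead of relying on the
// default docker bridge: a list of named networks to join.
type PodSpecNetworking struct {
	Networks []string // names of Network objects, e.g. ["mgmt-l2"]
}

func main() {
	mgmt := Network{Name: "mgmt-l2", Kind: "bridge", Uplink: "eth1"}
	pod := PodSpecNetworking{Networks: []string{"mgmt-l2"}}
	fmt.Printf("pod joins %v; %s bridges to %s\n", pod.Networks, mgmt.Name, mgmt.Uplink)
}
```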
@larsks - completely agree; in fact, what you say describes the use case I tried describing in the proposal. And with hooks provided for this, it would be possible for an external entity to configure networking. This is in line with Docker's network drivers proposal; it fits really well, since the network can be flexibly configured for the use cases described in this proposal and the one you describe. As @erictune pointed out in #3585, it would make sense to prototype and learn more about the best possible ways to provide this flexibility in Kubernetes. I am playing with some code as we speak... |
How about a community hangout next Friday (20th Feb) to discuss networking plugins? |
I have scheduled a public hangout https://plus.google.com/events/cfgii4a4qgu5lhpptlgs1ll23o8 Does that work? |
Edited to move this earlier to 10:00 a.m. to accommodate more folks. @jainvipin, does that work for you? |
@mrunalp - works for me... |
Was the event moved to 9 AM ? |
I have to apologise. Yes, it was re-arranged to 9am, but I did not put up the final time in this thread. There was a parallel email chain (on google-containers at googlegroups) about scheduling this, and I missed that everybody may not be on that group.
Summary/action-items of the hangout:
@thockin, kindly add/edit any minutes that I missed. |
@rajatchopra - I do want to bring up an important point that I brought up in the meeting as well:
The difference is big. If we target the use cases, we will carve out the hooks/APIs that cater to those use cases; Flannel/Weave/Openshift-sdn are ways to achieve those use cases. I'd also put OVS as something to be consumed by tools/solutions like flannel/weave/openshift-sdn, instead of being in the same category. |
I just want to add that I'm supportive of the direction discussed here. Kubernetes should absolutely allocate network resources and orchestrate network configuration on behalf of applications. |
I am supportive of the direction in which this is going. I want to bring up one more use case - deploying a load balancer like haproxy or nginx within the cluster (instead of using the service proxy). Assume there is a pool of IP addresses (maybe a /24 subnet) dedicated to external connectivity (public IPs) for the cluster. Each such load balancer needs to be assigned an IP from this pool. If there are multiple load balancers supporting a service, assume DNS load balancing takes care of that. This brings in two concepts that Kubernetes doesn't support today.
Happy to write up a formal proposal and contribute code if there is interest. |
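A minimal sketch of the public-IP-pool idea, assuming the pool is just a /24 expanded into individual addresses and handed out one per load balancer (persistence, conflict handling, and the API surface are deliberately left out):

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

type IPPool struct {
	free []string
	used map[string]string // ip -> owner, e.g. "default/haproxy-1"
}

// NewIPPool expands a CIDR such as "203.0.113.0/24" into allocatable hosts.
func NewIPPool(cidr string) (*IPPool, error) {
	ip, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, err
	}
	p := &IPPool{used: map[string]string{}}
	for ip := ip.Mask(ipnet.Mask); ipnet.Contains(ip); inc(ip) {
		p.free = append(p.free, ip.String())
	}
	if len(p.free) > 2 {
		p.free = p.free[1 : len(p.free)-1] // drop network and broadcast addresses
	}
	return p, nil
}

// Allocate hands the next free address to a load balancer.
func (p *IPPool) Allocate(owner string) (string, error) {
	if len(p.free) == 0 {
		return "", errors.New("pool exhausted")
	}
	ip := p.free[0]
	p.free = p.free[1:]
	p.used[ip] = owner
	return ip, nil
}

// inc advances an IP address by one (with carry).
func inc(ip net.IP) {
	for i := len(ip) - 1; i >= 0; i-- {
		ip[i]++
		if ip[i] != 0 {
			break
		}
	}
}

func main() {
	pool, _ := NewIPPool("203.0.113.0/24")
	ip, _ := pool.Allocate("default/haproxy-1")
	fmt.Println("assigned", ip)
}
```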
+1 on making the scheduler aware of network/storage resources, which are sometimes cluster-wide and not host-specific. Specifically for IPAM, we discussed this when coming up with the minimal/extended APIs under this proposal. It may not be the absolutely desired thing, as long as the API can return an error and thus the pod is prevented from getting scheduled. Generically, the ability to evaluate a resource that is shared across nodes, and the corresponding code changes, might need a separate discussion thread. |
Proposing the most basic needs/changes.
This indicates that the 'network providers' will have to be run as separate services, and not compiled with kube code. |
The cloud provider is compiled into the apiserver and set via a flag on the apiserver. I think minions are watchable, but if they aren't, agree that they should be. Minions aren't in namespaces, so maybe you don't need to watch namespaces? Annotations are already on the pod; that should work. Not following the logic of why network providers have to be services. |
Yes. Thanks for the pointer.
Yes. They are watchable; I was looking at a really old checkout. Namespaces are a separate thing, and apparently the watch API exists there also. Some network providers may be acting on namespaces appearing/disappearing.
By services, I just meant separate processes/daemons. None of the 'network provider' daemons will be invoked by any of the kube processes. The alternative would have been to start the registered network-provider daemon upon Init of apiserver/kubelet. |
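A rough sketch of a network provider running as its own daemon: it observes node add/delete events (the actual watch plumbing is elided) and assigns each new node a pod subnet. The event type and the subnet scheme are illustrative only:

```go
package main

import "fmt"

type NodeEvent struct {
	Type string // "ADDED" or "DELETED"
	Name string
}

type subnetManager struct {
	next    int
	subnets map[string]string // node name -> assigned subnet
}

func (m *subnetManager) handle(ev NodeEvent) {
	switch ev.Type {
	case "ADDED":
		if _, ok := m.subnets[ev.Name]; ok {
			return // already assigned
		}
		subnet := fmt.Sprintf("10.244.%d.0/24", m.next)
		m.next++
		m.subnets[ev.Name] = subnet
		// A real provider would now program its fabric (flannel, OVS, ...)
		// and publish the assignment (annotation, etcd, ...) for the kubelet.
		fmt.Printf("node %s gets pod subnet %s\n", ev.Name, subnet)
	case "DELETED":
		delete(m.subnets, ev.Name)
	}
}

func main() {
	m := &subnetManager{subnets: map[string]string{}}
	for _, ev := range []NodeEvent{{"ADDED", "node-a"}, {"ADDED", "node-b"}} {
		m.handle(ev)
	}
}
```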
That is what I thought you meant by services. A cloud provider in kubernetes is a compiled-in bit of go code. It may in turn talk to a web service like GCE. I'm wondering if/why network providers are different. |
A separate binary for the two southbound hooks from the kubelet mentioned above will require the kubelet to do an active wait, unless state transitions are handled asynchronously, i.e. more code for async handling.
|
Assuming a network provider watches Nodes, what is the relationship between a Node being in a Ready status and the network actually being set up? Today a NodeController inserts a Node, and once the Kubelet reports an ok health check, the node is considered Ready. Should a Node really be Ready if its network is not yet set up? If I understand what is proposed, how will the Kubelet know it's ready?
|
@erictune said:
We could create interfaces and compile in network provider implementations, but from the discussions we have had so far it seemed like there was more inclination towards keeping the network providers independent. It could certainly be changed if we feel that is the wrong direction. |
@erictune It will be difficult to standardize on the interface if we were to choose a compile-in option. There are just too many possible aspects of the network (L2/L3/firewall/encap etc.) and different solutions may have completely non-overlapping needs. It will be difficult to get a minimum viable set, e.g. VXLAN vs SR-IOV vs ipvlan. @derekwaynecarr We may want a sync mechanism in the future, and finalizers are one approach, i.e. the kubelet does not report itself ready unless the minion resource has certain finalizer flags set. Nevertheless, we would want the 'network provider' to be ready and resilient to the lifecycle of the kubelet. |
FYI: I am trying to fix terminology. Initializers deal with making something ready. Finalizers deal with doing what is needed before an object is allowed to be deleted.
|
I apologize for not responding to this thread before. I have been reading along and chewing on it. I went back through the whole thing today and here are some notes.

First, I very much agree this is useful and possible. I think the approach seems OK. I admit that I am NOT familiar enough with some of the deep networking stuff to know how it works and what it needs, so I apologize for any dumb questions.

Second, I think it is important to keep this as simple as possible and free from details of specific use-cases. That said, it has to actually solve problems and be usable, so I am looking for real experience as we iterate through this and evolve a solution. I don't want to force people to jump through terrible hoops to achieve results, but we just can't accommodate every case in the core design. I said "evolve". I doubt very much we'll get it right on the first try, so let's keep in mind that this is a pre-release thing until we have a few solid testimonials. Until then, any early-adopters had better be ready to adjust as we fine-tune :)

On service IPs: An option for freeing up and differentiating service IPs might

Re: linked-in vs not, I think that for a number of low-bandwidth plugins we

Re: When is a node ready? If we have a node init hook, the node is ready when that hook completes.

It's not clear to me what the relationships between network and cloud provider and node should be, but I can certainly see use cases for flannel on GCE, for example, so they are not the same concern - there's some orthogonality.

I'm going to read #5069 next and try to get concrete. |
New proposal PR opened at #15465, let's make this happen |
The network SIG is working on a proposal, so I am closing the existing proposals. Please reopen if you think I am wrong. :) |
The Kubernetes network design works great for interconnecting containers and building a services construct to avoid port conflicts, etc. I am capturing some use cases that, in my opinion, are not supported by the current network design (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/networking.md).
Use-Case 1: Multi-tenancy - multiple disjoint container networks:
Today, a container can talk to another container as long as it knows the IP and port (or perhaps it can scan the entire range). Supporting isolation between multiple container networks will disallow unintended communication, especially if a user is running applications developed by third parties. This is a fundamental need for multi-tenant deployments. It also provides the separation needed to prevent accidental (non-malicious) inter-connectivity/communication between sets of apps; for example, a development app server can't accidentally connect to a production db, etc. The separation also allows for overlapping IP address space in disjoint networks (see the IPAM comment below), a secure PaaS infrastructure, and flexible rules governing how these apps can communicate with each other.
Use-Case 2: Bridged/L2-connectivity
Bridged connectivity at layer 2 between applications/containers will help enable the following sub use-cases:
Use-Case 3: SDN applications
Most SDN apps manage the life-cycle of various flows by applying rules that communication between apps needs to abide by, to get the advantages of software-defined networking in very unique ways. Allowing manipulation of a pod's IP/network, and eventually of flows (iptables in a linux-bridge or openflow rules in ovs), would allow the Kubernetes controller to be used in such deployments to schedule jobs and all the rest of the goodness. Perhaps flow manipulation can be done outside Kubernetes (pardon my ignorance, I am still catching up on the code); certainly IP/MAC allocation, subnet reachability, and the hooks thereof are the ones tied in closely.
Use-Case 4: Multicast Applications
Applications that use IP multicast (PIM-SM/SSM, or PIM-BIDIR), for example to stream video webcasts, will benefit from container deployment. The network model of Kubernetes needs to change to allow applications to join (via IGMP) a multicast tree, or to become a source of a multicast tree, by allowing multi-destination traffic to be sent within the network (the need for multi-destination apps in a deployment is stated above).
Then there are clustering applications that rely on multicast to discover their peers (but that discovery is usually done within a bridged domain, which ties into the previous use case stated earlier).
Use-Case 5: IPAM integration with current tools
For infrastructure users that follow a somewhat sophisticated IPAM policy (http://en.wikipedia.org/wiki/IP_address_management), for example CNR, decoupling IPAM from the scheduling/replication controllers in Kubernetes will allow backward compatibility with their current suite of tools.
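A hedged sketch of what decoupled IPAM could look like behind an interface, so an external tool owns address assignment while Kubernetes only consumes the result; all names here are hypothetical:

```go
package main

import "fmt"

// IPAMProvider is what Kubernetes would call instead of doing its own
// address assignment; an error here would keep the pod from being started.
type IPAMProvider interface {
	RequestAddress(namespace, podName string) (ip string, gateway string, err error)
	ReleaseAddress(namespace, podName string) error
}

// staticProvider is a trivial in-process stand-in for an external IPAM tool.
type staticProvider struct{ next int }

func (p *staticProvider) RequestAddress(ns, pod string) (string, string, error) {
	p.next++
	return fmt.Sprintf("172.16.0.%d", p.next), "172.16.0.1", nil
}

func (p *staticProvider) ReleaseAddress(ns, pod string) error { return nil }

func main() {
	var ipam IPAMProvider = &staticProvider{}
	ip, gw, _ := ipam.RequestAddress("prod", "db-0")
	fmt.Printf("pod gets %s via %s\n", ip, gw)
}
```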
Summary
Kubernetes is a great tool; decoupling the networking code and exposing it as APIs would allow more use cases. To that end, the proposal asks to:
If people agree this is worth a shot, then, along with the community, I can work on the code changes and submit them back for review towards this initiative.
Disclaimer: I work for Cisco