
Proposal: decouple networking for segmentation and other use cases #3350

Closed
jainvipin opened this issue Jan 9, 2015 · 42 comments
Labels
kind/design Categorizes issue or PR as related to design. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@jainvipin

The Kubernetes network design works great for interconnecting containers and builds the services construct to avoid port conflicts, etc. I am capturing some use cases that, in my opinion, are not supported by the current network design (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/design/networking.md).

Use-Case 1: Multi-tenancy - multiple disjoint container networks:

Today, a container can talk to another container as long as it knows the IP and port (or perhaps it can scan the entire range). Supporting isolation between multiple container networks will disallow unintended communication, especially if a user is running applications developed by third parties. This is a fundamental need for multi-tenant deployments. It also provides the separation needed to prevent accidental (non-malicious) inter-connectivity between sets of apps; for example, a development app server can't accidentally connect to a production db. The separation also allows for overlapping IP address space in disjoint networks (see the IPAM comment below), a secure PaaS infrastructure, and flexible rules governing how these apps can communicate with each other.

Use-Case 2: Bridged/L2-connectivity

Bridged connectivity at layer 2 between applications/containers will help enable the following sub use-cases:

  • Applications that rely on multi-destination services (e.g. mDNS for auto-discovery, or broadcast for DHCP-based IPAM) will benefit from this support.
  • Applications that require communicating with non-container applications at the L2 level. Say an infrastructure provider has a hardware-accelerated firewall device and requires that all traffic exiting a given L2 domain (aka bridged network) traverse the firewall.
  • Applications that require their peer apps to be in the same subnet. While I understand there are very few apps that still talk non-IP, apps that cluster together do use some basic form of adjacent L2 reachability (e.g. JBoss clusters) to discover/interact with each other. Today this is not possible, since an entire subnet belongs to one host. With some changes to flannel (or an equivalent overlay), it should be possible to support this with host-route advertisements, installing these routes upon receiving updates from other hosts (think scale!). On the other hand, the use of services for this is neither needed nor desired (for scale and indirection).

Use-Case 3: SDN applications

Most SDN apps manage the life-cycle of various flows by applying rules that inter-app communication must abide by, taking advantage of the software-defined network in very unique ways. Allowing pod IP/network and eventually flow manipulation (iptables in linux-bridge, or openflow rules in ovs) will allow the Kubernetes controller to be used in such deployments to schedule jobs and all the rest of the goodness. Perhaps flow manipulation can be done outside Kubernetes (pardon my ignorance, I am still catching up on the code); certainly IP/MAC allocation, subnet reachability, and the hooks thereof are tied in closely.

Use-Case 4: Multicast Applications

Applications that use IP multicast (PIM-SM/SSM, or PIM-BIDIR), for example to stream a video webcast, will benefit from container deployment. The network model of kubernetes needs to change to allow applications to join (igmp) a multicast tree, or to become a source of a multicast tree by being allowed to send multi-destination traffic within the network (the need for multi-destination apps in deployments is stated above).
Then there are clustering applications that rely on multicast to discover their peers (but that discovery is usually done within a bridged domain, which ties into the use case stated earlier).

Use-Case 5: IPAM integration with current tools

For infrastructure users with a somewhat sophisticated IPAM policy, for example CNR (http://en.wikipedia.org/wiki/IP_address_management), decoupling IPAM from scheduling/replication-controllers in kubernetes would allow backward compatibility with their current suite of tools.

Summary

Kubernetes is a great tool; decoupling the networking code and exposing it as APIs would allow more use cases. To that end, the proposal asks to:

  • Keep the current networking model as the default (i.e. no change to the user experience), but decouple networking from the rest of the functions (aka scheduling/replication-controllers/services).
  • Allow others to plug in to, or enhance, Kubernetes itself to support the above use cases.
  • Fit in with the network plugin model being worked on and proposed in the Docker and CoreOS communities (Proposal: Network Drivers moby/moby#9983 and Proposal: Rocket Networking rkt/rkt#273).

If people agree this is worth a shot, then, along with the community, I can work on the code changes and submit them back for review towards this initiative.

Disclaimer: I work for Cisco

@jainvipin changed the title from "decouple networking from scheduling" to "Proposal: decouple networking from scheduling" on Jan 9, 2015
@smarterclayton
Contributor

This makes a lot of sense to me, and matches many use cases we believe exist for real-world deployments at scale. One fundamental question is how disjoint networks with overlapping IP space require changes to core Kube assumptions. I believe at a minimum, within a namespace or related sets of namespaces, we should continue to require IPs to be unique, and look at how, across namespaces, we can define separate regions of behavior.

@bgrant0607 bgrant0607 added sig/network Categorizes an issue or PR as relevant to SIG Network. kind/design Categorizes issue or PR as related to design. labels Jan 10, 2015
@jainvipin
Author

@smarterclayton - agree; keeping the IPs unique within a namespace makes sense, as does defining separate regions of behavior within each namespace.

I don't think the code changes are trivial; you, along with other major contributors, can comment more. Even if refactoring the code is a bigger change, IMO the long-term benefits of an API-based interface for network functionality are worth it for future-proofing Kubernetes. In fact, it is in line with the design principles of Kubernetes, which clearly define discrete functions that plug in to make a solution.

@erictune
Member

If k8s-allocated service IPs are not by default unique across namespaces then we would want to have a way to allocate ones that are. Something along the lines of createExternalLoadBalancer, except not exposed to the internet, just to other "internal" clients.

@smarterclayton
Contributor

It's possible service IPs can be the exception to the rule because they are "virtual" - if you wanted to use a real service IP (via some special integration mechanism), we'd want to change how service IPs are allocated from "blocking, inside POST" to "controller driven via watch, extensible". Other code would then need to handle "service doesn't have IP yet" which doesn't seem onerous. With that in mind you'd be able to build integrations that do IP allocation to real systems for performance, or for more advanced virtual IPs (not iptables) via SDN controllers that take into account what flows you need to be part of.


@rajatchopra
Contributor

Agree with all the use cases listed in this proposal.
I worked a bit on putting together a few networking solutions for kubernetes, and I agree that it is best if we can have hook points where the network provider can be called so that it can act suitably and independently.

Proposed hooks

Proposing a basic call-out API (southbound) from kubernetes, where 'sdn' is a packaged plugin or an interface that is free to call out to a binary if it chooses to.

| API no. | Name | Called from | Args | Comments |
|---|---|---|---|---|
| 1 | sdn.startMaster | master, after etcd is initialized | () | Initialize (sub) network range etc. |
| 2 | sdn.updateMinion | master, when a minion event occurs | (host, eventType) | init host subnets? |
| 3 | sdn.updatePod | master, in the scheduler after deciding which host to write to, but before actually writing to etcd when a new pod is born | (pod, eventType) | request network parameters for the new pod, e.g. new vlan, ipam etc. In case of a delete event, relinquish the network resources |
| 4 | sdn.startNode | minion, before kubelet begins watching boundpods | () | init subnet? rewire docker etc. |
| 5 | sdn.updatePod | minion, before/after the network container of the pod is launched | (pod) | apply the network parameters (IP/MAC), program the vswitch etc. |

Integration examples

A table on how some solutions could use the above APIs

| Solution | API numbers | Comments |
|---|---|---|
| Flannel | 1, 4 | just init the network at the beginning, and at each node's start |
| OVS-simple | 1, 2, 4 | watch for minions added/removed, and init the entire minion's subnet at node's start |
| OVS-complex | 1, 3, 4, 5 | assign an IP/MAC/VxLAN when each pod is born, and program the switch at the node when each pod's networking comes alive |

The OVS simple/complex solutions have been borrowed from here.
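To make the shape of such a call-out API concrete, here is a minimal Go sketch of the five hooks as a plugin interface. The interface name, argument types, and event type are illustrative assumptions, not an agreed-upon design.

```go
// Hypothetical sketch of the proposed southbound "sdn" call-out API.
// Names and signatures are illustrative only.
package sdnplugin

// EventType distinguishes add/update/delete events delivered to the plugin.
type EventType string

const (
	EventAdd    EventType = "ADD"
	EventUpdate EventType = "UPDATE"
	EventDelete EventType = "DELETE"
)

// Host and Pod stand in for the real Kubernetes objects; only the fields a
// network plugin would plausibly need are shown.
type Host struct {
	Name   string
	Subnet string // e.g. "10.1.3.0/24", assigned by the plugin
}

type Pod struct {
	Name      string
	Namespace string
	Host      string
	IP        string // filled in by the plugin (hook 3 or 5)
}

// Plugin is the southbound interface a network provider would implement.
type Plugin interface {
	StartMaster() error                         // 1: after etcd is initialized; set up (sub)network ranges
	UpdateMinion(h Host, ev EventType) error    // 2: minion added/removed; init host subnets
	UpdatePodMaster(p *Pod, ev EventType) error // 3: before pod is written to etcd; allocate vlan/ip/mac
	StartNode() error                           // 4: before kubelet watches boundpods; rewire docker etc.
	UpdatePodNode(p *Pod) error                 // 5: around network-container launch; program the vswitch
}
```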

@smarterclayton
Contributor

Why are these not simply watching Kubernetes and making the appropriate changes to other resources? And why not just pass this info into docker at network creation time?


@smarterclayton
Contributor

More concretely, invert this so you're observing Kube and putting the right information in the right place. We don't use a "send a notification that this happened" pattern for the core code paths, so the pattern for getting the info you need to do the right thing at the network layer has to change.

E.g.

  1. Will probably never exist in Kube, but you can easily implement this in an SDNController (initialize on startup)
  2. Is unlikely to be what you need - you want to ensure a node is assigned a network segment, but that can be controlled at the kubelet level (a plugin there along the lines of "what information do I need about myself")
  3. We're unlikely to intercept events in the scheduler because it's not re-entrant - you probably need your SDNController to register the pod and node and any other relevant info in either your external store, or record the data that decides what unique value gets assigned.

A better bet may be to plug in to the kubelet or Docker and block startup until that information becomes available. It's certainly possible for you to annotate the pods themselves and read that data, but you may need a side channel.

It'd be best to demonstrate the concrete steps we need for each piece (assignments, mappings, and flow of info) so that we can find the solution that is going to preserve the core flows in place today.
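As an illustration of the "observe Kube" pattern described above, a minimal out-of-process controller might watch pods through the apiserver and react to events. This sketch uses today's client-go for brevity (which postdates this thread); the kubeconfig path is a hypothetical example.

```go
// Minimal sketch of an out-of-process "SDN controller" that observes pods via
// the Kubernetes API rather than being called synchronously from core code paths.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/kubeconfig") // hypothetical path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch pods in all namespaces and react to add/delete events by
	// programming the network (allocation, flows, etc.) out of band.
	w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		fmt.Printf("%s pod %s/%s on node %s\n", ev.Type, pod.Namespace, pod.Name, pod.Spec.NodeName)
		// Here an SDN controller would allocate/release network resources
		// and record its decisions in its own store or as pod annotations.
	}
}
```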


@jainvipin
Author

@rajatchopra, @smarterclayton
Exactly the discussion I was hoping we could get into!

Instead of jumping to the actual API, may I suggest that we discuss two things:

[1] Design goals of such an API:

  • The API is not overly specific to particular use cases
  • Special functions, like rules enforcement (think umpteen iptables rules!), are performed in the drivers using the generic API trigger events

[2] Trigger point requirements for such an API (to keep things simple and consistent, we can assume they are triggered on both master/kubelet); a rough sketch follows the list:

  • Network create/delete/update events, where a network is defined as a subnet (or bridged domain)
  • Network AttachPoint create/delete/update, where an attach point is a leg into the network. Typically a pod would have one attach point into a network.
  • External AttachPoint create/delete/update of an external attach point (unlike the previous one, this is exposed externally and will glue into the proxy)
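A rough Go sketch of these trigger points as a driver interface; the type names and fields are invented purely to make the requirements concrete.

```go
// Hypothetical driver interface matching the trigger points above; names and
// fields are assumptions for illustration.
package netdriver

// Network is a subnet or bridged domain.
type Network struct {
	Name   string
	Subnet string // e.g. "10.2.0.0/24"
}

// AttachPoint is one leg of a pod into a Network.
type AttachPoint struct {
	Network string
	PodName string
	IP      string
	MAC     string
}

// ExternalAttachPoint exposes a Network externally (glues into the proxy).
type ExternalAttachPoint struct {
	Network  string
	PublicIP string
}

// Driver is triggered on both the master and the kubelet.
type Driver interface {
	NetworkCreated(n Network) error
	NetworkUpdated(n Network) error
	NetworkDeleted(n Network) error

	AttachPointCreated(a AttachPoint) error
	AttachPointUpdated(a AttachPoint) error
	AttachPointDeleted(a AttachPoint) error

	ExternalAttachPointCreated(e ExternalAttachPoint) error
	ExternalAttachPointUpdated(e ExternalAttachPoint) error
	ExternalAttachPointDeleted(e ExternalAttachPoint) error
}
```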

Of course, there is the reality of the code:

  • Where are these API calls made? IMHO in the master, the kubelet, and the proxy.
  • When are these calls made? Depending on the event, either upon starting the daemon or later as pods get scheduled.
  • In many cases the plugin events will translate to the driver plugin calls mentioned here; this tremendously simplifies things by avoiding having to inter-work between layers. So I prefer @smarterclayton 's suggestion for this:

A better bet may be to plugin to the kubelet or Docker and block startup until that information becomes available. It's certainly possible for you to annotate the pods themselves and read that data, but you may need a side channel.

Like others, I have many more things to discuss. However, IMO it is most important to get consensus about the design goals and requirements first. And of course my original question in the beginning - if Kubernetes designers also think this is a good direction: @thockin, @brendanburns, @lavalamp, @jbeda ?.

@lavalamp
Member

Can we change the name of this to "Proposal: support network segmentation"? Fixing up scheduling would be a relatively minor requirement.

If we want to support network segmentation, I think it's logical that namespaces are the units that you place into a network segment. (so multiple namespaces could be on the same segment, but a single namespace can't be split across segments.) I also think that each kubelet should be in only one network segment, which makes everything way easier.

I see some discussion about k8s components making api calls to network fabric implementations every time a pod starts. That doesn't sound scalable to me.

Some evolution of our current system where each kubelet gets a range of IP addresses is probably more feasible. I could maybe see doing a call out to a network fabric to request an IP range for a node when it joins the cluster.

The other part of this is defining some way of specifying how bridging happens, so you can use a service in another namespace. I think it'll be clear how to do that once we get the service object hashed out; it may be as simple as making an "external" service localNS/foo that points to remoteNS/serviceFoo's public IP(s).

Scheduling should be as simple as putting a segment label (or something similar) on nodes.

@erictune
Member

We have previously talked about partitioning kubelets into groups for reasons not specifically related to networking (such as QoS). We should think about how network segments would interact with partitions.

@erictune
Member

I'm not sure how well it will work to require kubelets to be in only one segment.

  • If there is one segment for every namespace, then it seems like you are going to end up with poor utilization.
  • If there are enough namespaces sharing a segment to result in good utilization, then the security and isolation benefits are diluted.

@lavalamp
Member

I guess it depends on how much stuff will be running in a single namespace.

At any rate, it is probably possible to make kubelet multi-segment-aware in the future. But it's probably easier to get it working/designed initially without that requirement.

@smarterclayton
Contributor

I think we have a concrete use case today in OpenShift (for which we're going to do extra work obviously) that would have many segments per kubelet. For dev style environments you want aggressive encapsulation, and you'd run out of kubelets before you ran out of people who just want to run a 128mb web server that uses 0.1% of the CPU.


@jainvipin
Author

agree with @smarterclayton. My requirement also fits multiple segments per kubelet much better - segments are very disposable and scale high. To avoid the scale challenge it is best to support multiple segments per kubelet. In fact, some people rightfully use the term 'micro-segmentation' to describe these use cases.

@lavalamp: How about if we remove scheduling from the proposal title, and keep it 'Proposal: decouple networking for a wider set of use cases'?

@shettyg

shettyg commented Jan 14, 2015

@rajatchopra
You wrote:

OVS-simple 1,2,4 watch for minions added/removed, and init the entire minion's subnet at node's start
OVS-complex 1,3,4,5 assign an IP/Mac/VxLAN when each pod is born, and program the switch at node when each pod's networking comes alive

Just food for thought: how safe is it to program an Open vSwitch in a VM where containers run? A privileged container can easily change tunnel IDs or flows and access traffic from other tenants. IMO, the security of containers is not yet established enough to trust them not to break out and change OVS in a VM. I think a better approach to consider is to tag traffic in the VM and create tunnels in the hypervisor. Not sure Kubernetes can handle that (so the worst case is that containers from the same subnet are affected, but not containers from other tenants).

@mrunalp
Contributor

mrunalp commented Jan 14, 2015

@shettyg I think that security should be a separate concern. If one is talking about multi-tenancy then they are already bought into the security of containers (however secure or not secure they may be).

@jainvipin Agree that the title of the proposal should better reflect the discussion.

Also, note that we have a working prototype (read: hacked together) for multi-tenancy in kubernetes, and the hooks that @rajatchopra proposed are from that POC. It will be good to get feedback and agreement on the overall design to accommodate different networking solutions.

@smarterclayton Here are some concrete examples for the items --

  1. Store some settings in etcd, e.g. the base network 10.1.0.0/16 from which to assign subnets to the minions.
  2. This could be used for IPAM, e.g. where there is a central authority handing out minion subnets. I agree that this one is probably the weakest use case.
  3. In the case of multi-tenancy, the SDN controller would need to know what namespace the pod belongs to so it can assign it the right vxlan/ip/mac/other parameters. This information needs to be populated in the pod structure before the pod can actually be run. However, in some cases, the IP address can only be assigned after the subnet is decided, hence the mention of interaction with the scheduler.
    Suggestions welcome on how best to solve this.
  4. When a minion comes up, configure it; e.g. modify the docker options to use the assigned subnet. How it gets the parameters is a question. It can ask the master or just watch for it. Some code running detects a minion coming up, assigns it a subnet, and records it in etcd.
  5. Finally, calling the network driver (after the network namespace has been created). This could be a call out to a binary or a plugin which needs access to the network namespace of the network container and does the wiring inside the namespace. (A rough sketch of this call-out follows the list.)
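As a rough illustration of item 5, a kubelet-side call-out to an external network driver might look like the following sketch. The driver binary path, argument order, and the /proc-based netns path convention are assumptions.

```go
// Sketch of the kubelet calling an external network driver binary after the
// pod's network (infra) container has been created. The driver name, argument
// order, and the /proc-based netns path are illustrative assumptions.
package main

import (
	"fmt"
	"os/exec"
)

// callNetworkDriver hands the pod identity and its network namespace to an
// out-of-process driver, which does the actual wiring inside the namespace.
func callNetworkDriver(driverPath, namespace, podName string, infraPID int) error {
	netnsPath := fmt.Sprintf("/proc/%d/ns/net", infraPID)
	cmd := exec.Command(driverPath, "setup", namespace, podName, netnsPath)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("network driver failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Hypothetical invocation once the infra container is up.
	if err := callNetworkDriver("/usr/libexec/kubernetes/net-driver", "default", "web-1", 12345); err != nil {
		fmt.Println(err)
	}
}
```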

@shettyg

shettyg commented Jan 14, 2015

@mrunalp
You wrote:

I think that security should be a separate concern. If one is talking about multi-tenancy then they are already bought into the security of containers (however secure or not secure they may be).

Let me clarify. What I am saying is that you can achieve multi-tenancy securely for containers if you do the following.

  1. All the containers in a single VM belong to the same tenant. (If I am not wrong, that is the Kubernetes assumption.)
  2. The OVS that is programmed by the network controller remains in the hypervisor.
  3. The OVS or Linux bridge in the VM tags traffic coming out of a container with unique vlans. This limits the number of interfaces to 4096 in one VM; you need unique vlans only inside a VM, other VMs can have the same vlans, and vlan-in-vlan will give more options if needed.
  4. In the hypervisor, the vlan tags coming from a particular vif of a VM identify a container interface. The vlan tags are stripped. After that it is as good as treating the container as a full-fledged VM in terms of network security.

If a container does break out in the VM, the worst case is that it will change the VLAN tag. So the hypervisor may believe traffic coming out of a container is from a different container of the same VM, but it will only affect a single tenant.

My question for everyone is whether the Kubernetes architecture allows it to make changes in the hypervisor that hosts a VM. (I really have no idea how Kubernetes internals work.) Is this model even workable with Kubernetes?

@jainvipin changed the title from "Proposal: decouple networking from scheduling" to "Proposal: decouple networking for segmentation and other use cases" on Jan 14, 2015
@larsks

larsks commented Jan 23, 2015

@mrunalp directed me here from #google-containers to describe a use case that might be relevant to this discussion.

I would like to run an application under Kubernetes that by design needs to be attached to the same L2 broadcast domain as some devices that it will manage (because it uses a broadcast discovery mechanism). If this were the only service I were running I would simply add the appropriate physical NIC to the default docker bridge and it would Just Work...

...but it's not the only contained service I want to run, and other services don't need to be/shouldn't be attached to the same physical network.

I would like a way to define "networks" to Kubernetes such that pods can request attachment to specific segments. I imagine that this would work by moving network configuration out of Docker and into Kubernetes, such that the network namespace container would be created with --net=none, and then Kubernetes would perform the appropriate interface manipulation to attach the container to specific bridge devices.

The hook proposal from @rajatchopra is sort of what I was thinking about (a sketch of the wiring step follows the list):

  • Kubernetes would create the network namespace container, then
  • Call out to some external code to perform the network configuration, then
  • Spawn the additional containers in the pod
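A sketch of what that external wiring step might do once the network namespace exists, assuming the infra container was started with --net=none. The bridge and interface names are made up, and the host's ip/nsenter tools do the actual work.

```go
// Sketch of driver-side wiring: create a veth pair, attach one end to a chosen
// bridge, and move the other end into the pod's network namespace. Bridge and
// interface names are illustrative; error handling is minimal for brevity.
package main

import (
	"fmt"
	"os/exec"
	"strconv"
)

func run(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v: %v: %s", name, args, err, out)
	}
	return nil
}

// attachPodToBridge plumbs the pod (identified by the PID of its infra
// container, started with --net=none) into the requested Linux bridge.
func attachPodToBridge(bridge string, infraPID int, addr string) error {
	pid := strconv.Itoa(infraPID)
	hostIf := "veth" + pid + "h"
	podIf := "veth" + pid + "p"

	// Create a veth pair, attach the host end to the bridge, and move the
	// other end into the pod's network namespace.
	if err := run("ip", "link", "add", hostIf, "type", "veth", "peer", "name", podIf); err != nil {
		return err
	}
	if err := run("ip", "link", "set", hostIf, "master", bridge); err != nil {
		return err
	}
	if err := run("ip", "link", "set", hostIf, "up"); err != nil {
		return err
	}
	if err := run("ip", "link", "set", podIf, "netns", pid); err != nil {
		return err
	}
	// Configure the pod-side interface from inside the pod's namespace.
	if err := run("nsenter", "-t", pid, "-n", "ip", "addr", "add", addr, "dev", podIf); err != nil {
		return err
	}
	return run("nsenter", "-t", pid, "-n", "ip", "link", "set", podIf, "up")
}

func main() {
	if err := attachPodToBridge("br-tenant1", 12345, "10.2.0.5/24"); err != nil {
		fmt.Println(err)
	}
}
```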

@jainvipin
Author

@larsks - completely agree; in fact, what you say describes the use case I tried describing in the proposal. With hooks provided for this, it would be possible for an external entity to configure networking. It also works in line with docker's network drivers proposal; it fits really well, since the network can be flexibly configured for the use cases described in this proposal and the one you describe.

As @erictune pointed out in #3585 it would make sense to prototype and learn more about the best possible ways to provide this flexibility in Kubernetes. I am playing with some code as we speak...

@mrunalp
Contributor

mrunalp commented Feb 16, 2015

How about a community hangout next Friday (20th Feb) to discuss networking plugins?

@mrunalp
Contributor

mrunalp commented Feb 16, 2015

I have scheduled a public hangout for Feb 20th at 1 P.M. PST: https://plus.google.com/events/cfgii4a4qgu5lhpptlgs1ll23o8

Does that work?

@mrunalp
Contributor

mrunalp commented Feb 17, 2015

Edited to move this earlier to 10:00 a.m. to accommodate more folks. @jainvipin, does that work for you?

@jainvipin
Author

@mrunalp - works for me...

@joeswaminathan

Was the event moved to 9 AM?

@rajatchopra
Contributor

I have to apologise. Yes, it was re-arranged to 9am, but I did not put the final time up in this thread. There was a parallel email chain (on google-containers at googlegroups) about scheduling this, and I missed that everybody may not be in that group.

Summary/action-items of the hangout:

  1. General agreement that network solutions need hooks. The most important of them is at the point where the pod is created, specifically for the network container.
  2. Docker's network plugins may be sufficient for some networking solutions. But we need the kubelet to at least be able to provide more information about the pod than just the docker parameters, so that the networking plugin can act accordingly. The docker powerstrip may be useful too.
    Even with docker networking plugins we may still want southbound hooks.
  3. We want minimal common hooks, none for hypothetical cases. So we should sample some real solutions and study them through actual PoCs. At least whiteboard the real cases. <- action item for all interested.
    We discussed flannel, weave and ovs as probable samples (in no way an indication of importance or quality).
  4. A face-to-face/google-hangout next week? Possibly at the next Kubernetes gathering.

@thockin, kindly add/edit any minutes that I missed.

@jainvipin
Author

@rajatchopra - I do want to bring up an important point that I brought up in the meeting as well:

  • Should we say that we target a few use cases instead of flannel/weave/ovs?

The difference is big. If we target the use cases, we will cut the hooks/APIs to cater to those use cases; Flannel/Weave/Openshift-sdn are ways to achieve those use cases. I'd also put OVS as something consumed by tools/solutions like flannel/weave/openshift-sdn, instead of being in the same category.

@bgrant0607
Member

I just want to add that I'm supportive of the direction discussed here. Kubernetes should absolutely allocate network resources and orchestrate network configuration on behalf of applications.

@ravigadde
Contributor

I am supportive of the direction in which this is going.

I want to bring up one more use case - deploying a load balancer like haproxy or nginx within the cluster (instead of using the service proxy). Assume there is a pool of IP addresses (maybe a /24 subnet) dedicated to external connectivity (Public IPs) for the cluster. Each such load balancer needs to be assigned an IP from this pool. If there are multiple load balancers supporting a service, assume DNS load balancing takes care of that.

This brings in two concepts that Kubernetes doesn't support today.

  1. Network pool - a resource that can be created and managed. This pool has a set of IPs (most commonly an entire subnet) and spans some/all nodes in the cluster.
  2. Network scheduler - when a pod requests a Public IP/an IP from this pool, the scheduler should be aware of two things when making the scheduling decision (a rough sketch follows below):
    a) whether there are IPs left in the pool
    b) the set of nodes that the pool spans, evaluating resource (cpu/mem/other) constraints only on those nodes

Happy to write up a formal proposal and contribute code if there is interest.
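To make the two concepts concrete, here is a rough Go sketch of what a network-pool resource and the scheduler-side feasibility check might look like; the type names and fields are invented for illustration.

```go
// Hypothetical sketch of a "network pool" resource and the scheduler-side
// check described above. All names and fields are invented for illustration.
package netpool

import (
	"errors"
	"net"
)

// NetworkPool is a cluster-scoped set of IPs (e.g. public IPs) that spans a
// subset of nodes.
type NetworkPool struct {
	Name      string
	CIDR      string   // e.g. "203.0.113.0/24"
	Nodes     []string // nodes the pool spans
	Allocated map[string]bool
}

// Available reports whether the pool still has free addresses.
func (p *NetworkPool) Available() (bool, error) {
	_, ipnet, err := net.ParseCIDR(p.CIDR)
	if err != nil {
		return false, err
	}
	ones, bits := ipnet.Mask.Size()
	total := 1 << (bits - ones)
	return len(p.Allocated) < total, nil
}

// FeasibleNodes filters candidate nodes for a pod requesting an IP from the
// pool: the pool must have free IPs, and only nodes the pool spans qualify.
// CPU/memory fit would still be evaluated by the normal scheduler predicates.
func FeasibleNodes(p *NetworkPool, candidates []string) ([]string, error) {
	ok, err := p.Available()
	if err != nil {
		return nil, err
	}
	if !ok {
		return nil, errors.New("network pool exhausted")
	}
	spans := map[string]bool{}
	for _, n := range p.Nodes {
		spans[n] = true
	}
	var out []string
	for _, c := range candidates {
		if spans[c] {
			out = append(out, c)
		}
	}
	return out, nil
}
```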

@jainvipin
Author

+1 on making the scheduler aware of network/storage resources, which are sometimes cluster-wide and not host-specific. Specifically for IPAM, we discussed this when coming up with the minimal/extended APIs under this proposal. It may not be the absolute desired behavior, but it works as long as the API can return an error and thus prevent the pod from getting scheduled.

Generically, the ability to evaluate a resource that is shared across nodes, and the corresponding code changes, might need a separate discussion thread.

@rajatchopra
Contributor

Proposing the most basic needs/changes.

  • Init : Register the network provider through the apiserver. This config should be read-only accessible.
  • API to watch minions (and namespaces?) through the apiserver.
  • Annotations (extensible fields) on the pod object and APIs to read/write.
  • Two southbound hooks from kubelet, after infra container creation, and before delete (implemented in [WIP] southbound networking hooks in kubelet #5069).

This indicates that the 'network providers' will have to be run as separate services, and not compiled into the kube code.
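A small sketch of how pod annotations could carry network parameters between the network provider and the kubelet hooks; the annotation keys are made up for illustration.

```go
// Sketch of using pod annotations as the extensible fields mentioned above.
// The annotation keys are invented for illustration; a real provider would
// read and write them through the apiserver.
package netannotations

// Hypothetical annotation keys a network provider might own.
const (
	AnnotationNetwork = "example.net/network" // which segment the pod belongs to
	AnnotationIP      = "example.net/ip"      // IP assigned by the provider
	AnnotationMAC     = "example.net/mac"     // MAC assigned by the provider
)

// SetNetworkParams records the provider's decisions on the pod's annotations.
func SetNetworkParams(annotations map[string]string, network, ip, mac string) {
	annotations[AnnotationNetwork] = network
	annotations[AnnotationIP] = ip
	annotations[AnnotationMAC] = mac
}

// NetworkParams reads them back, e.g. in the kubelet-side hook after the
// infra container has been created.
func NetworkParams(annotations map[string]string) (network, ip, mac string, ok bool) {
	network, okN := annotations[AnnotationNetwork]
	ip, okI := annotations[AnnotationIP]
	mac, okM := annotations[AnnotationMAC]
	return network, ip, mac, okN && okI && okM
}
```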

@erictune
Member

erictune commented Mar 4, 2015

the cloud provider is compiled into the apiserver and set via a flag on the apiserver (--cloud_provider). Should network provider follow the same pattern?

I think minions are watchable, but if they aren't, agree that they should be. minions aren't in namespaces, so maybe you don't need to watch namespaces?

annotations are already on the pod. should work.

Not following logic of why network providers have to be services.

@rajatchopra
Contributor

the cloud provider is compiled into the apiserver and set via a flag on the
apiserver (--cloud_provider). Should network provider follow the same
pattern?

Yes. Thanks for the pointer.

I think minions are watchable, but if they aren't, agree that they should be.
minions aren't in namespaces, so maybe you don't need to watch namespaces?

Yes. They are watchable, I was looking at a really old checkout. Namespaces are a separate thing, and apparently the watch api exists there also. Some network providers may be acting on namespaces appearing/disappearing.

annotations are already on the pod. should work.

Not following logic of why network providers have to be services.

By services, I just meant separate processes/daemons. None of the 'network provider' daemons will be invoked by any of the kube processes. The alternative would have been to start the registered network-provider daemon upon Init of apiserver/kubelet.

@erictune
Member

erictune commented Mar 5, 2015

That is what I thought you meant by services.

A cloud provider in kubernetes is a compiled-in bit of go code. It may in turn talk to a web service like GCE.

I'm wondering if/why network providers are different.
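For comparison, the compiled-in alternative would look roughly like the cloud-provider registration pattern mentioned above. This is a simplified sketch of the shape, not the actual cloudprovider package API.

```go
// Simplified sketch of a compiled-in provider registry, in the spirit of the
// cloud-provider pattern. This is not the actual Kubernetes cloudprovider API,
// just an illustration of the shape.
package netproviders

import (
	"fmt"
	"sync"
)

// Provider is whatever interface a network provider must satisfy.
type Provider interface {
	Name() string
}

var (
	mu        sync.Mutex
	providers = map[string]func() (Provider, error){}
)

// Register is called from an init() in each provider's package.
func Register(name string, factory func() (Provider, error)) {
	mu.Lock()
	defer mu.Unlock()
	providers[name] = factory
}

// Get instantiates the provider selected by a flag such as --network_provider.
func Get(name string) (Provider, error) {
	mu.Lock()
	factory, ok := providers[name]
	mu.Unlock()
	if !ok {
		return nil, fmt.Errorf("unknown network provider %q", name)
	}
	return factory()
}
```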

@jainvipin
Author

A separate binary for the two southbound hooks from the kubelet mentioned above will require the kubelet to do an active wait, unless state transitions are handled asynchronously, i.e. more code for async handling.

annotations are already on the pod. should work. Not following logic of why network providers have to be services.

By services, I just meant separate processes/daemons. None of the 'network provider' daemons will be invoked by any of the kube processes. The alternative would have been to start the registered network-provider daemon upon Init of apiserver/kubelet.

@derekwaynecarr
Member

Assuming a network provider watches Nodes, what is the relationship between a Node being in a Ready status and the network actually being set up?

Today a NodeController inserts a Node, and once the Kubelet reports an ok health check, the node is considered Ready.

Should a Node really be Ready if its network is not yet set up? If I understand what is proposed, how will the Kubelet know it is ready?


@mrunalp
Contributor

mrunalp commented Mar 5, 2015

@erictune said:

I'm wondering if/why network providers are different.

We could create interfaces and compile in network provider implementations, but from the discussions we have had so far it seemed like there was more inclination towards keeping the network providers independent.

It could certainly be changed if we feel that is the wrong direction.

@rajatchopra
Contributor

@erictune It will be difficult to standardize on the interface if we were to choose a compile-in option. There are just too many possible aspects of the network (L2/L3/firewall/encap etc.) and different solutions may have completely non-overlapping needs. It will be difficult to get a minimum viable set, e.g. vxlan vs sriov vs ipvlan.

@derekwaynecarr We may want a sync mechanism in the future, and finalizers are one approach, i.e. the kubelet does not report itself ready unless the minion resource has certain finalizer flags set. Nevertheless, we would want the 'network provider' to be ready and resilient to the lifecycle of the kubelet.

@derekwaynecarr
Member

FYI: I am trying to fix terminology.

Initializers deal with making something ready.

Finalizers deal with doing what is needed before an object is allowed to be deleted.


@thockin
Member

thockin commented Mar 17, 2015

I apologize for not responding to this thread before. I have been reading along and chewing on it. I went back through the whole thing today and here are some notes.

First, I very much agree this is useful and possible. I think the approach seems OK. I admit that I am NOT familiar enough with some of the deep networking stuff to know how it works and what it needs, so I apologize for any dumb questions.

Second, I think it is important to keep this as simple as possible and free from details of specific use-cases. That said, it has to actually solve problems and be usable, so I am looking for real experience as we iterate through this and evolve a solution. I don't want to force people to jump through terrible hoops to achieve results, but we just can't accommodate every case in the core design.

I said "evolve". I doubt very much we'll get it right on the first try, so let's keep in mind that this is a pre-release thing until we have a few solid testimonials. Until then, any early-adopters had better be ready to adjust as we fine-tune :)

On service IPs: an option for freeing up and differentiating service IPs might be to go to IPv6 for the virtual IPs. Not directly germane to this topic, but it popped into my head.

Re: linked-in vs not, I think that for a number of low-bandwidth plugins we would do well to have an out-of-process model. I think cloud-provider might be a good candidate for that, and I think this is too.

Re: when is a node ready? If we have a node init hook, the node is ready when that hook completes. It's not clear to me what the relationship between network and cloud provider and node should be, but I can certainly see use cases for flannel on GCE, for example, so they are not the same concern - there's some orthogonality.

I'm going to read #5069 next and try to get concrete.

@thockin thockin added kind/design Categorizes issue or PR as related to design. and removed kind/design Categorizes issue or PR as related to design. priority/design labels May 19, 2015
@bgrant0607 bgrant0607 added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Jun 2, 2015
@feiskyer
Member

New proposal PR opened at #15465, let's make this happen

@thockin
Member

thockin commented Jan 16, 2016

The network SIG is working on a proposal, so I am closing the existing proposals. Please reopen if you think I am wrong. :)

@thockin thockin closed this as completed Jan 16, 2016