
Initial ipv6 / iptables work #2147

Merged: 5 commits into kubernetes:master on Nov 14, 2014
Conversation

@justinsb (Member) commented Nov 4, 2014

More for discussion than for actual merging (for now).

I think IPv6 could solve the address allocation problem that seems to be holding k8s back on EC2. EC2 "tolerates" IPv6 for internal networking by using protocol 41 encapsulation.

I'm currently working on getting cluster/kube-up.sh to work, but in the meantime feedback on this idea would be helpful!

@jbeda (Contributor) commented Nov 4, 2014

Does this make sense before Docker supports IPv6?

Review comment (Member) on the diff:

    }

    type Protocol bool

Please use int or byte rather than bool - bool will just encourage implicit truth testing.
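
(For illustration, a minimal sketch of the kind of non-bool Protocol type being suggested, assuming only TCP and UDP need to be distinguished today; the constant names are hypothetical, not taken from this PR.)

    package api

    // Protocol identifies an IP protocol. Using a small integer type instead of
    // a bare bool avoids implicit truth testing and leaves room for more values.
    type Protocol byte

    const (
        ProtocolTCP Protocol = iota
        ProtocolUDP
    )

    func (p Protocol) String() string {
        switch p {
        case ProtocolTCP:
            return "TCP"
        case ProtocolUDP:
            return "UDP"
        }
        return "unknown"
    }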

@thockin (Member) commented Nov 4, 2014

I don't see any obvious problems here, but there are a lot of unknowns (to me anyway).

Disabling masquerade sounds OK, but I have no idea what that does to egress in v6 environments (e.g. will the fabric egress traffic with a source IP that isn't "officially known"?). I know GCE won't (for v4; there's no v6 support).

I don't know v6 well, and so can't visually vet the iptables changes without testing.

Do we need v6 for service portals? Those IPs never hit the wire anyway...

@justinsb (Member, Author) commented Nov 4, 2014

Here's my patch for Docker IPv6 support: moby/moby#8896. Hopefully it lands soon! But I'm working on this concurrently in case we want more features in Docker itself.

The idea is that we won't actually route IPv6 outside of our k8s cluster on EC2 (unless EC2 adds support for "real" IPv6 in a week or two ;-). Instead, we'll do something like NAT to get to the IPv4 internet from IPv6. NAT for IPv6 means either running an HTTP proxy (for really locked-down environments) or running something like NAT64 (I tried this with TAYGA and it worked, though I hope we can find something even better!).

If we're on a machine which actually supports real IPv6 (so e.g. the host has a /64), then egress (and ingress!) will work. You often have to respond to neighbor-discovery requests, which you can do as described at http://www.ipsidixit.net/2010/03/24/239/. I imagine we'll add this into Docker for "real" IPv6.

For inbound traffic on EC2, we will have to continue to listen on an IPv4 address, so that EC2 can talk to it. Most likely scenario is that we have nginx/haproxy running in a pod, which then forwards to the correct backend services (over IPv6).

For service portals, IPv4 vs IPv6 doesn't really matter, but I think IPv6 not only gives us more IP addresses but also means that the pods could be IPv6 only.

I think it's also likely that we end up with each Docker instance having a 172.16.x.x IPv4 and a routable IPv6 address.

It's early, but I think this is a good way to explore the option. Let me know if you'd rather switch to a different medium.

@jbeda (Contributor) commented Nov 4, 2014

I see -- you are working the entire stack :)

I'd love to make sure that this works on top of GCE -- which also, unhappily, doesn't currently have IPv6 support. If it'll work on EC2 we can probably make it work on GCE too.

@thockin (Member) commented Nov 4, 2014

Fair enough. I don't see any problems with it, but I have no way to test it or to know that it keeps working. Maybe we can get some e2e support for IPv6.

@justinsb (Member, Author) commented Nov 4, 2014

I'm definitely pushing forwards on multiple fronts here ;-)...
https://code.google.com/p/google-compute-engine/issues/detail?id=8
https://code.google.com/p/google-compute-engine/issues/detail?id=9

I agree this needs e2e tests! I was able to launch a Docker instance on a k8s minion on EC2 with an IPv6 address using docker run, though I had to manually set up the VPC, security groups, instances, etc. (for now). The next step is to try a k8s pod (and to get my EC2 scripts building that VPC etc.). But there are a lot of moving parts here, so I want to get everything out there early!

@jbeda (Contributor) commented Nov 4, 2014

We'd be happy to get your EC2 scripts checked in.

@justinsb (Member, Author) commented Nov 5, 2014

Awesome - I will tidy up the EC2 scripts a little and get them pushed to a branch / PR.

It uses a set (via a map) of allocated IPs
@brendandburns (Contributor) commented:
LGTM, merging.

brendandburns added a commit that referenced this pull request Nov 14, 2014
Initial ipv6 / iptables work
@brendandburns merged commit c2485a4 into kubernetes:master on Nov 14, 2014

Review comment (Member) on pkg/registry/service/ip_allocator.go:

    // Try randomly first
    for i := 0; i < ipa.randomAttempts; i++ {
        ip := ipa.createRandomIp()

Is the random aspect of this important, or just easy?

Reply from @justinsb (Member, Author):

Keeping a full bitmap is out of the question for e.g. a /64, which is what motivated the change.

This should be more efficient than a linear scan, if we expect the address space to be sparsely populated. But there are also correctness aspects:

I've had problems with IP address reuse in the past:

  • where ARP or the ipv6 equivalents got confused (surmountable with unsolicited ARP and equivalents)
  • where the kernel cgroups or bridge got confused (particularly with IPv6; the symptom was that attempting to assign the IPv6 address to the LXC instance would just fail, but after a few instance restarts / time-delay it would eventually work. I don't know if this still happens, or whether I was just doing something wrong.)

Also, it seems a little risky to assign an IP address immediately to the next requester, in case that is a different tenant. Having an LRU queue would probably be better.

Of course, these are real problems, and randomizing just buries them in the long tail. We can change randomAttempts to 0 or just remove the randomizing code, to see if any of these problems still occur.
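
(For illustration, a minimal sketch of the random-first allocation idea under discussion, keeping the set of allocated IPs in a map rather than a full bitmap; the method and field names follow the snippet above, but the rest is hypothetical, not the PR's actual code.)

    package main

    import (
        "crypto/rand"
        "fmt"
        "math/big"
        "net"
    )

    // ipAllocator hands out IPs from a subnet. A map works as the "used" set even
    // for huge ranges (e.g. a /64), where a full bitmap would be out of the question.
    type ipAllocator struct {
        subnet         *net.IPNet
        used           map[string]bool // set of allocated IPs, keyed by ip.String()
        randomAttempts int
    }

    // Allocate tries a few random addresses first; on repeated collisions the real
    // code would fall back to a scan, which is omitted here.
    func (ipa *ipAllocator) Allocate() (net.IP, error) {
        for i := 0; i < ipa.randomAttempts; i++ {
            ip := ipa.createRandomIp()
            key := ip.String()
            if !ipa.used[key] {
                ipa.used[key] = true
                return ip, nil
            }
        }
        return nil, fmt.Errorf("no free IP found in %s after %d random attempts", ipa.subnet, ipa.randomAttempts)
    }

    // createRandomIp picks a uniformly random address inside the subnet.
    func (ipa *ipAllocator) createRandomIp() net.IP {
        ones, bits := ipa.subnet.Mask.Size()
        hostBits := uint(bits - ones)

        // Random offset in [0, 2^hostBits), added to the network base address.
        max := new(big.Int).Lsh(big.NewInt(1), hostBits)
        offset, _ := rand.Int(rand.Reader, max)
        sum := new(big.Int).Add(new(big.Int).SetBytes(ipa.subnet.IP), offset)

        ip := make(net.IP, len(ipa.subnet.IP))
        sum.FillBytes(ip)
        return ip
    }

    func main() {
        _, subnet, _ := net.ParseCIDR("fd00:1234::/64")
        ipa := &ipAllocator{subnet: subnet, used: map[string]bool{}, randomAttempts: 10}
        fmt.Println(ipa.Allocate())
    }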

Reply (Member):

I was specifically asking if random is what mattered, or if "don't reuse" is what matters. I agree with the latter; the former causes me a very small bit of angst w.r.t. static addresses for cluster services like DNS.

Reply from @justinsb (Member, Author):

Ah, gotcha! Yes, random is just a cheap-and-cheerful way of implementing (probably) don't-reuse.

The only other thing is that random also avoids trivially disclosing how many other instances are running, which is important in some shared environments.
