
Initial ipv6 / iptables work #2147

Merged: 5 commits into kubernetes:master on Nov 14, 2014
Conversation

@justinsb (Member) commented Nov 4, 2014

More for discussion than for actual merging (for now).

I think IPv6 could solve the address allocation problem that seems to be holding k8s back on EC2. EC2 "tolerates" IPv6 for internal networking by using protocol 41 encapsulation.

I'm currently working on getting cluster/kube-up.sh to work, but in the meantime feedback on this idea would be helpful!

@jbeda (Contributor) commented Nov 4, 2014

Does this make sense before Docker supports IPv6?

Review comment (Member) on the diff:

    }

    type Protocol bool

Please use int or byte rather than bool - bool will just encourage implicit truth testing.
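
(For illustration, a minimal sketch of the kind of non-bool Protocol type being suggested, assuming only TCP and UDP need to be distinguished today; the constant names are hypothetical, not taken from this PR.)

    package api

    // Protocol identifies an IP protocol. Using a small integer type instead of
    // a bare bool avoids implicit truth testing and leaves room for more values.
    type Protocol byte

    const (
        ProtocolTCP Protocol = iota
        ProtocolUDP
    )

    func (p Protocol) String() string {
        switch p {
        case ProtocolTCP:
            return "TCP"
        case ProtocolUDP:
            return "UDP"
        }
        return "unknown"
    }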

@thockin (Member) commented Nov 4, 2014

I don't see any obvious problems here, but there are a lot of unknowns (to me anyway).

Disabling masquerade sounds OK, but I have no idea what that does to egress in v6 environments (e.g. will the fabric egress traffic with a source IP that isn't "officially known"?). I know GCE won't (for v4; there's no v6 support).

I don't know v6 well, and so can't visually vet the iptables changes without testing.

Do we need v6 for service portals? Those IPs never hit the wire anyway...

@justinsb (Member, Author) commented Nov 4, 2014

Here's my patch for Docker IPv6 support: moby/moby#8896. Hopefully it lands soon! But I'm working on this concurrently in case we want more features in Docker itself.

The idea is that we won't actually route IPv6 outside of our k8s cluster on EC2 (unless EC2 adds support for "real" IPv6 in a week or two ;-). Instead, we'll do something like NAT to get to the IPv4 internet from IPv6. NAT for IPv6 means either running an HTTP proxy (for really locked-down environments) or running something like NAT64 (I tried this with TAYGA and it worked, though I hope we can find something even better!).

If we're on a machine which actually supports real IPv6 (so e.g. the host has a /64), then egress (and ingress!) will work. You often have to respond to neighbor-discovery requests, which you can do as described at http://www.ipsidixit.net/2010/03/24/239/. I imagine we'll add this into Docker for "real" IPv6.

For inbound traffic on EC2, we will have to continue to listen on an IPv4 address, so that EC2 can talk to it. Most likely scenario is that we have nginx/haproxy running in a pod, which then forwards to the correct backend services (over IPv6).

For service portals, IPv4 vs IPv6 doesn't really matter, but I think IPv6 not only gives us more IP addresses but also means that the pods could be IPv6 only.

I think it's also likely that we end up with each Docker instance having a 172.16.x.x IPv4 and a routable IPv6 address.

It's early, but I think this is a good way to explore the option. Let me know if you'd rather switch to a different medium.

@jbeda (Contributor) commented Nov 4, 2014

I see -- you are working the entire stack :)

I'd love to make sure that this works on top of GCE -- which also, unhappily, doesn't currently have IPv6 support. If it'll work on EC2 we can probably make it work on GCE too.

@thockin (Member) commented Nov 4, 2014

Fair enough. I don't see any problems with it, but I have no way to test it or to know that it keeps working. Maybe we can get some e2e support for IPv6.

@justinsb (Member, Author) commented Nov 4, 2014

I'm definitely pushing forwards on multiple fronts here ;-)...
https://code.google.com/p/google-compute-engine/issues/detail?id=8
https://code.google.com/p/google-compute-engine/issues/detail?id=9

I agree this needs e2e tests! I was able to launch a Docker instance on a k8s minion on EC2 with an IPv6 address using docker run, though I had to manually set up the VPC, security groups, instances, etc. (for now). The next step is to try a k8s pod (and to get my EC2 scripts building that VPC etc.). But there are a lot of moving parts here, so I want to get everything out there early!

@jbeda (Contributor) commented Nov 4, 2014

We'd be happy to get your EC2 scripts checked in.

@justinsb (Member, Author) commented Nov 5, 2014

Awesome - I will tidy up the EC2 scripts a little and get them pushed to a branch / PR.

It uses a set (via a map) of allocated IPs
@brendandburns (Contributor) commented:
LGTM, merging.

brendandburns added a commit that referenced this pull request Nov 14, 2014
Initial ipv6 / iptables work
@brendandburns merged commit c2485a4 into kubernetes:master on Nov 14, 2014

Review comment (Member) on pkg/registry/service/ip_allocator.go:

    // Try randomly first
    for i := 0; i < ipa.randomAttempts; i++ {
        ip := ipa.createRandomIp()

Is the random aspect of this important, or just easy?

Reply from @justinsb (Member, Author):

Keeping a full bitmap is out of the question for e.g. a /64, which is what motivated the change.

This should be more efficient than a linear scan, if we expect the address space to be sparsely populated. But there are also correctness aspects:

I've had problems with IP address reuse in the past:

  • where ARP or the ipv6 equivalents got confused (surmountable with unsolicited ARP and equivalents)
  • where the kernel cgroups or bridge got confused (particularly with IPv6; the symptom was that attempting to assign the IPv6 address to the LXC instance would just fail, but after a few instance restarts / time-delay it would eventually work. I don't know if this still happens, or whether I was just doing something wrong.)

Also, it seems a little risky to assign an IP address immediately to the next requester, in case that is a different tenant. Having an LRU queue would probably be better.

Of course, these are real problems, and randomizing just buries them in the long tail. We can change randomAttempts to 0 or just remove the randomizing code, to see if any of these problems still occur.
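
(For illustration, a minimal sketch of the random-first allocation idea under discussion, keeping the set of allocated IPs in a map rather than a full bitmap; the method and field names follow the snippet above, but the rest is hypothetical, not the PR's actual code.)

    package main

    import (
        "crypto/rand"
        "fmt"
        "math/big"
        "net"
    )

    // ipAllocator hands out IPs from a subnet. A map works as the "used" set even
    // for huge ranges (e.g. a /64), where a full bitmap would be out of the question.
    type ipAllocator struct {
        subnet         *net.IPNet
        used           map[string]bool // set of allocated IPs, keyed by ip.String()
        randomAttempts int
    }

    // Allocate tries a few random addresses first; on repeated collisions the real
    // code would fall back to a scan, which is omitted here.
    func (ipa *ipAllocator) Allocate() (net.IP, error) {
        for i := 0; i < ipa.randomAttempts; i++ {
            ip := ipa.createRandomIp()
            key := ip.String()
            if !ipa.used[key] {
                ipa.used[key] = true
                return ip, nil
            }
        }
        return nil, fmt.Errorf("no free IP found in %s after %d random attempts", ipa.subnet, ipa.randomAttempts)
    }

    // createRandomIp picks a uniformly random address inside the subnet.
    func (ipa *ipAllocator) createRandomIp() net.IP {
        ones, bits := ipa.subnet.Mask.Size()
        hostBits := uint(bits - ones)

        // Random offset in [0, 2^hostBits), added to the network base address.
        max := new(big.Int).Lsh(big.NewInt(1), hostBits)
        offset, _ := rand.Int(rand.Reader, max)
        sum := new(big.Int).Add(new(big.Int).SetBytes(ipa.subnet.IP), offset)

        ip := make(net.IP, len(ipa.subnet.IP))
        sum.FillBytes(ip)
        return ip
    }

    func main() {
        _, subnet, _ := net.ParseCIDR("fd00:1234::/64")
        ipa := &ipAllocator{subnet: subnet, used: map[string]bool{}, randomAttempts: 10}
        fmt.Println(ipa.Allocate())
    }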

Reply (Member):

I was specifically asking if random is what mattered, or if "don't reuse" is what matters. I agree with the latter; the former causes me a very small bit of angst w.r.t. static addresses for cluster services like DNS.

Reply from @justinsb (Member, Author):

Ah, gotcha! Yes, random is just a cheap-and-cheerful way of implementing (probably) don't-reuse.

The only other thing is that random also avoids trivially disclosing how many other instances are running, which is important in some shared environments.
