
support variable sized pod-CIDRs for different-sized node (was: Allow --node-cidr-mask-size to change: make NodeIPAM controller support variable sized CIDRs) #90922

Closed
liuxu623 opened this issue May 9, 2020 · 39 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@liuxu623
Contributor

liuxu623 commented May 9, 2020

  • cluster CIDR is 192.168.0.0/16
  • set --node-cidr-mask-size 24
  • add two nodes; their podCIDRs will be 192.168.0.0/24 and 192.168.1.0/24
  • set --node-cidr-mask-size 23 and restart controller-manager
  • delete the node with podCIDR 192.168.1.0/24
  • add a node; its podCIDR will be 192.168.0.0/23
  • 192.168.0.0/24 and 192.168.0.0/23 overlap

I wrote some test code that reproduces this:

package cidrset

import (
	"fmt"
	"net"
	"testing"
)

func TestAllocate(t *testing.T) {
	_, clusterCIDR, _ := net.ParseCIDR("192.168.0.0/16")
	cs, _ := NewCIDRSet(clusterCIDR, 23) // allocator restarted with the new /23 mask
	_, podCIDR, _ := net.ParseCIDR("192.168.0.0/24")
	cs.Occupy(podCIDR) // existing node with podCIDR 192.168.0.0/24
	_, podCIDR, _ = net.ParseCIDR("192.168.1.0/24")
	cs.Occupy(podCIDR)  // existing node with podCIDR 192.168.1.0/24
	cs.Release(podCIDR) // delete the 192.168.1.0/24 node
	p, _ := cs.AllocateNext()
	fmt.Println(p.String()) // prints 192.168.0.0/23, which overlaps the remaining 192.168.0.0/24
}
@liuxu623 liuxu623 added the kind/bug Categorizes issue or PR as related to a bug. label May 9, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 9, 2020
@liuxu623
Contributor Author

liuxu623 commented May 9, 2020

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 9, 2020
@andrewsykim
Member

It would be nice if node IPAM could handle cases where either cluster-cidr or node-cidr-mask-size changes, but generally I don't think changing them is supported.

@athenabot

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label May 9, 2020
@aojea
Member

aojea commented May 12, 2020

  • cluster CIDR is 192.168.0.0/16
  • set --node-cidr-mask-size 24
  • add two nodes; their podCIDRs will be 192.168.0.0/24 and 192.168.1.0/24
  • set --node-cidr-mask-size 23 and restart controller-manager
  • delete the node with podCIDR 192.168.1.0/24
  • add a node; its podCIDR will be 192.168.0.0/23
  • 192.168.0.0/24 and 192.168.0.0/23 overlap

This has consequences that I don't think Kubernetes can handle right now, because it implies that you have to delete all the nodes that overlap with the new mask.
In your example, unless you delete one of the nodes or reassign the masks on all nodes, you can't increase the mask; otherwise the new mask on one node will overlap with the masks of the other nodes.

@liuxu623
Contributor Author

liuxu623 commented May 12, 2020

  • cluster CIDR is 192.168.0.0/16
  • set --node-cidr-mask-size 24
  • add two nodes; their podCIDRs will be 192.168.0.0/24 and 192.168.1.0/24
  • set --node-cidr-mask-size 23 and restart controller-manager
  • delete the node with podCIDR 192.168.1.0/24
  • add a node; its podCIDR will be 192.168.0.0/23
  • 192.168.0.0/24 and 192.168.0.0/23 overlap

This has consequences that I don't think Kubernetes can handle right now, because it implies that you have to delete all the nodes that overlap with the new mask.
In your example, unless you delete one of the nodes or reassign the masks on all nodes, you can't increase the mask; otherwise the new mask on one node will overlap with the masks of the other nodes.

The problem is that we currently use a bitmap to record whether a CIDR is used or not. If we increase --node-cidr-mask-size, more than one node may be using the same (larger) CIDR, so we need to know how many nodes use it; only when that count is zero can we hand the CIDR out again.

So I tried using a map to record how many nodes use each CIDR; see #90926 and please help me review the code, thank you.

@aojea
Member

aojea commented May 12, 2020

The problem is that we currently use a bitmap to record whether a CIDR is used or not. If we increase --node-cidr-mask-size, more than one node may be using the same (larger) CIDR, so we need to know how many nodes use it; only when that count is zero can we hand the CIDR out again.

Ahhh ok, then you don't have to modify the bitmap; you just need to modify the Occupy() method to check whether any of the bits between the begin and end returned by begin, end, err := s.getBeginingAndEndIndices(cidr) are already set:

	for i := begin; i <= end; i++ {
		if s.used.Bit(i) == 1 {
			return fmt.Errorf("some of the CIDRs are already allocated")
		}
	}

// Occupy marks the given CIDR range as used. Occupy does not check if the CIDR
// range was previously used.
func (s *CidrSet) Occupy(cidr *net.IPNet) (err error) {
	begin, end, err := s.getBeginingAndEndIndices(cidr)
	if err != nil {
		return err
	}
	s.Lock()
	defer s.Unlock()
	for i := begin; i <= end; i++ {
		s.used.SetBit(&s.used, i, 1)
	}
	return nil
}

The comment on the method explicitly says "Occupy does not check if the CIDR range was previously used", but I don't know the context. Maybe @bowei knows better.

I still have doubts that node-mask resizing can be supported; I think the AllocateNext() method will not work after that 🤔, or may cause conflicting assignments.
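
For illustration only, here is a minimal, self-contained sketch of that check-before-set idea using the same math/big bitmap; occupyChecked and its standalone signature are hypothetical, not the real cidrset API:

package main

import (
	"fmt"
	"math/big"
)

// occupyChecked is a hypothetical standalone version of the check described
// above: refuse to mark bits [begin, end] if any of them is already set.
func occupyChecked(used *big.Int, begin, end int) error {
	for i := begin; i <= end; i++ {
		if used.Bit(i) == 1 {
			return fmt.Errorf("bit %d is already allocated", i)
		}
	}
	for i := begin; i <= end; i++ {
		used.SetBit(used, i, 1)
	}
	return nil
}

func main() {
	var used big.Int
	fmt.Println(occupyChecked(&used, 0, 0)) // <nil>: the first block is free
	fmt.Println(occupyChecked(&used, 0, 0)) // error: that block is already taken
}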

@liuxu623
Contributor Author

liuxu623 commented May 13, 2020

I still have doubts that node-mask resizing can be supported; I think the AllocateNext() method will not work after that 🤔, or may cause conflicting assignments.

The problem is that the nodes with 192.168.0.0/24 and 192.168.1.0/24 both fall inside 192.168.0.0/23; only when both 192.168.0.0/24 and 192.168.1.0/24 have been deleted should we release 192.168.0.0/23. But the bitmap we use now can only record whether a CIDR is used or not, not how many nodes use it, so deleting either node (192.168.0.0/24 or 192.168.1.0/24) releases 192.168.0.0/23.

// Release releases the given CIDR range.
func (s *CidrSet) Release(cidr *net.IPNet) error {
	begin, end, err := s.getBeginingAndEndIndices(cidr)
	if err != nil {
		return err
	}
	s.Lock()
	defer s.Unlock()
	for i := begin; i <= end; i++ {
		s.used.SetBit(&s.used, i, 0)
	}
	return nil
}

So I tried using a map to record how many nodes use each CIDR: when a node is deleted we do s.used[i]--, and AllocateNext() will not hand out the CIDR while s.used[i] > 0, so AllocateNext() keeps working normally.

// Release releases the given CIDR range.
// Sketch of the proposed change: s.used becomes a per-index counter rather
// than a bitmap, so a block is only freed when its count drops to zero.
func (s *CidrSet) Release(cidr *net.IPNet) error {
	begin, end, err := s.getBeginingAndEndIndices(cidr)
	if err != nil {
		return err
	}
	s.Lock()
	defer s.Unlock()
	for i := begin; i <= end; i++ {
		if s.used[i] > 0 {
			s.used[i]--
		}
	}
	return nil
}
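
To make the reference-counting idea concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual #90926 implementation): each block keeps a count of how many node podCIDRs fall inside it, and allocation skips blocks whose count is still non-zero.

package main

import "fmt"

// refCountSet is a hypothetical sketch: instead of a single used bit per
// block, keep a count of how many node podCIDRs fall inside each block.
type refCountSet struct {
	used map[int]int // block index -> number of nodes occupying it
	max  int         // total number of blocks in the cluster CIDR
}

func newRefCountSet(max int) *refCountSet {
	return &refCountSet{used: map[int]int{}, max: max}
}

func (s *refCountSet) occupy(i int) { s.used[i]++ }

func (s *refCountSet) release(i int) {
	if s.used[i] > 0 {
		s.used[i]--
	}
}

// allocateNext hands out the first block that no node is using anymore.
func (s *refCountSet) allocateNext() (int, error) {
	for i := 0; i < s.max; i++ {
		if s.used[i] == 0 {
			s.used[i]++
			return i, nil
		}
	}
	return 0, fmt.Errorf("CIDR range is full")
}

func main() {
	s := newRefCountSet(128) // 128 /23 blocks in a /16 cluster CIDR
	s.occupy(0)              // node with 192.168.0.0/24 lands in block 0 (192.168.0.0/23)
	s.occupy(0)              // node with 192.168.1.0/24 also lands in block 0
	s.release(0)             // delete one of the two nodes
	i, _ := s.allocateNext()
	fmt.Println(i) // 1: block 0 still has one user, so no overlapping assignment
}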

@liuxu623
Contributor Author

Ahhh ok, then you don't have to modify the bitmap; you just need to modify the Occupy() method to check whether any of the bits between the begin and end returned by begin, end, err := s.getBeginingAndEndIndices(cidr) are already set.

If Occupy() returns an error, NodeIPAMController will not start; we don't want that.

@aojea
Member

aojea commented May 13, 2020

Thanks for the explanation @liuxu623, it makes more sense to me now.
However, this seems like an important change to the behavior of the IPAM, and usually a KEP is required.
Let's ask Tim what the best way to proceed would be.

/assign @thockin

@thockin
Member

thockin commented May 14, 2020

The short answer is that this field was not designed to be changed.

The longer answer is that it COULD be (re)designed to be changeable, but as you point out that would require changing the bitmap-based allocator to something more sophisticated. AND it would have to be done in a way that lets us upgrade existing clusters with existing allocations.

I won't say that we would not take such a change, but it would need to be KEP'ed and very carefully tested with regards to in-place upgrades.

@thockin
Member

thockin commented May 14, 2020

I am going to re-title this and declare it a feature request with relatively low prio. If someone wants to step up to implement, we can talk.

@thockin thockin changed the title NodeIPAM allocate overlap PodCIDR when increase --node-cidr-mask-size Allow --node-cidr-mask-size to change: make NodeIPAM controller support variable sized CIDRs May 14, 2020
@thockin thockin added kind/feature Categorizes issue or PR as related to a new feature. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. and removed kind/bug Categorizes issue or PR as related to a bug. triage/unresolved Indicates an issue that can not or will not be resolved. labels May 14, 2020
@liuxu623
Contributor Author

liuxu623 commented May 26, 2020

@fl-max

fl-max commented Jun 24, 2020

I think expanding a CIDR block (i.e. /24 to /23) is much more tedious than shrinking one (i.e. /24 to /25). Assuming this is true, maybe it would be better to split this into two features: "Expand CIDR mask size" and "Shrink CIDR mask size". I believe the latter would be much simpler to implement.
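
For what it's worth, a quick test in the same style as the reproducer above sketches why the shrink direction looks simpler: with the current bitmap, an existing /24 simply occupies both of its /25 halves, so new /25 allocations cannot overlap it (this assumes, like the earlier test, that it runs inside the cidrset package):

func TestShrinkMask(t *testing.T) {
	_, clusterCIDR, _ := net.ParseCIDR("192.168.0.0/16")
	cs, _ := NewCIDRSet(clusterCIDR, 25) // allocator restarted with the new /25 mask
	_, podCIDR, _ := net.ParseCIDR("192.168.0.0/24")
	cs.Occupy(podCIDR) // existing node keeps its /24; this marks both /25 halves
	p, _ := cs.AllocateNext()
	fmt.Println(p.String()) // 192.168.1.0/25: does not overlap the existing /24
}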

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@pacoxu
Member

pacoxu commented Mar 2, 2021

@aojea OK. Let me follow up.
/assign

So I need to move on to the KEP first and then the PR.
I think I need to use the new KEP template and continue the PR kubernetes/enhancements#1811 here.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2021
@aojea
Member

aojea commented May 31, 2021

/remove-lifecycle stale
There is a KEP in progress kubernetes/enhancements#2593

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2021
@thockin thockin changed the title Allow --node-cidr-mask-size to change: make NodeIPAM controller support variable sized CIDRs support variable sized pod-CIDRs for different-sized node (was: Allow --node-cidr-mask-size to change: make NodeIPAM controller support variable sized CIDRs) Jun 4, 2021
@pacoxu
Member

pacoxu commented Jun 7, 2021

/unassign
since another KEP is in progress.

@aojea
Member

aojea commented Jun 7, 2021

let's assign to the KEP owner then :)
/assign @rahulkjoshi

@aojea
Member

aojea commented Jun 7, 2021

/assign @rahulkjoshi

@k8s-ci-robot
Contributor

@aojea: GitHub didn't allow me to assign the following users: rahulkjoshi.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @rahulkjoshi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bnu0

bnu0 commented Aug 6, 2021

I don't think the KEP covers what is being demonstrated here, which really seems like a bug (one I hit today while trying to change kube-controller-manager's default node podCIDR prefix from a /24 to a /23).

As far as I can tell, this issue is not looking for a way to re-IP existing nodes. It is showing that if you change the kube-controller-manager argument from /24 to /23, then when a new node registers it can be errantly given a CIDR that overlaps with existing nodes that already have the previous CIDR size. This is very dangerous, for obvious reasons.

This happens without any attempt to re-IP anything: when a node with a /24 is deleted, the allocator marks the whole containing subnet (now a /23) as available and allocates it regardless of the neighboring /24.

@aojea
Member

aojea commented Aug 8, 2021

I don't think the KEP covers what is being demonstrated here, which really seems like a bug (one I hit today while trying to change kube-controller-manager's default node podCIDR prefix from a /24 to a /23).

ah, right, sorry, I overlooked all the details.
The KEP allows using different sizes, but it keeps the limitation of not being able to change the size of an already-assigned CIDR.
It is a known limitation, and maybe it should be better documented, but this field was close to being deprecated too #57130 🤷

@thockin
Member

thockin commented Aug 21, 2021 via email

@rahulkjoshi
Contributor

Yes, Tim is right. You would delete the old config -- the finalizer preserves the config until the last Node using that range goes away, but it becomes un-allocatable. Then you add your new config and restart all the nodes to use the new CIDR size.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 20, 2021
@rahulkjoshi
Contributor

/remove-lifecycle stale

The PR mentioned above should handle the use case where the flag value gets changed. The code is targeted to be merged in 1.24.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 21, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 19, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 22, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
