Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[occm] Improve route controller reconciling to ensure the cluster's nodes can access each other #2484

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jeffyjf
Copy link
Contributor

@jeffyjf jeffyjf commented Nov 28, 2023

What this PR does / why we need it:

In order to the nodes of one cluster can access each other, route controller need to check and set three things:

  1. The openstack router's route rules, so that the packets can be forwarded to correct nodes.
  2. The node port's AllowAddressPair, to ensure the node permit the packets that access the node's pods/services pass through.
  3. The openstack security group's rules, so that the nodes that bind the security group permit the packets from other nodes enter into.

The current codes just check router's route rule when controller call ListRoutes and just set Route and AllowAddressPair when controller call CreateRoute. This PR complements all of the other works.

Which issue this PR fixes(if applicable):
fixes #2482

Special notes for reviewers:

This PR lack of unit tests, because of I found occm lack of a mechanism to mock openstack client. Further more the whole occm lack of lots of unit tests due to the reason. I plan to dive deeper into gophercloud in the next several days try to study whether it is possible to mock it. If it is possible I'll commit anohter PR to add the unit tests.

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 28, 2023
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 28, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @jeffyjf. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 28, 2023
@dulek
Copy link
Contributor

dulek commented Nov 28, 2023

Why did this ended up in Routes controller? The issue description feels like this is a general issue in cases CAPO is missing SG rules to allow interconnectivity of Node.

pkg/openstack/routes.go Outdated Show resolved Hide resolved
@jeffyjf
Copy link
Contributor Author

jeffyjf commented Nov 29, 2023

Why did this ended up in Routes controller? The issue description feels like this is a general issue in cases CAPO is missing SG rules to allow interconnectivity of Node.

According to the offical document:

The route controller is responsible for configuring routes in the cloud appropriately so that containers on different nodes in your Kubernetes cluster can communicate with each other.

I think that this is route controller's duty. I user don't activate route controller, the interconnectivity of pods should be ensured by other mechanisms.

@jichenjc
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 29, 2023
@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from 6b498a6 to f9ab6ef Compare December 6, 2023 03:35
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 6, 2023
@jeffyjf jeffyjf force-pushed the route-controller-security-group branch 3 times, most recently from 3c6af74 to d639248 Compare December 13, 2023 01:48
@jeffyjf jeffyjf changed the title [occm][WIP] Add SG to node ensure the different node's pods can access each other [occm] Add SG to node ensure the different node's pods can access each other Dec 13, 2023
@jeffyjf
Copy link
Contributor Author

jeffyjf commented Dec 13, 2023

@dulek @jichenjc @kayrus @zetaab asking for review

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 13, 2023
@kayrus
Copy link
Contributor

kayrus commented Dec 13, 2023

The PR needs a rebase. However the #2499 is on the way, and I adopted your getSecurityGroupRules change there as well.

@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from d639248 to eaaea60 Compare December 14, 2023 01:12
@jeffyjf
Copy link
Contributor Author

jeffyjf commented Dec 14, 2023

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 14, 2023
@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from eaaea60 to 35f8dca Compare December 14, 2023 03:00
@jeffyjf jeffyjf changed the title [occm] Add SG to node ensure the different node's pods can access each other [occm] Improve route controller reconciling to ensure the cluster's nodes can access each other Dec 14, 2023
@jeffyjf
Copy link
Contributor Author

jeffyjf commented Dec 14, 2023

/remove hold

@jeffyjf
Copy link
Contributor Author

jeffyjf commented Dec 14, 2023

/remove-hold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 14, 2023
@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from 35f8dca to abe4633 Compare December 14, 2023 03:26
pkg/openstack/routes.go Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 14, 2023
@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from abe4633 to 1a93fd9 Compare December 15, 2023 09:27
Copy link
Contributor

@dulek dulek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some remarks inline. In general I'm not totally against it, but I think the code should introduce a new option that will control the new behavior of creating and updating the SG. I don't think this works for every user here.

pkg/openstack/routes.go Outdated Show resolved Hide resolved
pkg/openstack/routes.go Outdated Show resolved Hide resolved
pkg/openstack/routes.go Outdated Show resolved Hide resolved
pkg/openstack/routes.go Outdated Show resolved Hide resolved
pkg/openstack/routes.go Outdated Show resolved Hide resolved
Comment on lines +383 to +404
origSgs := port.SecurityGroups
mc := metrics.NewMetricContext("port", "update")
_, err := ports.Update(network, port.ID, ports.UpdateOpts{
SecurityGroups: &sgs,
}).Extract()
if mc.ObserveRequest(err) != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is prone to a lost update race condition, but it's not a big deal IMO.

pkg/openstack/routes.go Show resolved Hide resolved
pkg/openstack/routes.go Outdated Show resolved Hide resolved
pkg/openstack/routes.go Outdated Show resolved Hide resolved
pkg/openstack/routes.go Outdated Show resolved Hide resolved
@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from 1a93fd9 to d1da026 Compare January 15, 2024 05:29
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kayrus for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jeffyjf jeffyjf force-pushed the route-controller-security-group branch 4 times, most recently from 87dea84 to 81236c6 Compare January 15, 2024 06:03
…onnectivity

In oreder to ensure the connectivity between different nodes, route controller need
to do the following things:

1. Check and set openstack router's route rules, so that the packets can be
   forwarded to correct nodes.
2. Check and set the node port's AllowAddressPair, to permit the packets from
   the pods pass through the node's port and leave the node.
3. Check and set openstack security group's rules, so that the nodes that bind the
   security group permit the packets from other nodes enter into.

But, the previous occm codes just check router's route rules, just set router's
route and AllowAddressPair, this patch completed the other steps.
@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from 81236c6 to ee22e72 Compare January 15, 2024 07:14
@jeffyjf
Copy link
Contributor Author

jeffyjf commented Jan 15, 2024

/retest pull-cloud-provider-openstack-test

@k8s-ci-robot
Copy link
Contributor

@jeffyjf: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test openstack-cloud-controller-manager-e2e-test
  • /test openstack-cloud-controller-manager-ovn-e2e-test
  • /test openstack-cloud-csi-cinder-e2e-test
  • /test openstack-cloud-csi-cinder-sanity-test
  • /test openstack-cloud-csi-manila-e2e-test
  • /test openstack-cloud-csi-manila-sanity-test
  • /test openstack-cloud-keystone-authentication-authorization-test
  • /test pull-cloud-provider-openstack-check
  • /test pull-cloud-provider-openstack-test

Use /test all to run the following jobs that were automatically triggered:

  • openstack-cloud-controller-manager-e2e-test
  • openstack-cloud-controller-manager-ovn-e2e-test
  • pull-cloud-provider-openstack-check
  • pull-cloud-provider-openstack-test

In response to this:

/retest pull-cloud-provider-openstack-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jeffyjf
Copy link
Contributor Author

jeffyjf commented Jan 15, 2024

/test pull-cloud-provider-openstack-test

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 14, 2024
@dulek
Copy link
Contributor

dulek commented May 23, 2024

/remove-lifecycle rotten
/lgtm

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 23, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 23, 2024
Copy link
Contributor

@kayrus kayrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR. Could you take a look at comments I just left?

@@ -483,7 +485,7 @@ func (os *OpenStack) Routes() (cloudprovider.Routes, bool) {
return nil, false
}

r, err := NewRoutes(os, network, netExts["extraroute-atomic"], netExts["allowed-address-pairs"])
r, err := NewRoutes(os, network, netExts["extraroute-atomic"], netExts["allowed-address-pairs"], os.routeOpts.AutoConfigSecurityGroup)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the os is already passed to the NewRoutes func. There is no need to pass the os.routeOpts.AutoConfigSecurityGroup.

Comment on lines +86 to +98
if err != nil {
if errors.IsNotFound(err) {
mc := metrics.NewMetricContext("security_group", "create")
klog.V(2).InfoS("Create node security group", "name", sgName)
group, err := groups.Create(r.network, groups.CreateOpts{Name: sgName}).Extract()
if mc.ObserveRequest(err) != nil {
return err
}
sgId = group.ID
} else {
return err
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err != nil {
if errors.IsNotFound(err) {
mc := metrics.NewMetricContext("security_group", "create")
klog.V(2).InfoS("Create node security group", "name", sgName)
group, err := groups.Create(r.network, groups.CreateOpts{Name: sgName}).Extract()
if mc.ObserveRequest(err) != nil {
return err
}
sgId = group.ID
} else {
return err
}
}
if err != nil {
if !errors.IsNotFound(err) {
return err
}
mc := metrics.NewMetricContext("security_group", "create")
klog.V(2).InfoS("Create node security group", "name", sgName)
group, err := groups.Create(r.network, groups.CreateOpts{Name: sgName}).Extract()
if mc.ObserveRequest(err) != nil {
return err
}
r.nodeSecurityGroupId = group.ID
return nil
}

Comment on lines +147 to +156
if r.allowedAddressPairs {
// check whether the related AllowAddressPair is existing
for _, addrPair := range port.AllowedAddressPairs {
if addrPair.IPAddress == route.DestinationCIDR {
return true
}
}
return false
}
return true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if r.allowedAddressPairs {
// check whether the related AllowAddressPair is existing
for _, addrPair := range port.AllowedAddressPairs {
if addrPair.IPAddress == route.DestinationCIDR {
return true
}
}
return false
}
return true
if !r.allowedAddressPairs {
return true
}
// check whether the related AllowAddressPair is existing
for _, addrPair := range port.AllowedAddressPairs {
if addrPair.IPAddress == route.DestinationCIDR {
return true
}
}
return false

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[occm] The pods cannot access each other when across nodes
6 participants