Add sync_proxy_rules_no_endpoints_total metric #108930

Merged
merged 5 commits into kubernetes:master on Apr 1, 2022

Conversation

MaxRenaud
Contributor

@MaxRenaud MaxRenaud commented Mar 23, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it: Adds a metric that counts services with a Local traffic policy that have no available endpoints. The metric has a single label, traffic_policy, with two possible values:

  • "internal" when the service uses the internal traffic policy.
  • "external" when the service uses the external traffic policy.
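
As a rough illustration of the two label values described above, the following stdlib-only Go sketch tallies services that use a Local traffic policy but have no endpoints on the node. The `service` struct, its fields, and `countNoLocalEndpoints` are hypothetical names chosen for this example; they are not the kube-proxy types.

```go
package main

import "fmt"

// service is a hypothetical, simplified stand-in for kube-proxy's service info.
type service struct {
	name           string
	internalLocal  bool // internal traffic policy is Local
	externalLocal  bool // external traffic policy is Local
	localEndpoints int  // ready endpoints on this node
}

// countNoLocalEndpoints returns, per traffic_policy label value, how many
// services use the Local policy but have no endpoints on this node.
func countNoLocalEndpoints(services []service) map[string]int {
	counts := map[string]int{"internal": 0, "external": 0}
	for _, svc := range services {
		if svc.localEndpoints > 0 {
			continue
		}
		if svc.internalLocal {
			counts["internal"]++
		}
		if svc.externalLocal {
			counts["external"]++
		}
	}
	return counts
}

func main() {
	services := []service{
		{name: "a", internalLocal: true},
		{name: "b", externalLocal: true},
		{name: "c", internalLocal: true, externalLocal: true, localEndpoints: 2},
	}
	counts := countNoLocalEndpoints(services)
	fmt.Println(counts["internal"], counts["external"]) // prints "1 1"
}
```

A service that sets both policies to Local and has no local endpoints would be counted once under each label value in this sketch.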

Which issue(s) this PR fixes:

Fixes #2086

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Introduction of a new "sync_proxy_rules_no_local_endpoints_total" proxy metric. This metric represents the number of services with no internal endpoints. The "traffic_policy" label will contain either "internal" or "external".
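
For orientation, a metric like this would typically surface as one series per label value. The sketch below renders sample lines in the Prometheus text exposition format using only the standard library; the `renderSeries` helper and the sample values are illustrative assumptions, and the exported name may carry a component prefix that is not shown here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderSeries formats one line per label value in Prometheus text
// exposition style: name{label="value"} sample.
func renderSeries(name, label string, values map[string]int) string {
	keys := make([]string, 0, len(values))
	for k := range values {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output order

	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s=%q} %d\n", name, label, k, values[k])
	}
	return b.String()
}

func main() {
	fmt.Print(renderSeries(
		"sync_proxy_rules_no_local_endpoints_total",
		"traffic_policy",
		map[string]int{"internal": 1, "external": 0}, // made-up sample values
	))
}
```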

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/c62cfc32ad2d9d247ade41d0f6004f3de8fd72a6/keps/sig-network/2086-service-internal-traffic-policy/README.md

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the release-note, do-not-merge/work-in-progress, kind/feature, size/L, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, and needs-priority labels Mar 23, 2022
@MaxRenaud
Contributor Author

/test all

@k8s-ci-robot k8s-ci-robot added the area/ipvs, sig/instrumentation, and sig/network labels and removed the do-not-merge/needs-sig label Mar 23, 2022
@MaxRenaud
Contributor Author

/assign andrewsykim
/sig network

@MaxRenaud MaxRenaud marked this pull request as ready for review March 23, 2022 18:33
@MaxRenaud MaxRenaud changed the title [WIP] Add sync_proxy_rules_no_endpoints_total metric Add sync_proxy_rules_no_endpoints_total metric Mar 23, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 23, 2022
localNoEndpoints int64
// externalNoEndpoints represents the number of rules that couldn't be applied due to
// the absence of any endpoints.
externalNoEndpoints int64
Member

nit: I would name this clusterNoEndpoints instead

Contributor Author

Done

@@ -1327,6 +1329,11 @@ func (proxier *Proxier) syncProxyRules() {
}

if !hasEndpoints {
if svc.NodeLocalInternal() {
Member

FYI there may be some changes needed after #106497 is merged. I don't expect the conflict to be big, but just a heads up.

Contributor Author

Ack.

@@ -273,6 +273,12 @@ type Proxier struct {
// Inject for test purpose.
networkInterfacer utilproxy.NetworkInterfacer
gracefuldeleteManager *GracefulTerminationManager
// localNoEndpoints represents the number of rules that couldn't be applied due to
// the absence of local endpoints.
Member

A comment referencing the metric this is used for would be helpful, same for clusterNoEndpoints.

Contributor Author

Done for both.

@@ -967,6 +967,8 @@ func (proxier *Proxier) syncProxyRules() {
}
}

serviceNoEndpointsTotalInternal := 0
Member

A comment here for the metric this is used for would be helpful

Contributor Author

Done
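
One reason to keep a per-sync local counter, rather than incrementing the metric directly, can be sketched as follows: the count is rebuilt from zero on every sync and published only at the end, so the exported value reflects the current state instead of accumulating across syncs. The `gauge` type and `syncOnce` function are hypothetical stand-ins, not the kube-proxy or component-base metrics API.

```go
package main

import (
	"fmt"
	"sync"
)

// gauge is a hypothetical minimal gauge keyed by a single label value.
type gauge struct {
	mu     sync.Mutex
	values map[string]float64
}

func newGauge() *gauge { return &gauge{values: map[string]float64{}} }

// Set overwrites the value stored for one label value.
func (g *gauge) Set(label string, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[label] = v
}

// Get returns the value stored for one label value.
func (g *gauge) Get(label string) float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.values[label]
}

// syncOnce models a single sync pass: hasLocalEndpoints[i] reports whether the
// i-th internal-policy service has local endpoints on this node.
func syncOnce(g *gauge, hasLocalEndpoints []bool) {
	noEndpointsInternal := 0 // reset at the start of every sync
	for _, ok := range hasLocalEndpoints {
		if !ok {
			noEndpointsInternal++
		}
	}
	// Publish once, after the loop, so the gauge never shows a partial count.
	g.Set("internal", float64(noEndpointsInternal))
}

func main() {
	g := newGauge()
	syncOnce(g, []bool{false, false, true})
	fmt.Println(g.Get("internal")) // 2
	syncOnce(g, []bool{true, true, true})
	fmt.Println(g.Get("internal")) // 0: the value does not accumulate
}
```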

@MaxRenaud
Contributor Author

/retest unrelated

@k8s-ci-robot
Contributor

@MaxRenaud: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test pull-kubernetes-conformance-kind-ga-only-parallel
  • /test pull-kubernetes-dependencies
  • /test pull-kubernetes-dependencies-go-canary
  • /test pull-kubernetes-e2e-gce
  • /test pull-kubernetes-e2e-gce-100-performance
  • /test pull-kubernetes-e2e-gce-big-performance
  • /test pull-kubernetes-e2e-gce-canary
  • /test pull-kubernetes-e2e-gce-large-performance
  • /test pull-kubernetes-e2e-gce-network-proxy-http-connect
  • /test pull-kubernetes-e2e-gce-no-stage
  • /test pull-kubernetes-e2e-gce-ubuntu
  • /test pull-kubernetes-e2e-gce-ubuntu-containerd
  • /test pull-kubernetes-e2e-gce-ubuntu-containerd-canary
  • /test pull-kubernetes-e2e-kind
  • /test pull-kubernetes-e2e-kind-ipv6
  • /test pull-kubernetes-files-remake
  • /test pull-kubernetes-integration
  • /test pull-kubernetes-integration-go-canary
  • /test pull-kubernetes-kubemark-e2e-gce-scale
  • /test pull-kubernetes-node-e2e-containerd
  • /test pull-kubernetes-typecheck
  • /test pull-kubernetes-unit
  • /test pull-kubernetes-unit-go-canary
  • /test pull-kubernetes-update
  • /test pull-kubernetes-verify
  • /test pull-kubernetes-verify-go-canary
  • /test pull-kubernetes-verify-govet-levee

The following commands are available to trigger optional jobs:

  • /test check-dependency-stats
  • /test pull-kubernetes-conformance-image-test
  • /test pull-kubernetes-conformance-kind-ga-only
  • /test pull-kubernetes-conformance-kind-ipv6-parallel
  • /test pull-kubernetes-cross
  • /test pull-kubernetes-e2e-aks-engine-azure-disk-windows-containerd
  • /test pull-kubernetes-e2e-aks-engine-azure-file-windows-containerd
  • /test pull-kubernetes-e2e-aks-engine-windows-containerd
  • /test pull-kubernetes-e2e-capz-azure-disk
  • /test pull-kubernetes-e2e-capz-azure-disk-vmss
  • /test pull-kubernetes-e2e-capz-azure-file
  • /test pull-kubernetes-e2e-capz-azure-file-vmss
  • /test pull-kubernetes-e2e-capz-conformance
  • /test pull-kubernetes-e2e-capz-ha-control-plane
  • /test pull-kubernetes-e2e-containerd-gce
  • /test pull-kubernetes-e2e-gce-alpha-features
  • /test pull-kubernetes-e2e-gce-correctness
  • /test pull-kubernetes-e2e-gce-csi-serial
  • /test pull-kubernetes-e2e-gce-device-plugin-gpu
  • /test pull-kubernetes-e2e-gce-iscsi
  • /test pull-kubernetes-e2e-gce-iscsi-serial
  • /test pull-kubernetes-e2e-gce-kubetest2
  • /test pull-kubernetes-e2e-gce-network-proxy-grpc
  • /test pull-kubernetes-e2e-gce-storage-disruptive
  • /test pull-kubernetes-e2e-gce-storage-slow
  • /test pull-kubernetes-e2e-gce-storage-snapshot
  • /test pull-kubernetes-e2e-gci-gce-autoscaling
  • /test pull-kubernetes-e2e-gci-gce-ingress
  • /test pull-kubernetes-e2e-gci-gce-ipvs
  • /test pull-kubernetes-e2e-iptables-azure-dualstack
  • /test pull-kubernetes-e2e-ipvs-azure-dualstack
  • /test pull-kubernetes-e2e-kind-canary
  • /test pull-kubernetes-e2e-kind-dual-canary
  • /test pull-kubernetes-e2e-kind-ipv6-canary
  • /test pull-kubernetes-e2e-kind-ipvs-dual-canary
  • /test pull-kubernetes-e2e-kind-multizone
  • /test pull-kubernetes-e2e-kops-aws
  • /test pull-kubernetes-e2e-ubuntu-gce-network-policies
  • /test pull-kubernetes-e2e-windows-gce
  • /test pull-kubernetes-kubemark-e2e-gce-big
  • /test pull-kubernetes-local-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e
  • /test pull-kubernetes-node-crio-cgrpv2-e2e-kubetest2
  • /test pull-kubernetes-node-crio-e2e
  • /test pull-kubernetes-node-crio-e2e-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-features
  • /test pull-kubernetes-node-e2e-containerd-features-kubetest2
  • /test pull-kubernetes-node-e2e-containerd-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-containerd
  • /test pull-kubernetes-node-kubelet-serial-containerd-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
  • /test pull-kubernetes-node-kubelet-serial-hugepages
  • /test pull-kubernetes-node-kubelet-serial-memory-manager
  • /test pull-kubernetes-node-kubelet-serial-topology-manager
  • /test pull-kubernetes-node-kubelet-serial-topology-manager-kubetest2
  • /test pull-kubernetes-node-memoryqos-cgrpv2
  • /test pull-kubernetes-node-swap-fedora
  • /test pull-kubernetes-node-swap-fedora-serial
  • /test pull-kubernetes-node-swap-ubuntu-serial
  • /test pull-kubernetes-unit-experimental
  • /test pull-publishing-bot-validate

Use /test all to run the following jobs that were automatically triggered:

  • pull-kubernetes-conformance-kind-ga-only-parallel
  • pull-kubernetes-dependencies
  • pull-kubernetes-e2e-gce-100-performance
  • pull-kubernetes-e2e-gce-ubuntu-containerd
  • pull-kubernetes-e2e-gci-gce-ipvs
  • pull-kubernetes-e2e-kind
  • pull-kubernetes-e2e-kind-ipv6
  • pull-kubernetes-integration
  • pull-kubernetes-node-e2e-containerd
  • pull-kubernetes-typecheck
  • pull-kubernetes-unit
  • pull-kubernetes-verify
  • pull-kubernetes-verify-govet-levee

In response to this:

/retest unrelated

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MaxRenaud
Contributor Author

/retest

@logicalhan
Member

/assign @dgrisonnet
/triage accepted

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2022
@MaxRenaud
Contributor Author

Verify check needs to be addressed:

{Script Error ScriptError pkg/proxy/ipvs/proxier.go:281:14: "mulitple" is a misspelling of "multiple"
pkg/proxy/ipvs/proxier.go:288:14: "mulitple" is a misspelling of "multiple"}

Done

Comment on lines 1374 to 1373
args = append(args[:0],
"-A", string(policyLocalChain),
"-m", "comment", "--comment",
fmt.Sprintf(`"%s has no local endpoints"`, svcNameString),
"-j",
string(KubeMarkDropChain),
)
proxier.natRules.Write(args)
Contributor

hm... that use of args is unnecessary with the current code...

Suggested change
- args = append(args[:0],
- "-A", string(policyLocalChain),
- "-m", "comment", "--comment",
- fmt.Sprintf(`"%s has no local endpoints"`, svcNameString),
- "-j",
- string(KubeMarkDropChain),
- )
- proxier.natRules.Write(args)
+ proxier.natRules.Write(
+ "-A", string(policyLocalChain),
+ "-m", "comment", "--comment",
+ fmt.Sprintf(`"%s has no local endpoints"`, svcNameString),
+ "-j",
+ string(KubeMarkDropChain),
+ )

Contributor Author

I have just moved the code. Is it OK if we keep it as is for this PR? I'm getting an exception since we are past the freeze and my justification for the low risk is that we are simply adding a metric.

Member

this will tickle a rebase on my PR which touches the args usage, too, I think
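
The two styles discussed in this thread differ only in how the argument list is built. The sketch below contrasts them using a hypothetical `lineBuffer` type whose variadic `Write` joins its arguments into one rule line; it is a simplified assumption for illustration and does not reproduce the actual kube-proxy buffer implementation. The chain name and service name are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// lineBuffer is a hypothetical stand-in that collects one rule per line.
type lineBuffer struct {
	lines []string
}

// Write joins its arguments with spaces and records them as one line.
func (b *lineBuffer) Write(args ...string) {
	b.lines = append(b.lines, strings.Join(args, " "))
}

func main() {
	chain := "KUBE-SVL-EXAMPLE" // made-up chain name
	comment := fmt.Sprintf(`"%s has no local endpoints"`, "ns/svc")

	// Style 1: reuse a scratch slice by truncating it to length zero.
	args := make([]string, 0, 16)
	args = append(args[:0], "-A", chain, "-m", "comment", "--comment", comment, "-j", "KUBE-MARK-DROP")
	var reused lineBuffer
	reused.Write(args...)

	// Style 2: pass the arguments straight to the variadic call.
	var direct lineBuffer
	direct.Write("-A", chain, "-m", "comment", "--comment", comment, "-j", "KUBE-MARK-DROP")

	fmt.Println(reused.lines[0] == direct.lines[0]) // true
}
```

Both styles produce the same rule line; the direct call simply avoids the intermediate slice when nothing else needs it.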

{"10.0.1.3", "host2"},
},
expectedSyncProxyRulesNoLocalEndpointsTotalInternal: 1,
expectedSyncProxyRulesNoLocalEndpointsTotalExternal: 1,
Contributor

would be good to have some test where the metric is more than 1

Contributor Author

Can we do this as a quick follow-up? It involves refactoring the test to include dynamically generating services. Since we're after the freeze, I'd like to get this one in

Member

ACK
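
A follow-up test along the lines requested here might generate services in a loop so that the expected count exceeds one. The sketch below is a stdlib-only assumption of what that could look like; `fakeService`, `makeServices`, and `countNoLocal` are hypothetical helpers and not part of the existing proxier tests.

```go
package main

import "fmt"

// fakeService is a hypothetical test fixture.
type fakeService struct {
	name           string
	internalLocal  bool
	localEndpoints int
}

// makeServices generates n internal-Local services with no local endpoints.
func makeServices(n int) []fakeService {
	svcs := make([]fakeService, 0, n)
	for i := 0; i < n; i++ {
		svcs = append(svcs, fakeService{
			name:          fmt.Sprintf("svc-%d", i),
			internalLocal: true,
		})
	}
	return svcs
}

// countNoLocal counts internal-Local services without local endpoints.
func countNoLocal(svcs []fakeService) int {
	total := 0
	for _, s := range svcs {
		if s.internalLocal && s.localEndpoints == 0 {
			total++
		}
	}
	return total
}

func main() {
	// Table of service counts, including a case where the metric exceeds one.
	for _, n := range []int{0, 1, 3} {
		got := countNoLocal(makeServices(n))
		fmt.Printf("services=%d metric=%d\n", n, got)
	}
}
```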

@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Archive-it in SIG Node CI/Test Board Mar 30, 2022
@thockin
Member

thockin commented Mar 31, 2022

I think this is OK, even if some minor fixups would be nice.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm and needs-rebase labels Mar 31, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm and needs-rebase labels Mar 31, 2022
@andrewsykim
Member

/milestone v1.24
/hold cancel

(exception for this PR was approved)

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 31, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Mar 31, 2022
@pacoxu
Member

pacoxu commented Apr 1, 2022

/retest

@k8s-ci-robot
Contributor

k8s-ci-robot commented Apr 1, 2022

@MaxRenaud: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-e2e-gce-storage-snapshot | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-gce-storage-snapshot |
| pull-kubernetes-e2e-gce-storage-slow | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-gce-storage-slow |
| pull-kubernetes-e2e-gce-iscsi-serial | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-gce-iscsi-serial |
| pull-kubernetes-e2e-gce-iscsi | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-gce-iscsi |
| pull-kubernetes-e2e-gce-csi-serial | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-gce-csi-serial |
| pull-kubernetes-e2e-aks-engine-windows-containerd | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-aks-engine-windows-containerd |
| pull-publishing-bot-validate | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-publishing-bot-validate |
| check-dependency-stats | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test check-dependency-stats |
| pull-kubernetes-e2e-capz-conformance | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-capz-conformance |
| pull-kubernetes-e2e-capz-azure-file-vmss | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-capz-azure-file-vmss |
| pull-kubernetes-e2e-capz-azure-disk | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-capz-azure-disk |
| pull-kubernetes-e2e-gce-alpha-features | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-gce-alpha-features |
| pull-kubernetes-e2e-gci-gce-ingress | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-gci-gce-ingress |
| pull-kubernetes-e2e-capz-azure-disk-vmss | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-capz-azure-disk-vmss |
| pull-kubernetes-e2e-ubuntu-gce-network-policies | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-ubuntu-gce-network-policies |
| pull-kubernetes-e2e-capz-azure-file | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-e2e-capz-azure-file |
| pull-kubernetes-conformance-kind-ipv6-parallel | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-conformance-kind-ipv6-parallel |
| pull-kubernetes-conformance-image-test | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-conformance-image-test |
| pull-kubernetes-node-kubelet-credential-provider | 93530379e00402bf59604fee3a1582ddad330187 | link | false | /test pull-kubernetes-node-kubelet-credential-provider |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@MaxRenaud
Contributor Author

This is ready to merge. It's been rebased, modified to fit with the conflicts, tested, and all comments were addressed.

The absolute latest this can be merged is "01:00 UTC Saturday 2nd April 2022" in order to make it into 1.24. That translates to 18:00 PDT or 21:00 EDT today.

Member

@andrewsykim andrewsykim left a comment

/lgtm

Thanks @MaxRenaud!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 1, 2022
@k8s-ci-robot k8s-ci-robot merged commit 978d968 into kubernetes:master Apr 1, 2022
SIG Node CI/Test Board automation moved this from Archive-it to Done Apr 1, 2022
SIG Node PR Triage automation moved this from Waiting on Author to Done Apr 1, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/cloudprovider area/code-generation area/conformance Issues or PRs related to kubernetes conformance tests area/dependency Issues or PRs related to dependency changes area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/ipvs area/kubectl area/kubelet area/network-policy Issues or PRs related to Network Policy subproject area/provider/gcp Issues or PRs related to gcp provider area/release-eng Issues or PRs related to the Release Engineering subproject area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. 
sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/structured-logging Categorizes an issue or PR as relevant to WG Structured Logging.