Pod CPU usage check from Windows node summary #122196
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Please note that we're already in Test Freeze. Fast forwards are scheduled to happen every 6 hours; the most recent run was: Tue Dec 5 22:11:44 UTC 2023.
/test
@knabben: The following commands are available to trigger optional jobs:
In response to this:
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: knabben The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test pull-kubernetes-verify
/test pull-kubernetes-e2e-capz-windows-master
/triage accepted
The upstream test is using containerd 1.7. Maybe that is the difference on vSphere?
Seems strange that we get 0 after the pods run for 2 mins. Any ideas where it might be failing? Is the get-summary call failing, so we end up with 0, or is the actual reported value for the pod zero?
Yes, I saw this behavior with the pod running in the first <10 seconds, and not after the sleep. The pod was still
Force-pushed from 820318c to 32e7b79
If this is the case, we might be able to wrap that call in a retry instead of waiting an additional 2 mins?
Force-pushed from 32e7b79 to 26d5570
Would you like this reproduced with 1.7? For now I'm only supposing, based on the behavior observed on 1.6.
Yes, not sure where the 2 minutes are coming from. It still needed to wait an initial amount of time for the So the test can be renamed to
For this statement, it is a matter of changing the call to:
I generally agree. This test should start up the pod, and once CPU is being consumed we should verify that it doesn't go over 500. If that is the case, we could do something like Gomega's Eventually and Consistently to achieve that, as described in https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/writing-good-e2e-tests.md#polling-and-timeouts
Force-pushed from dd0dd1d to a7394d2
Makes sense in terms of reliability; changed the usage for both functions. Function renamed
We noticed during sig-windows triage today that we are only seeing this error on 1.7+. Our 1.6 test cluster does not have this flake. @knabben, can you see if the changes work against a 1.7 vSphere cluster, or if you can reproduce the error before these changes?
Running with 1.7.6 on Windows CAPV, these are the errors; they seem flaky with the official
Those look consistent with what we are seeing in capz (#122092). This means it is likely a bug in containerd or in the processing of the stats in the kubelet.
/hold
Force-pushed from a7394d2 to e7f6828
Removing the hold. /unhold
It seems like this is hiding a bug somewhere else in the stack; we should track that down instead of adjusting the test to allow for inconsistency.
We've tracked down the bug to containerd: containerd/containerd#9531
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR. In response to this:
What type of PR is this?
/kind bug
/sig windows
What this PR does / why we need it:
Moves the CPU usage and stats-summary checks into a polling function with a 2-minute timeout (covering the previous behavior of a 2-minute time.Sleep). If a limit excess occurs or CPU usage is 0, two more retries happen before timing out, which should reduce the flakes.
Could not replicate the issue on a vSphere cluster; the kubelet stats seem stable. Need to confirm the containerd version, here it is 1.6.24, to try to replicate again.
Which issue(s) this PR fixes:
Fixes #122092