kubelet /stats/summary returns Zero from CPU usageNanoCores stats when more than one caller #122092

Open
jsturtevant opened this issue Nov 28, 2023 · 12 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/windows Categorizes an issue or PR as relevant to SIG Windows.

Comments

@jsturtevant
Contributor

jsturtevant commented Nov 28, 2023

This started as a flaky test but was identified as related to changes in the way usageNanoCores is calculated and used by the kubelet. See #122092 (comment)

Which jobs are flaking?

ci-kubernetes-e2e-capz-master-windows-serial-slow

Which tests are flaking?

[sig-windows] [Feature:Windows] Cpu Resources [Serial] Container limits should not be exceeded after waiting 2 minutes flaking

Since when has it been flaking?

~ November 1st

Testgrid link

https://testgrid.k8s.io/sig-windows-signal#capz-windows-master-serial-slow

Reason for failure (if possible)

STEP: Ensuring pods are still running - test/e2e/windows/cpu_limits.go:56 @ 11/27/23 02:07:27.458
STEP: Ensuring cpu doesn't exceed limit by >5% - test/e2e/windows/cpu_limits.go:76 @ 11/27/23 02:07:27.935
STEP: Gathering node summary stats - test/e2e/windows/cpu_limits.go:78 @ 11/27/23 02:07:27.935
Nov 27 02:07:28.181: INFO: Pod cpulimittest-eb76c25b-1fa9-41e7-a2c0-446bf6db30fb usage: 0
[FAILED] Pod cpu-resources-test-windows-9879/cpulimittest-eb76c25b-1fa9-41e7-a2c0-446bf6db30fb reported usage is 0, but it should be greater than 0
In [It] at: test/e2e/windows/cpu_limits.go:96 @ 11/27/23 02:07:28.182

Anything else we need to know?

example failure: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-windows-serial-slow/1728940368692514816

Relevant SIG(s)

/sig windows

@jsturtevant jsturtevant added the kind/flake Categorizes issue or PR as related to a flaky test. label Nov 28, 2023
@k8s-ci-robot k8s-ci-robot added sig/windows Categorizes an issue or PR as relevant to SIG Windows. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 28, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jsturtevant
Contributor Author

After some debugging in containerd (containerd/containerd#9531) we figured out that this is caused by the fact that there was previously only one caller in the kubelet that calculated and stored UsageNanoCores, per the doc comment on ListPodStatsAndUpdateCPUNanoCoreUsage:

// ListPodStatsAndUpdateCPUNanoCoreUsage updates the cpu nano core usage for
// [...] The implementation assumes a
// single caller to periodically invoke this function to update the metrics. If
// there exist multiple callers, the period used to compute the cpu usage may
// vary and the usage could be incoherent (e.g., spiky). If no caller calls
// this function, the cpu usage will stay nil. Right now, eviction manager is
// the only caller, and it calls this function every 10s.

Even though we still have this flag on the kubelet side:

if updateCPUNanoCoreUsage {
    usageNanoCores = p.getAndUpdateContainerUsageNanoCores(stats)
} else {
    usageNanoCores = p.getContainerUsageNanoCores(stats)
}

it is no longer honored the way it was intended: containerd now calculates the value and caches the latest result on every call. This means that every time container stats are requested, whether by the eviction manager or by a tool hitting the summary endpoint directly, the value is recalculated and stored.
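To illustrate why a second caller can observe zero, here is a minimal, hypothetical Go sketch of the kind of rate calculation involved (the names and numbers are illustrative, not containerd's actual code): the rate is the delta in cumulative CPU nanoseconds divided by the sampling window, so if every call refreshes the cached sample, a caller that arrives a few milliseconds after another one computes its rate over a near-empty window and gets 0.

package main

import (
    "fmt"
    "time"
)

// sample is one reading of the cumulative CPU time reported by the runtime.
type sample struct {
    cumulativeNanos uint64    // total CPU time consumed so far, in nanoseconds
    timestamp       time.Time // when the reading was taken
}

// usageNanoCores derives an instantaneous rate from two cumulative samples:
// nanoseconds of CPU consumed per second of wall clock, i.e. "nano cores".
func usageNanoCores(prev, cur sample) uint64 {
    window := cur.timestamp.Sub(prev.timestamp).Nanoseconds()
    if window <= 0 {
        return 0
    }
    deltaUsage := cur.cumulativeNanos - prev.cumulativeNanos
    return uint64(float64(deltaUsage) / float64(window) * 1e9)
}

func main() {
    base := sample{cumulativeNanos: 5_000_000_000, timestamp: time.Now()}

    // Caller A samples 10s after base: the window is wide enough to show real usage.
    a := sample{
        cumulativeNanos: base.cumulativeNanos + 2_000_000_000, // 2s of CPU consumed
        timestamp:       base.timestamp.Add(10 * time.Second),
    }
    fmt.Println(usageNanoCores(base, a)) // ~200000000, i.e. 0.2 cores

    // Caller B samples 5ms after A. If A's call also overwrote the cached
    // "previous" sample, B computes its rate over a tiny window in which the
    // cumulative counter has not advanced, so the result collapses to 0.
    b := sample{
        cumulativeNanos: a.cumulativeNanos,
        timestamp:       a.timestamp.Add(5 * time.Millisecond),
    }
    fmt.Println(usageNanoCores(a, b)) // 0
}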

At first glance this looks like a containerd issue, but containerd doesn't have the information that the kubelet has (the flag from the eviction manager that says to update the value consistently every 10 seconds).

A few possibilities to solve this:

  1. containerd implements a background process that periodically updates the value every 10 seconds, and the endpoint always returns the cached value
  2. add a new CRI field, e.g. an updateCache bool, that can be passed
  3. kubelet only calls the CRI endpoint when the "update" flag is set, caches the result at the kubelet layer, and serves other calls to the summary endpoint from that cache

This is a problem for Windows today but not for Linux, since on Linux the value is still backfilled by cAdvisor. Once we move to CRI-only stats, it would likely become an issue for Linux as well.

/cc @knabben @bobbypage @haircommander

@jsturtevant jsturtevant changed the title [sig-windows] [Feature:Windows] Cpu Resources [Serial] Container limits should not be exceeded after waiting 2 minutes flaking kubelet /stats/summary returns Zero from CPU usageNanoCores stats when more than one caller Feb 13, 2024
@knabben
Member

knabben commented Feb 14, 2024

The downside of option 1 is a 10s delay in usageNanoCores updates, but on the other hand it removes the caller's responsibility for managing an internal containerd cache, and the resulting behavior is similar to the current situation.

I like this approach of the container stats gRPC endpoint being a read-only view of the cache.

@haircommander
Contributor

haircommander commented Feb 14, 2024

A fourth option is that the kubelet could do the calculation of usage nano cores instead of the CRI. We could deprecate and (eventually) drop the field so they're not both calculating it.

Yet another option: we could hardcode the update period in the CRI, similar to how it's done in the kubelet.

@jsturtevant
Contributor Author

I don't know that I have a strong opinion here. Who else should we loop in to make this decision? Is this something that should be brought up at a SIG Node meeting?

@haircommander
Contributor

yeah let's chat about it there

@jsturtevant
Contributor Author

I've added to the sig-node agenda for today: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg

@dchen1107
Member

@jsturtevant thanks for bringing this to today's SIG Node meeting. @haircommander, can you summarize what you shared at today's meeting here, including the pros and cons of each option? We will follow up on it. Thanks!

cc/ @logicalhan

@haircommander
Contributor

we currently have something like 4 proposals:

  • the CRI implementation runs a background process that periodically updates the value every 10 seconds, and the endpoint always returns the cached value
  • add a new CRI field, e.g. an updateCache bool, that can be passed
  • kubelet only calls the CRI endpoint when the "update" flag is set, caches at the kubelet layer, and serves other calls to the summary endpoint from that cache
  • kubelet does the calculation of usage nano cores instead of the CRI

A point that came up in SIG Node: since the CRI spec allows a CRI implementation to specify a timestamp for when the CPU metrics were collected, it would be quite tricky for the kubelet to have any control over usageNanoCores. Point 4 couldn't work for the same reason it doesn't work today: the kubelet can't guarantee the period is long enough to produce a non-zero usageNanoCores.

I think the simplest solution is to make the collection period a hardcoded or configurable option in the CRI. That would mean the period is regular and likely long enough to yield meaningful data about usageNanoCores.

Note: I think doing point 3 could work as well, in tandem, to let the kubelet normalize its requests to the CRI so that it doesn't request the data more frequently than it is being generated. If the CRI collects and saves the stats every 10s and the kubelet requests every 5s, half of the kubelet's collections are redundant. It would be up to the CRI or the admin to choose a period that matches the kubelet's collection period.
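(For illustration only: a minimal, hypothetical Go sketch of what point 3's kubelet-layer cache might look like, assuming only the update-flagged path talks to the CRI. The names are invented and this is not proposed kubelet code.)

package main

import (
    "fmt"
    "sync"
)

// usageCache sketches a kubelet-side cache: usageNanoCores is refreshed from
// the CRI only on the path that sets the "update" flag (today, the eviction
// manager's periodic call); every other summary request reads the cache.
type usageCache struct {
    mu     sync.RWMutex
    cached map[string]uint64 // container ID -> last computed usageNanoCores
}

func newUsageCache() *usageCache {
    return &usageCache{cached: make(map[string]uint64)}
}

// get returns the usage for a container, invoking fetchFromCRI (a stand-in
// for the real CRI stats request) only when update is set.
func (u *usageCache) get(containerID string, update bool, fetchFromCRI func(string) uint64) uint64 {
    if update {
        v := fetchFromCRI(containerID)
        u.mu.Lock()
        u.cached[containerID] = v
        u.mu.Unlock()
        return v
    }
    u.mu.RLock()
    defer u.mu.RUnlock()
    return u.cached[containerID]
}

func main() {
    cache := newUsageCache()
    fetch := func(id string) uint64 { return 150_000_000 } // pretend the CRI reported 0.15 cores

    fmt.Println(cache.get("c1", true, fetch)) // eviction-manager path: refresh from the CRI
    fmt.Println(cache.get("c1", false, nil))  // summary-endpoint path: served from the cache
}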

@dashpole
Contributor

The calculation of a rate within the kubelet/cAdvisor was always somewhat imprecise. Since it is already part of the CRI API, I would just always use what comes from the CRI and let the container runtime decide the optimal caching strategy.

@jsturtevant
Contributor Author

Thanks for the discussion. It sounds like we should update containerd to collect this periodically and return the cached value. For now, we can hard-code it to 10s.
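(A rough, hypothetical Go sketch of what that could look like inside the runtime, assuming a fixed 10s sampling period. The names are invented and this is not containerd's actual implementation: a single background sampler owns the only write path to the cached value, and every stats call, whoever makes it, only reads.)

package main

import (
    "sync"
    "time"
)

// cachedUsage is what the stats endpoint would hand back on every call:
// reads never trigger a recalculation.
type cachedUsage struct {
    mu             sync.RWMutex
    usageNanoCores uint64
}

func (c *cachedUsage) get() uint64 {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.usageNanoCores
}

func (c *cachedUsage) set(v uint64) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.usageNanoCores = v
}

// startSampler recomputes the rate on a fixed period using a caller-supplied
// reader of the cumulative CPU counter, and is the only writer to the cache.
func startSampler(c *cachedUsage, readCumulativeNanos func() uint64, period time.Duration, stop <-chan struct{}) {
    go func() {
        prev := readCumulativeNanos()
        prevTime := time.Now()
        ticker := time.NewTicker(period)
        defer ticker.Stop()
        for {
            select {
            case <-stop:
                return
            case now := <-ticker.C:
                cur := readCumulativeNanos()
                if window := now.Sub(prevTime).Nanoseconds(); window > 0 {
                    c.set(uint64(float64(cur-prev) / float64(window) * 1e9))
                }
                prev, prevTime = cur, now
            }
        }
    }()
}

func main() {
    var cache cachedUsage
    stop := make(chan struct{})
    defer close(stop)

    // fakeCounter stands in for the platform's cumulative CPU reading
    // (e.g. what the Windows stats provider would report); purely illustrative.
    fakeCounter := func() uint64 { return uint64(time.Now().UnixNano() / 10) }

    startSampler(&cache, fakeCounter, 10*time.Second, stop)

    // Every stats request, regardless of caller, returns the cached value.
    _ = cache.get()
}

With this shape, the number of concurrent readers no longer affects the sampling window, so the zero/spiky values go away regardless of who scrapes the summary endpoint.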

@jsturtevant
Contributor Author

I've opened containerd/containerd#10010 with option 1.

There were a few questions in containerd/containerd#10010 (comment).
