
Kepler reports unrealistic measurements for short period #1344

Open · bjornpijnacker opened this issue Apr 11, 2024 · 6 comments
Labels: kind/bug

@bjornpijnacker

### What happened?

Kepler shows some measurements in the dashboard that seem wrong; see the screenshot. Reported usage is ~9 kW, which cannot be correct, as this cluster consists of two PCs with a 35 W power supply each.

[Screenshot: Kepler Grafana dashboard showing the ~9 kW spike]

### What did you expect to happen?

The measurements to be correct.

### How can we reproduce it (as minimally and precisely as possible)?

Unknown. I was not doing anything special with the cluster when this happened; in fact, I was asleep. It has happened twice so far; the other time, at ~5 kW, was a few days earlier. No Kepler logs exist from that time.

### Anything else we need to know?

No response

### Kepler image tag

quay.io/sustainable_computing_io/kepler:release-0.7.8

### Kubernetes version

<details>

```console
$ kubectl version --output=yaml
clientVersion:
  buildDate: "2023-08-24T11:23:10Z"
  compiler: gc
  gitCommit: 8dc49c4b984b897d423aab4971090e1879eb4f23
  gitTreeState: clean
  gitVersion: v1.28.1
  goVersion: go1.20.7
  major: "1"
  minor: "28"
  platform: linux/amd64
kustomizeVersion: v5.0.4-0.20230601165947-6ce0bf390ce3
serverVersion:
  buildDate: "2024-01-17T13:38:41Z"
  compiler: gc
  gitCommit: 0fa26aea1d5c21516b0d96fea95a77d8d429912e
  gitTreeState: clean
  gitVersion: v1.27.10
  goVersion: go1.20.13
  major: "1"
  minor: "27"
  platform: linux/amd64
```

</details>

### Cloud provider or bare metal

Bare metal

### OS version

<details>

```console
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

$ uname -a
Linux mycluster-cp1 5.15.0-97-generic #107-Ubuntu SMP Wed Feb 7 13:26:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
```

</details>


### Install tools

<details>
Used RKE1 to install cluster
</details>


### Kepler deployment config

<details>

For on kubernetes:

```console
$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# doesn't exist

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
# doesn't exist; kepler is a DaemonSet
```

</details>

### Container runtime (CRI) and version (if applicable)

Docker version 25.0.3

### Related plugins (CNI, CSI, ...) and versions (if applicable)

CNI: rancher/flannel-cni:v0.3.0-rancher8

@rootfs (Contributor) commented Apr 11, 2024

There are similar issues reported elsewhere. We have not been able to reproduce it yet.

For debugging, can you get `sum(kepler_container_joules_total)` from Prometheus during the spike window? That will help us determine whether this comes from the Kepler metrics themselves or from the calculation used in the Grafana dashboard.
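
For reference, the raw counter over the spike window can be pulled with a range query against the Prometheus HTTP API, roughly like this (a sketch; the host, port, and time bounds are placeholders for your environment):

```console
# Query the raw Kepler counter over the spike window via the Prometheus HTTP API.
# Host, port, and timestamps are placeholders; adjust them to your setup.
$ curl -sG 'http://<prometheus-host>:9090/api/v1/query_range' \
    --data-urlencode 'query=sum(kepler_container_joules_total)' \
    --data-urlencode 'start=<spike-start, RFC3339 or unix timestamp>' \
    --data-urlencode 'end=<spike-end, RFC3339 or unix timestamp>' \
    --data-urlencode 'step=60s'
```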

@bjornpijnacker (Author)

That gives me ~22.3 million summing over the half hour of the spike. Another spike has happened since, with a sum of ~37.7 million. Each of the three spikes seems to last almost exactly half an hour.
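
For scale, if those sums are counter increases in joules over the ~30-minute spike window, the implied average power works out to roughly 12–21 kW, the same order of magnitude as the dashboard spikes:

```console
$ python3 -c "print(22.3e6 / 1800, 37.7e6 / 1800)"   # joules / seconds = average watts
12388.888888888889 20944.444444444445
```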

Hope this helps, if you need more info do let me know!

@rootfs (Contributor) commented Apr 12, 2024

Thanks @bjornpijnacker. The two potential issues are:

- Kepler metrics overflow. We have seen RAPL overflow before, but fixes have been in place for a while.
- Calculation-led overflow in Grafana. The Kepler metrics are fed into rate() or irate() in the dashboard, and that calculation caused the overflow. This could happen if there are mismatched data types or timestamps.

We have to narrow down the scenarios. To rule out the first case, it is best to also check the Prometheus graph to see whether the raw Kepler metric `sum(kepler_container_joules_total)` has any spike.
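
Concretely, graphing these two expressions in the Prometheus UI over the spike window should separate the two cases (a sketch; the exact range window and function used by the dashboard may differ):

```
# Raw cumulative counter: an abnormally steep jump here during the spike
# points at the exporter / RAPL readings themselves.
sum(kepler_container_joules_total)

# Dashboard-style rate: a spike here without a matching jump in the raw counter
# points at the rate()/irate() calculation (e.g. the timestamp or data-type
# mismatches mentioned above).
sum(irate(kepler_container_joules_total[1m]))
```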

@bjornpijnacker (Author)

[Screenshot 2024-04-14 12:18:02: default dashboard graph showing the spikes]

This is one of the dashboard graphs where the spikes are evident; it shows the last 7 days in the default dashboard. Below are `sum(kepler_container_joules_total)` and `sum(rate(kepler_container_joules_total[1m]))` over the last 7 days, respectively.

[Screenshot 2024-04-14 12:18:42: sum(kepler_container_joules_total), last 7 days]

[Screenshot 2024-04-14 12:19:05: sum(rate(kepler_container_joules_total[1m])), last 7 days]

@geurjas commented May 14, 2024

Unfortunately, we are seeing the same issue on our installation (bare metal).
Kepler version: release-0.7.8

[Screenshot 2024-05-14 14:07:13]

[Screenshot 2024-05-14 14:20:06]

@geurjas commented Jun 7, 2024

After we downgraded to Kepler 0.7.2, the reported values are stable again.
See also #1279.
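
For anyone who wants to try the same workaround, pinning the image back should look roughly like the following (a sketch; the DaemonSet and container names assume the default kepler-exporter manifests in the kepler namespace, so adjust them for your deployment):

```console
# Hypothetical example: pin the Kepler DaemonSet image back to release-0.7.2.
# DaemonSet/container names are assumptions based on the default manifests.
$ kubectl -n kepler set image daemonset/kepler-exporter \
    kepler-exporter=quay.io/sustainable_computing_io/kepler:release-0.7.2
$ kubectl -n kepler rollout status daemonset/kepler-exporter
```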
