Huge difference between Kepler power consumption and real PDU power consumption #1349

Open
Tobias-Pe opened this issue Apr 12, 2024 · 3 comments
Labels
kind/bug report bug issue

Tobias-Pe commented Apr 12, 2024

What happened?

Hi,

I would like to use this project for the measurements in my thesis, but currently I am not getting reliable data from Kepler.

Here is my Dashboard:

image

The first row shows my PDU, which measures power in watts. The left panel is split per server, and the right panel shows the sum over all servers.
The mean is around 150 W.

In the stacked chart, Kepler barely reaches 45 W summed over all servers, so more than 100 W of the PDU reading is not accounted for.
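
To put numbers on the gap, here is a rough calculation using the approximate values read off the dashboards (about 150 W mean on the PDU, at most about 45 W in the Kepler stacked chart). The function name is only for illustration, and the inputs are eyeballed from the charts rather than exact measurements.

```python
from typing import Tuple

def coverage(pdu_watts: float, kepler_watts: float) -> Tuple[float, float]:
    """Return (missing watts, fraction of the PDU reading that Kepler accounts for)."""
    missing = pdu_watts - kepler_watts
    fraction = kepler_watts / pdu_watts
    return missing, fraction

# Approximate values read off the dashboards, not exact measurements.
missing, fraction = coverage(pdu_watts=150.0, kepler_watts=45.0)
print(f"missing: {missing:.0f} W, Kepler covers {fraction:.0%} of the PDU reading")
```

With these inputs, roughly 105 W (about 70% of the PDU reading) is unaccounted for, which is far more than I would attribute to fans alone.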

Here is a picture using your dashboard:

image

What did you expect to happen?

I expect the power reported by Kepler to roughly match the power measured by the PDU, within a small error. Overhead that Kepler cannot see, such as fans, is of course excluded.

As a side question, the dashboard uses this query:
sum by (pod_name, container_namespace) (irate(kepler_container_package_joules_total{container_namespace=~"$namespace", pod_name=~"$pod"}[1m]))

Shouldn't the range be [$__rate_interval] instead of [1m]?

The query in my first screenshot is therefore:
sum by(pod_name, container_namespace) (rate(kepler_container_joules_total{container_namespace=~"$namespace", pod_name=~"$pod"}[$__rate_interval]))
I don't need the per-component breakdown.
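
For context on why the range matters: the Kepler metrics are cumulative joule counters, so their per-second rate is a power in watts. The sketch below mimics, in simplified form, what rate() and irate() compute from the samples inside one range window. It ignores counter resets and the extrapolation Prometheus applies at the window edges, and the sample values are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value in joules)

def simple_rate(window: List[Sample]) -> float:
    """Average watts over the window: first vs. last sample (like rate(), simplified)."""
    (t0, j0), (t1, j1) = window[0], window[-1]
    return (j1 - j0) / (t1 - t0)

def simple_irate(window: List[Sample]) -> float:
    """Watts from the last two samples only (like irate(), simplified)."""
    (t0, j0), (t1, j1) = window[-2], window[-1]
    return (j1 - j0) / (t1 - t0)

# Hypothetical samples scraped every 15 s; the last interval contains a burst.
window = [(0.0, 1000.0), (15.0, 1150.0), (30.0, 1300.0), (45.0, 1900.0)]
print(simple_rate(window))   # 20.0 (W, averaged over 45 s)
print(simple_irate(window))  # 40.0 (W, over the last 15 s)
```

Both functions need at least two samples in the window. A fixed [1m] can therefore return nothing when the scrape interval is long, whereas Grafana's $__rate_interval is designed to span several scrape intervals.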

How can we reproduce it (as minimally and precisely as possible)?

Compare the power reading from a smart plug or PDU with the power that Kepler reports for the same machines.
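
One way to script that comparison is to ask Prometheus for both values through its HTTP API instant-query endpoint (/api/v1/query). This is only a sketch: the Prometheus address and the PDU metric name are placeholders for whatever your setup exposes, and the helper names are made up. No request is sent here; the code only builds the URLs and shows how a response body would be read.

```python
from urllib.parse import urlencode

def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

def first_value(response: dict) -> float:
    """Extract the first sample value from an instant-query (vector) response body."""
    return float(response["data"]["result"][0]["value"][1])

# Assumed names: adjust all of these to your environment.
PROMETHEUS = "http://localhost:9090"                       # hypothetical address
KEPLER_QUERY = "sum(rate(kepler_container_joules_total[5m]))"
PDU_QUERY = "sum(pdu_power_watts)"                         # hypothetical PDU metric

print(instant_query_url(PROMETHEUS, KEPLER_QUERY))
print(instant_query_url(PROMETHEUS, PDU_QUERY))

# Shape of a successful response, with a made-up value for illustration.
example = {"status": "success",
           "data": {"resultType": "vector",
                    "result": [{"metric": {}, "value": [1712932688, "45.2"]}]}}
print(first_value(example))
```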

Anything else we need to know?

Logs of the exporter:

libbpf: sec '.relkprobe/finish_task_switch': relo #23: insn #279 against 'task_clock'
libbpf: prog 'kprobe__finish_task_switch': found map 11 (task_clock, sec 13, off 352) for insn #279
libbpf: sec '.relkprobe/finish_task_switch': relo #24: insn #285 against 'processes'
libbpf: prog 'kprobe__finish_task_switch': found map 0 (processes, sec 13, off 0) for insn #285
libbpf: sec '.relkprobe/finish_task_switch': relo #25: insn #309 against 'processes'
libbpf: prog 'kprobe__finish_task_switch': found map 0 (processes, sec 13, off 0) for insn #309
libbpf: sec '.relkprobe/finish_task_switch': relo #26: insn #341 against 'processes'
libbpf: prog 'kprobe__finish_task_switch': found map 0 (processes, sec 13, off 0) for insn #341
libbpf: sec '.reltracepoint/irq/softirq_entry': collecting relocation for section(5) 'tracepoint/irq/softirq_entry'
libbpf: sec '.reltracepoint/irq/softirq_entry': relo #0: insn #6 against 'processes'
libbpf: prog 'kepler_irq_trace': found map 0 (processes, sec 13, off 0) for insn #6
libbpf: sec '.relkprobe/mark_page_accessed': collecting relocation for section(7) 'kprobe/mark_page_accessed'
libbpf: sec '.relkprobe/mark_page_accessed': relo #0: insn #4 against 'processes'
libbpf: prog 'kprobe__mark_page_accessed': found map 0 (processes, sec 13, off 0) for insn #4
libbpf: sec '.relkprobe/set_page_dirty': collecting relocation for section(9) 'kprobe/set_page_dirty'
libbpf: sec '.relkprobe/set_page_dirty': relo #0: insn #4 against 'processes'
libbpf: prog 'kprobe__set_page_dirty': found map 0 (processes, sec 13, off 0) for insn #4
libbpf: loading kernel BTF '/sys/kernel/btf/vmlinux': 0
libbpf: map 'processes': created successfully, fd=9
libbpf: map 'pid_time': created successfully, fd=10
libbpf: map 'cpu_cycles_event_reader': created successfully, fd=11
libbpf: map 'cpu_cycles': created successfully, fd=12
libbpf: map 'cpu_ref_cycles_event_reader': created successfully, fd=13
libbpf: map 'cpu_ref_cycles': created successfully, fd=14
libbpf: map 'cpu_instructions_event_reader': created successfully, fd=15
libbpf: map 'cpu_instructions': created successfully, fd=16
libbpf: map 'cache_miss_event_reader': created successfully, fd=17
libbpf: map 'cache_miss': created successfully, fd=18
libbpf: map 'task_clock_ms_event_reader': created successfully, fd=19
libbpf: map 'task_clock': created successfully, fd=20
libbpf: map 'cpu_freq_array': created successfully, fd=21
libbpf: map 'amd64_ke.data': created successfully, fd=22
libbpf: map 'amd64_ke.bss': created successfully, fd=23
libbpf: sec 'kprobe/finish_task_switch': found 2 CO-RE relocations
libbpf: CO-RE relocating [58] struct pt_regs: found target candidate [174] struct pt_regs in [vmlinux]
libbpf: prog 'kprobe__finish_task_switch': relo #0: <byte_off> [58] struct pt_regs.di (0:14 @ offset 112)
libbpf: prog 'kprobe__finish_task_switch': relo #0: matching candidate #0 <byte_off> [174] struct pt_regs.di (0:14 @ offset 112)
libbpf: prog 'kprobe__finish_task_switch': relo #0: patched insn #15 (LDX/ST/STX) off 112 -> 112
libbpf: CO-RE relocating [62] struct task_struct: found target candidate [130] struct task_struct in [vmlinux]
libbpf: prog 'kprobe__finish_task_switch': relo #1: <byte_off> [62] struct task_struct.tgid (0:86 @ offset 2780)
libbpf: prog 'kprobe__finish_task_switch': relo #1: matching candidate #0 <byte_off> [130] struct task_struct.tgid (0:76 @ offset 2500)
libbpf: prog 'kprobe__finish_task_switch': relo #1: patched insn #16 (ALU/ALU64) imm 2780 -> 2500
libbpf: sec 'tracepoint/irq/softirq_entry': found 1 CO-RE relocations
libbpf: CO-RE relocating [405] struct trace_event_raw_softirq: found target candidate [15895] struct trace_event_raw_softirq in [vmlinux]
libbpf: prog 'kepler_irq_trace': relo #0: <byte_off> [405] struct trace_event_raw_softirq.vec (0:1 @ offset 12)
libbpf: prog 'kepler_irq_trace': relo #0: matching candidate #0 <byte_off> [15895] struct trace_event_raw_softirq.vec (0:1 @ offset 8)
libbpf: prog 'kepler_irq_trace': relo #0: patched insn #3 (LDX/ST/STX) off 12 -> 8
libbpf: prog 'kprobe__finish_task_switch': failed to create kprobe 'finish_task_switch+0x0' perf event: No such file or directory
I0412 14:38:08.075554 1059339 libbpf_attacher.go:128] failed to attach kprobe/finish_task_switch: failed to attach finish_task_switch k(ret)probe to program kprobe__finish_task_switch: no such file or directory. Try finish_task_switch.isra.0
I0412 14:38:08.130177 1059339 libbpf_attacher.go:195] Successfully load eBPF module from libbpf object
I0412 14:38:08.130317 1059339 process_energy.go:114] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I0412 14:38:08.130338 1059339 process_energy.go:115] Process feature names: [cpu_instructions]
I0412 14:38:08.130428 1059339 process_energy.go:124] Using the Ratio/DynPower Power Model to estimate Process Component Power
I0412 14:38:08.130452 1059339 process_energy.go:125] Process feature names: [cpu_instructions cpu_instructions cache_miss   gpu_compute_util]
I0412 14:38:08.130889 1059339 node_platform_energy.go:52] Using the Regressor/AbsPower Power Model to estimate Node Platform Power
I0412 14:38:08.131242 1059339 exporter.go:265] starting to listen on 0.0.0.0:9102
I0412 14:38:08.131272 1059339 exporter.go:271] Started Kepler in 242.579785ms
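
The two things in this log that seem worth a second look are the failed kprobe attachment and which power models were selected. Whether either one explains the gap is not established here. Below is a small sketch for pulling those lines out of a longer exporter log; the function name and output keys are made up for illustration, and the sample lines are copied from the log above.

```python
import re
from typing import Dict, List

def summarize_kepler_log(lines: List[str]) -> Dict[str, List[str]]:
    """Collect power-model selections and failed probe attachments from exporter logs."""
    summary: Dict[str, List[str]] = {"models": [], "failed_probes": []}
    for raw in lines:
        line = raw.strip(" │\n")  # drop terminal border characters and padding
        model = re.search(r"Using the (\S+) Power Model to estimate (.+)$", line)
        if model:
            summary["models"].append(f"{model.group(2)}: {model.group(1)}")
        failed = re.search(r"failed to attach (\S+?):", line)
        if failed:
            summary["failed_probes"].append(failed.group(1))
    return summary

sample = [
    "I0412 14:38:08.075554 1059339 libbpf_attacher.go:128] failed to attach kprobe/finish_task_switch: failed to attach finish_task_switch k(ret)probe to program kprobe__finish_task_switch: no such file or directory. Try finish_task_switch.isra.0",
    "I0412 14:38:08.130317 1059339 process_energy.go:114] Using the Ratio/DynPower Power Model to estimate Process Platform Power",
    "I0412 14:38:08.130889 1059339 node_platform_energy.go:52] Using the Regressor/AbsPower Power Model to estimate Node Platform Power",
]
print(summarize_kepler_log(sample))
```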

Kepler image tag

v.0.7.9

Kubernetes version

$ kubectl version
Client Version: v1.28.7
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.7

Cloud provider or bare metal

bare metal

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux ex-cb02 5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Installed using your guide: https://sustainable-computing.io/installation/kepler/

Kepler deployment config

On Kubernetes:

$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE} 
NAME         DATA   AGE
kepler-cfm   17     4m44s

$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE} 
Error from server (NotFound): deployments.apps "kepler-exporter" not found

Kepler exporter:
Name:             kepler-exporter-598fn
Namespace:        kepler
Priority:         0
Service Account:  kepler-sa
Node:             ex-cb03/10.28.1.233
Start Time:       Fri, 12 Apr 2024 16:37:55 +0200
Labels:           app.kubernetes.io/component=exporter
                  app.kubernetes.io/name=kepler-exporter
                  controller-revision-hash=576f666bb8
                  pod-template-generation=1
                  sustainable-computing.io/app=kepler
Annotations:      cni.projectcalico.org/containerID: 01ef734af4557b1b94643b39328494134af6a27e3beb7ca5e458b3de7b510a48
                  cni.projectcalico.org/podIP: 10.1.194.191/32
                  cni.projectcalico.org/podIPs: 10.1.194.191/32
Status:           Running
IP:               10.1.194.191
IPs:
  IP:           10.1.194.191
Controlled By:  DaemonSet/kepler-exporter
Containers:
  kepler-exporter:
    Container ID:  containerd://3ffd53e2f1c9653726e717c6b702a7f58e89e1ac90fb9678e21ecb57e9cd8c42
    Image:         quay.io/sustainable_computing_io/kepler:latest
    Image ID:      quay.io/sustainable_computing_io/kepler@sha256:c9fda510f6fffcfad473ff5ac6b55521c2cd308a42d1ba318a3172583eaa071a
    Port:          9102/TCP
    Host Port:     0/TCP
    Command:
      /bin/sh
      -c
    Args:
      /usr/bin/kepler -v=1 -redfish-cred-file-path=/etc/redfish/redfish.csv
    State:          Running
      Started:      Fri, 12 Apr 2024 16:38:09 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  400Mi
    Liveness:  http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_IP:     (v1:status.hostIP)
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /etc/kepler/kepler.config from cfm (ro)
      /etc/redfish from redfish (ro)
      /lib/modules from lib-modules (ro)
      /proc from proc (rw)
      /sys from tracing (ro)
      /var/run from var-run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rqfcs (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  Directory
  tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
  var-run:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run
    HostPathType:  Directory
  cfm:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kepler-cfm
    Optional:  false
  redfish:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  redfish-4kh9d7bc7m
    Optional:    false
  kube-api-access-rqfcs:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:                      <none>

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

microk8s is running
high-availability: no
  datastore master nodes: 10.28.1.232:19001
  datastore standby nodes: none
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    ingress              # (core) Ingress controller for external access
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    prometheus           # (core) Prometheus operator for monitoring and logging
  disabled:
    cert-manager         # (core) Cloud native certificate management
    cis-hardening        # (core) Apply CIS K8s hardening
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    gpu                  # (core) Automatic enablement of Nvidia CUDA
    host-access          # (core) Allow Pods connecting to Host services smoothly
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    kube-ovn             # (core) An advanced network fabric for Kubernetes
    mayastor             # (core) OpenEBS MayaStor
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    minio                # (core) MinIO object storage
    observability        # (core) A lightweight observability stack for logs, traces and metrics
    rbac                 # (core) Role-Based Access Control for authorisation
    registry             # (core) Private image registry exposed on localhost:32000
    rook-ceph            # (core) Distributed Ceph storage using Rook
    storage              # (core) Alias to hostpath-storage add-on, deprecated

Tobias-Pe added the kind/bug report bug issue label Apr 12, 2024
Tobias-Pe changed the title from "Huge difference between Kepler Power Consumption and real PDU consumption" to "Huge difference between Kepler power consumption and real PDU power consumption" Apr 12, 2024
@Tobias-Pe
Author

image

@Tobias-Pe
Author

image

@Tobias-Pe
Author

@sunya-ch @rootfs
Sorry to bother you, but I would love to use this tool for the analysis part of my thesis.

Could this be caused by a mistake in my deployment, like the one we discussed in #1306?
