
Kepler stops sending data after 1h #1321

Open
AydinMirMohammadi opened this issue Apr 1, 2024 · 11 comments
Labels
kind/bug report bug issue

Comments

@AydinMirMohammadi

What happened?

I have deployed Kepler in a lab environment and set up the defaults, including the Grafana dashboard. It works as expected.

After one hour, no more data is sent (see the attached Kepler dashboard screenshot).

After restarting the Kepler pod, data is collected again but stops after one hour. This has happened multiple times.

What did you expect to happen?

I expect the data to be sent continuously.

How can we reproduce it (as minimally and precisely as possible)?

I have just used the provided Helm charts.
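
For reference, a default install along the lines of the kepler-helm-chart docs looks roughly like this; the repo URL, release name, and namespace are my assumption rather than taken from this issue:

# Hypothetical reproduce sketch: install Kepler with default chart values.
helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
helm repo update
helm install kepler kepler/kepler --namespace kepler --create-namespace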

Anything else we need to know?

No response

Kepler image tag

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.1", GitCommit:"8f94681cd294aa8cfd3407b8191f6c70214973a4", GitTreeState:"clean", BuildDate:"2023-01-18T15:58:16Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"windows/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.9", GitCommit:"d33c44091f0e760b0053f06023e87a1c99dfd302", GitTreeState:"clean", BuildDate:"2024-01-31T01:58:06Z", GoVersion:"go1.20.12", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider or bare metal

Azure AKS

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Kepler deployment config

For on kubernetes:

$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE} 
# paste output here

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE} 

For standalone:

put your Kepler command argument here

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

AydinMirMohammadi added the kind/bug (report bug issue) label on Apr 1, 2024
@rootfs
Contributor

rootfs commented Apr 2, 2024

@AydinMirMohammadi thanks for reporting! Can you get some info when the dashboard stops getting kepler metrics?

  • Get the past 1m kepler metrics from prometheus to make sure kepler is still collecting metrics
kubectl exec -ti -n monitoring prometheus-k8s-0 -- sh -c 'wget -O- "localhost:9090/api/v1/query?query=kepler_container_joules_total[1m]" -q ' |  jq .data.result[] | jq -r '[.metric.pod_name, .values[0][0], (.values[0][1]|tonumber) ] | @csv'
  • get the kepler log
kubectl logs -n kepler daemonset/kepler-exporter

@AydinMirMohammadi
Author

Thank you for your support.

I restarted the pod at ~17:43 and Kepler started reporting again. It stopped at ~18:30 (see the attached Grafana screenshot).

From Prometheus I still get metrics, but the values don't change. I have attached the log and the query (taken from the browser, not the command line).
Kepler Log.txt
query.json

@rootfs
Contributor

rootfs commented Apr 3, 2024

There is nothing concerning in the log.

The last 1m metrics are as below.

# cat query.json | jq '.data.result[]' | jq -r '[.metric.pod_name, .metric.mode, .values[0][0], (.values[0][1]|tonumber) ] | @csv' |sort -k 4 -g -t"," |tail -5
"eventbus-65dbf87c96-pb27n","idle",1712167266.220,19957
"kernel_processes","idle",1712167266.220,21333
"prometheus-kube-prometheus-stack-prometheus-0","idle",1712167266.220,21333
"calico-node-zlzlv","idle",1712167266.220,31266
"system_processes","idle",1712167266.220,795366

I forgot that the metrics are cumulative; we need two samples at different timestamps to get the delta. Can you try taking the metrics again, preferably with 3 samples, each at least 30 seconds (the default Kepler sample interval) apart?

In addition, please also get the Kepler metrics from the Kepler pod directly, again 3 times, 30 seconds apart; that will help identify whether Kepler is still emitting metrics:

kubectl exec -ti -n kepler daemonset/kepler-exporter -- bash -c "curl  http://localhost:9102/metrics |grep ^kepler_"
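
A minimal collection sketch for the three snapshots, assuming the namespaces and object names used above (the output file names are just placeholders):

# Capture three snapshots, 30 seconds apart, from Prometheus and from the Kepler pod.
for i in 1 2 3; do
  kubectl exec -n monitoring prometheus-k8s-0 -- \
    wget -qO- 'localhost:9090/api/v1/query?query=kepler_container_joules_total' > prometheus-sample-${i}.json
  kubectl exec -n kepler daemonset/kepler-exporter -- \
    bash -c 'curl -s http://localhost:9102/metrics | grep ^kepler_' > kepler-sample-${i}.txt
  sleep 30
done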

@AydinMirMohammadi
Author

Hi,
I just restarted the pod and collected the logs from the browser. After one hour, the reporting in Grafana stops (screenshot attached).

Here are the files:
Kepler+2min.txt
Kepler+25min.txt
Kepler+45min.txt
Kepler+105min.txt
Prometheus+2min.json
Prometheus+25min.json
Prometheus+45min.json
Prometheus+105min.json

Thanks
Aydin

@rootfs
Contributor

rootfs commented Apr 6, 2024

Kepler still emits new metrics:

In the 2min data:

% grep kepler_container_joules Kepler+2min.txt |sort -k 2 -g |tail -5
kepler_container_joules_total{container_id="cad29061d20effdae3861e9ae31dfc912aec40295b831033d26ecc690241c0a8",container_name="prometheus",container_namespace="monitoring",mode="idle",pod_name="prometheus-kube-prometheus-stack-prometheus-0",source=""} 264
kepler_container_joules_total{container_id="6752ceb66a0c652dccba0a985e94f509a08d1cb6ef05fde26ffde04b6a266b0a",container_name="eventbus",container_namespace="app-eshop",mode="idle",pod_name="eventbus-65dbf87c96-dt2lh",source=""} 279
kepler_container_joules_total{container_id="system_processes",container_name="system_processes",container_namespace="system",mode="dynamic",pod_name="system_processes",source=""} 324
kepler_container_joules_total{container_id="c0ded97c33b174f38bef8f238105995058e193068e81f3d4d8f598d8c356a5fc",container_name="calico-node",container_namespace="calico-system",mode="idle",pod_name="calico-node-kqmm6",source=""} 579
kepler_container_joules_total{container_id="system_processes",container_name="system_processes",container_namespace="system",mode="idle",pod_name="system_processes",source=""} 12165

In the 25min data:

% grep kepler_container_joules Kepler+25min.txt |sort -k 2 -g |tail -5
kepler_container_joules_total{container_id="6752ceb66a0c652dccba0a985e94f509a08d1cb6ef05fde26ffde04b6a266b0a",container_name="eventbus",container_namespace="app-eshop",mode="idle",pod_name="eventbus-65dbf87c96-dt2lh",source=""} 3375
kepler_container_joules_total{container_id="cad29061d20effdae3861e9ae31dfc912aec40295b831033d26ecc690241c0a8",container_name="prometheus",container_namespace="monitoring",mode="idle",pod_name="prometheus-kube-prometheus-stack-prometheus-0",source=""} 3458
kepler_container_joules_total{container_id="system_processes",container_name="system_processes",container_namespace="system",mode="dynamic",pod_name="system_processes",source=""} 4015
kepler_container_joules_total{container_id="c0ded97c33b174f38bef8f238105995058e193068e81f3d4d8f598d8c356a5fc",container_name="calico-node",container_namespace="calico-system",mode="idle",pod_name="calico-node-kqmm6",source=""} 5494
kepler_container_joules_total{container_id="system_processes",container_name="system_processes",container_namespace="system",mode="idle",pod_name="system_processes",source=""} 152060

The counters are growing.

On the Prometheus side, the numbers are also growing:

In the 2min data:

% cat Prometheus+2min.json | jq '.data.result[]' | jq -r '[.metric.pod_name, .metric.mode, (.value[1]|tonumber)] |@csv '  |sort -k3 -t"," -g |tail
"postgres-7f889f7d74-d4f6p","idle",23
"calico-kube-controllers-6c788cf6df-k2k64","idle",26
"tigera-operator-78f7d86fc6-jqqj4","idle",28
"kube-prometheus-stack-grafana-558998bbc-tvl5w","idle",31
"kepler-9m7wc","idle",89
"prometheus-kube-prometheus-stack-prometheus-0","idle",118
"eventbus-65dbf87c96-dt2lh","idle",131
"system_processes","dynamic",133
"calico-node-kqmm6","idle",276
"system_processes","idle",4976

And the 25min data:

 % cat Prometheus+25min.json | jq '.data.result[]' | jq -r '[.metric.pod_name, .metric.mode, (.value[1]|tonumber)] |@csv '  |sort -k3 -t"," -g |tail
"kube-prometheus-stack-grafana-558998bbc-tvl5w","idle",705
"kube-proxy-7s5ng","idle",750
"calico-kube-controllers-6c788cf6df-k2k64","idle",1112
"tigera-operator-78f7d86fc6-jqqj4","idle",1213
"kepler-9m7wc","idle",2381
"eventbus-65dbf87c96-dt2lh","idle",3214
"prometheus-kube-prometheus-stack-prometheus-0","idle",3287
"system_processes","dynamic",3827
"calico-node-kqmm6","idle",5246
"system_processes","idle",144942

Can you check the metrics, e.g. sum(rate(kepler_container_joules_total[1m])), on the Prometheus UI? The problem could be somewhere in the Grafana-Prometheus connection (e.g. token expiration).
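
A sketch for running that same query directly against the Prometheus HTTP API, bypassing Grafana entirely (the prometheus-operated service name is an assumption based on a kube-prometheus-stack install; adjust to yours):

# Port-forward Prometheus and run the rate query the dashboard uses.
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(kepler_container_joules_total[1m]))' | jq .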

@AydinMirMohammadi
Author

AydinMirMohammadi commented Apr 7, 2024

Sorry for the missing explanation. The 2min, 25min, and 45min samples are good. Reporting then stopped, and the 105min sample is the bad one.

And I see I have forgotten the 110min sample. I will collect new files.

@marceloamaral
Collaborator

@AydinMirMohammadi could you please get the Kepler logs when the problem happens?

@AydinMirMohammadi
Author

Thank you for the support, and sorry for the delay.

(Grafana dashboard screenshot attached)

The Kepler log has not changed since startup.
kepler-log-after-stop-reporting.txt

I have attached the metrics (Kepler and Prometheus) from the start and from after the problem began.
kepler-metrics-at-start.txt
kepler-metrics-after-stop-reporting.txt
kepler-metrics-after-stop-reporting+5min.txt

prometheus-metrics-at-start.json
prometheus-metrics-after-stop-reporting.json
prometheus-metrics-after-stop-reporting+5min.json

@rootfs
Contributor

rootfs commented Apr 12, 2024

@AydinMirMohammadi thanks for sharing the info. I think there is something odd happening on the dashboard side, potentially an expired token used by Grafana that leaves Grafana unable to get the latest Prometheus metrics.

The Prometheus query data shows the counters are still growing after the dashboard stops (compare pod prometheus-kube-prometheus-stack-prometheus-0 in the stop and stop+5min snapshots):

% cat prometheus-metrics-after-stop-reporting.json | jq '.data.result[]' | jq -r '[.metric.pod_name, .metric.mode, (.value[1]|tonumber)] |@csv '  |sort -k3 -t"," -g |tail
"kube-proxy-ldcf7","idle",2505
"kube-prometheus-stack-grafana-558998bbc-66gqx","idle",2877
"tigera-operator-6997cbcb7c-vh5sj","idle",3054
"calico-kube-controllers-6c788cf6df-wb242","idle",3712
"kepler-lr4m7","idle",7224
"calico-node-4hsgd","idle",7551
"system_processes","dynamic",9485
"eventbus-65dbf87c96-6mzv2","idle",10291
"prometheus-kube-prometheus-stack-prometheus-0","idle",10962
"system_processes","idle",417893
% cat prometheus-metrics-after-stop-reporting+5min.json  | jq '.data.result[]' | jq -r '[.metric.pod_name, .metric.mode, (.value[1]|tonumber)] |@csv '  |sort -k3 -t"," -g |tail
"kube-proxy-ldcf7","idle",2689
"kube-prometheus-stack-grafana-558998bbc-66gqx","idle",3065
"tigera-operator-6997cbcb7c-vh5sj","idle",3311
"calico-kube-controllers-6c788cf6df-wb242","idle",3974
"kepler-lr4m7","idle",7660
"calico-node-4hsgd","idle",8964
"system_processes","dynamic",9485
"eventbus-65dbf87c96-6mzv2","idle",11047
"prometheus-kube-prometheus-stack-prometheus-0","idle",11717
"system_processes","idle",446857

@rootfs
Contributor

rootfs commented Apr 12, 2024

@sthaha can you take a look? thanks

@AydinMirMohammadi
Author

I will also check this on another system. A colleague of mine has deployed Kepler on an on-prem cluster and sees the same behavior.
