Kepler stops sending data after 1h #1321
@AydinMirMohammadi thanks for reporting! Can you get some info when the dashboard stops getting kepler metrics?
kubectl exec -ti -n monitoring prometheus-k8s-0 -- sh -c 'wget -O- "localhost:9090/api/v1/query?query=kepler_container_joules_total[1m]" -q ' | jq .data.result[] | jq -r '[.metric.pod_name, .values[0][0], (.values[0][1]|tonumber) ] | @csv'
kubectl logs -n kepler daemonset/kepler-exporter
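If nothing obvious shows up in the full log, a quick error scan may help. This is just a sketch assuming the kepler namespace and daemonset name above; the time window and grep pattern are illustrative only:
kubectl logs -n kepler daemonset/kepler-exporter --since=1h | grep -iE "error|fail|panic"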
Thank you for your support. I restarted the pod at ~17:43 and Kepler started reporting again. It stopped at ~18:30. From Prometheus I still get metrics, but the values don't change. I have attached the log and the query output (taken from the browser, not the command line).
There is nothing concerning in the log. The last 1m metrics are as below.
# cat query.json | jq '.data.result[]' | jq -r '[.metric.pod_name, .metric.mode, .values[0][0], (.values[0][1]|tonumber) ] | @csv' |sort -k 4 -g -t"," |tail -5
"eventbus-65dbf87c96-pb27n","idle",1712167266.220,19957
"kernel_processes","idle",1712167266.220,21333
"prometheus-kube-prometheus-stack-prometheus-0","idle",1712167266.220,21333
"calico-node-zlzlv","idle",1712167266.220,31266
"system_processes","idle",1712167266.220,795366 I forgot that the metrics are cumulative, we need two samples at different timestamps to get the delta. Can you try taking the metrics again, preferrably with 3 samples and each with at least 30seconds (default kepler sample interval) apart? In addition, please also get the kepler metrics from kepler pod directly, also 3 times with 30seconds apart, that'll help identify if kepler is still emitting metrics: kubectl exec -ti -n kepler daemonset/kepler-exporter -- bash -c "curl http://localhost:9102/metrics |grep ^kepler_" |
Hi, here are the files. Thanks
Kepler still emits new metrics. In the 2min data:
% grep kepler_container_joules Kepler+2min.txt |sort -k 2 -g |tail -5
kepler_container_joules_total{container_id="cad29061d20effdae3861e9ae31dfc912aec40295b831033d26ecc690241c0a8",container_name="prometheus",container_namespace="monitoring",mode="idle",pod_name="prometheus-kube-prometheus-stack-prometheus-0",source=""} 264
kepler_container_joules_total{container_id="6752ceb66a0c652dccba0a985e94f509a08d1cb6ef05fde26ffde04b6a266b0a",container_name="eventbus",container_namespace="app-eshop",mode="idle",pod_name="eventbus-65dbf87c96-dt2lh",source=""} 279
kepler_container_joules_total{container_id="system_processes",container_name="system_processes",container_namespace="system",mode="dynamic",pod_name="system_processes",source=""} 324
kepler_container_joules_total{container_id="c0ded97c33b174f38bef8f238105995058e193068e81f3d4d8f598d8c356a5fc",container_name="calico-node",container_namespace="calico-system",mode="idle",pod_name="calico-node-kqmm6",source=""} 579
kepler_container_joules_total{container_id="system_processes",container_name="system_processes",container_namespace="system",mode="idle",pod_name="system_processes",source=""} 12165

In the 25min data:
The counters are growing. On the Prometheus side, the numbers are also growing. In the 2min data:
And the 25min data:
Can you check if the metrics e.g.
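Since the joules metrics are cumulative counters, one way to check them is to let Prometheus compute the delta itself, for example with rate() over a short window (the rate of a joules counter is power in watts). The sketch below follows the query style used earlier in this thread; the Prometheus pod name, namespace and 2m window are assumptions and may need adjusting for this cluster:
kubectl exec -ti -n monitoring prometheus-k8s-0 -- sh -c 'wget -O- "localhost:9090/api/v1/query?query=rate(kepler_container_joules_total[2m])" -q' | jq -r '.data.result[] | [.metric.pod_name, .metric.mode, (.value[1]|tonumber)] | @csv'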
Sorry for the missing explanation. The 2min, 25min and 45min files are good. Then it stopped, and the 105min file is the bad one. And I see I forgot the 110min file. I will collect new files.
@AydinMirMohammadi could you please get the Kepler logs when the problem happens?
Thank you for the support, and sorry for the delay. The Kepler log has not changed since start. I have attached the metrics (Kepler and Prometheus) from after the start and from after the problem began. prometheus-metrics-at-start.json
@AydinMirMohammadi thanks for sharing the info. I think something odd is happening on the dashboard side, potentially an expired token used by Grafana that leaves it unable to fetch the latest Prometheus metrics. The Prometheus query data show the counters are still growing after the dashboard stopped updating (checking the per-pod values):
% cat prometheus-metrics-after-stop-reporting.json | jq '.data.result[]' | jq -r '[.metric.pod_name, .metric.mode, (.value[1]|tonumber)] |@csv ' |sort -k3 -t"," -g |tail
"kube-proxy-ldcf7","idle",2505
"kube-prometheus-stack-grafana-558998bbc-66gqx","idle",2877
"tigera-operator-6997cbcb7c-vh5sj","idle",3054
"calico-kube-controllers-6c788cf6df-wb242","idle",3712
"kepler-lr4m7","idle",7224
"calico-node-4hsgd","idle",7551
"system_processes","dynamic",9485
"eventbus-65dbf87c96-6mzv2","idle",10291
"prometheus-kube-prometheus-stack-prometheus-0","idle",10962
"system_processes","idle",417893
% cat prometheus-metrics-after-stop-reporting+5min.json | jq '.data.result[]' | jq -r '[.metric.pod_name, .metric.mode, (.value[1]|tonumber)] |@csv ' |sort -k3 -t"," -g |tail
"kube-proxy-ldcf7","idle",2689
"kube-prometheus-stack-grafana-558998bbc-66gqx","idle",3065
"tigera-operator-6997cbcb7c-vh5sj","idle",3311
"calico-kube-controllers-6c788cf6df-wb242","idle",3974
"kepler-lr4m7","idle",7660
"calico-node-4hsgd","idle",8964
"system_processes","dynamic",9485
"eventbus-65dbf87c96-6mzv2","idle",11047
"prometheus-kube-prometheus-stack-prometheus-0","idle",11717
"system_processes","idle",446857 |
@sthaha can you take a look? thanks
I will also check this on another system. A colleague of mine has deployed Kepler on an on-prem cluster and sees the same behavior.
What happened?
I have deployed Kepler in a lab environment and set up the defaults, including the Grafana dashboard. It works as expected.
After one hour, no more data is sent. See the attached image.
After restarting the Kepler pod, data is collected again but also stops after one hour. This has happened multiple times.
What did you expect to happen?
I expect the data to be sent continuously.
How can we reproduce it (as minimally and precisely as possible)?
I just used the provided Helm charts.
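For reference, the install roughly followed the standard Kepler Helm chart flow. The sketch below is from memory; the repo URL, release name and namespace may differ from the current chart docs and from my exact setup:
# add the Kepler Helm chart repo and install with defaults (check the Kepler docs for the current URL and values)
helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
helm repo update
helm install kepler kepler/kepler --namespace kepler --create-namespace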
Anything else we need to know?
No response
Kepler image tag
Kubernetes version
Cloud provider or bare metal
OS version
Install tools
Kepler deployment config
For on kubernetes:
For standalone:
put your Kepler command argument here
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)