Issue with Node Exporter Evictions and Stuck Metrics Collection #13600

AlmogSo · 2024-02-19T09:25:31Z

AlmogSo
Feb 19, 2024

Hello Prometheus community,

I hope you're doing well. I'm reaching out to seek assistance with a peculiar issue I'm facing in my Kubernetes environment. The problem involves evictions of Node Exporter pods and occasional instances where the metric collection gets stuck, impacting the availability of crucial metrics.

Problem Description:

In my Kubernetes setup, I'm utilizing Prometheus for monitoring, and I've observed the following issues related to Node Exporter pods:

Evictions: Node Exporter pods are frequently being evicted, leading to disruptions in metric collection.
Stuck Metrics Collection: There are instances when Node Exporter pods are not evicted, but metric collection gets stuck, resulting in outdated or missing metrics.
Symptoms:

Evictions of Node Exporter pods, impacting metric availability.
Periods of time where metrics are not being collected, leading to outdated or missing data.

Troubleshooting Steps Taken:

Checked resource requests and limits for Node Exporter pods.
Reviewed Prometheus configurations for any misconfigurations.
Examined logs and events for evicted Node Exporter pods.
Verified node resource usage and autoscaling behavior.
Questions for the Community:

Has anyone encountered similar issues with Node Exporter pods, including frequent evictions and stuck metrics collection?
Are there specific configurations or best practices for Node Exporter deployment in a Kubernetes environment that I might be overlooking?
I would greatly appreciate any insights or recommendations the community can offer. If additional information is needed, please let me know, and I'll provide it promptly.

Thank you for your time and assistance!

Best regards,
TakNud.

Configuration of Prometheus Operator:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  annotations:
    meta.helm.sh/release-name: prometheus-stack
    meta.helm.sh/release-namespace: monitoring
  creationTimestamp: '2023-05-08T08:05:08Z'
  generation: 22
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: prometheus-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 49.2.0
    chart: kube-prometheus-stack-49.2.0
    heritage: Helm
    k8slens-edit-resource-version: v1
    release: prometheus-stack
status:
  availableReplicas: 1
  conditions:
    - lastTransitionTime: '2024-02-18T10:05:01Z'
      observedGeneration: 22
      status: 'True'
      type: Available
    - lastTransitionTime: '2024-02-18T10:05:01Z'
      observedGeneration: 22
      status: 'True'
      type: Reconciled
  paused: false
  replicas: 1
  shardStatuses:
    - availableReplicas: 1
      replicas: 1
      shardID: '0'
      unavailableReplicas: 0
      updatedReplicas: 1
  unavailableReplicas: 0
  updatedReplicas: 1
spec:
  alerting:
    alertmanagers:
      - apiVersion: v2
        name: prometheus-stack-kube-prom-alertmanager
        namespace: monitoring
        pathPrefix: /
        port: http-web
  enableAdminAPI: false
  evaluationInterval: 1m
  externalLabels:
    clustername: Bot
  externalUrl: http://prometheus-stack-kube-prom-prometheus.monitoring:9090
  hostNetwork: false
  image: quay.io/prometheus/prometheus:v2.46.0
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  nodeSelector:
    karpenter.sh/provisioner-name: monitor
  paused: false
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      release: prometheus-stack
  portName: http-web
  priorityClassName: system-cluster-critical
  probeNamespaceSelector: {}
  probeSelector:
    matchLabels:
      release: prometheus-stack
  prometheusExternalLabelName: bot
  remoteWrite:
    - sigv4:
        region: us-east-1
      url: >-
        https://aps-workspaces.us-east-1.amazonaws.com/workspaces/XXXXXXXXXXXX/api/v1/remote_write
  replicas: 1
  resources:
    limits:
      cpu: 1200m
      memory: 14Gi
    requests:
      cpu: 1000m
      memory: 6Gi
  retention: 7d
  retentionSize: 20GB
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      release: prometheus-stack
  scrapeInterval: 40s
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  serviceAccountName: amp-iamproxy-ingest-service-account
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      release: prometheus-stack
  shards: 1
  tsdb:
    outOfOrderTimeWindow: 0s
  version: v2.46.0
  walCompression: true

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with Node Exporter Evictions and Stuck Metrics Collection #13600

{{title}}

Replies: 0 comments

Select a reply

Issue with Node Exporter Evictions and Stuck Metrics Collection #13600

AlmogSo Feb 19, 2024

Replies: 0 comments

AlmogSo
Feb 19, 2024