Prometheus failing to reload Probe TLS cert and key from disk #598

Open
lpetrazickisupgrade opened this issue Mar 6, 2024 · 2 comments

lpetrazickisupgrade commented Mar 6, 2024

I'm running Prometheus Operator 0.71.2 with Prometheus 2.49.1 on EKS.

I have metrics endpoints protected by a TLS client cert and key. Teleport Tbot rotates the cert and key every n hours and writes them to a secret. A Probe resource refers to that secret. Prometheus Operator loads the Probe into a Prometheus instance and rewrites the secret for that instance. Prometheus uses the rewritten secret to access the endpoint.

What I'm seeing is that:

  1. Prometheus fails to reload the cert and key after a cert rotation and hits a 403 Forbidden, either for a couple of hours or indefinitely
  2. Triggering a config reload does not reload the cert and key
  3. Sending a SIGHUP to the Prometheus process does not reload the cert and key
  4. Sending a SIGTERM to the Prometheus process does reload the cert and key by restarting that pod

While the issue is occurring, the secrets on the Prometheus pod filesystem look up to date.

Probe definition:

apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: probe-foo
  namespace: monitoring
spec:
  interval: 30s
  jobName: probe-foo
  prober:
    path: /metrics
    scheme: https
    url: foo-exporter.access-proxy.example.com
  scrapeTimeout: 20s
  targets:
    staticConfig:
      static:
      - foo:443
  tlsConfig:
    cert:
      secret:
        key: tlscert
        name: tbot-prometheus-foo
    insecureSkipVerify: true
    keySecret:
      key: key
      name: tbot-prometheus-foo

Generated config:

- job_name: probe/monitoring/probe-foo
  honor_timestamps: true
  track_timestamps_staleness: false
  scrape_interval: 30s
  scrape_timeout: 20s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /metrics
  scheme: https
  enable_compression: true
  tls_config:
    cert_file: /etc/prometheus/certs/secret_monitoring_tbot-prometheus-foo_tlscert
    key_file: /etc/prometheus/certs/secret_monitoring_tbot-prometheus-foo_key
    insecure_skip_verify: true
  follow_redirects: true
  enable_http2: true
  relabel_configs:
  - source_labels: [job]
    separator: ;
    regex: (.*)
    target_label: __tmp_prometheus_job_name
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: job
    replacement: office-metrics-foo
    action: replace
  - source_labels: [__address__]
    separator: ;
    regex: (.*)
    target_label: __param_target
    replacement: $1
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - separator: ;
    regex: (.*)
    target_label: __address__
    replacement: foo-exporter.access-proxy.example.com
    action: replace
  - source_labels: [__param_target]
    separator: ;
    regex: (.*)
    modulus: 1
    target_label: __tmp_hash
    replacement: $1
    action: hashmod
  - source_labels: [__tmp_hash]
    separator: ;
    regex: "0"
    replacement: $1
    action: keep
  static_configs:
  - targets:
    - foo:443
    labels:
      namespace: monitoring

This sounds similar to #345, but it is still happening today.

lpetrazickisupgrade commented Mar 7, 2024

What seems to be happening is that Prometheus loads updated certs into new connections but not existing connections:
https://github.com/prometheus/common/blob/v0.50.0/config/http_config.go#L979

Connections are set to remain open unless they are idle for 5 minutes. As long as the scrape interval is significantly shorter than 5 minutes, they remain open indefinitely:
https://github.com/prometheus/common/blob/main/config/http_config.go#L54

One possible enhancement could be for Prometheus to flush any connection that hits a 403 error, so that the next scrape opens a new connection and picks up the rotated cert and key.

@filippog

We are seeing the same thing: certs from the k8s tls_config are not picked up by existing connections, so Prometheus eventually ends up using expired certificates on those connections.

+1 to flush connections on 403 and/or on cert reload
