Blackbox doesn't seem to close the established connection to the SSO auth server #941

Closed
Liliia111 opened this issue Jul 1, 2022 · 5 comments

Comments

@Liliia111

Host operating system: output of uname -a

Linux ip-XX-XXX-XXX-XXX.us-west-2.compute.internal 5.10.106-102.504.amzn2.x86_64 #1 SMP Tue Mar 29 23:15:13 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

blackbox_exporter version: output of blackbox_exporter --version

blackbox_exporter --version
blackbox_exporter, version 0.21.1 (branch: HEAD, revision: e0d5e4453c54b6e508531a761131d7f21007783c)
  build user:       root
  build date:       20220617-12:22:49
  go version:       go1.18.3
  platform:         linux/amd64

What is the blackbox.yml module config.

modules:
  http_2xx:
    prober: http
    timeout: 30s
    http:
      method: GET
  tcp_connect:
    prober: tcp
    timeout: 15s
  tls_connect:
    prober: tcp
    timeout: 15s
    tcp:
      tls: true
  oauth2:
    prober: http
    timeout: 60s
    http:
      oauth2:
        client_id: "my_client_id"
        client_secret: "my_client_secret"
        scopes:
          - "email"
          - "profile"
          - "openid"
        token_url: "https://my_domain.com/token.oauth2"
        endpoint_params:
          grant_type: "password"
          username: "my_username"
          password: "my_password"

What is the prometheus.yml scrape config.

global:
  evaluation_interval: 1m
  external_labels:
    replica: XX.XXX.XXX.XXX
  scrape_interval: 1m
  scrape_timeout: 20s
rule_files:
- "rules"
- "alerts"
scrape_configs:
.
.
- job_name: sso_mydomain
  metrics_path: /probe
  scrape_timeout: 30s
  scrape_interval: 1m
  params:
    module:
      - oauth2
  static_configs:
    - targets:
      - https://prometheus.my_domain.com/
      - https://alertmanager.my_domain.com/
      - https://thanos.my_domain.com/
      - https://prometheus-us-west-2.my_domain.com/
      - https://alertmanager-us-west-2.my_domain.com/
      - https://thanos-us-west-2.my_domain.com/
      - https://prometheus-us-east-1.my_domain.com/
      - https://alertmanager-us-east-1.my_domain.com/
      - https://thanos-us-east-1.mml.my_domain.com/
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115
.
.

What logging output did you get from adding &debug=true to the probe URL?

[root@ip-XX-XXX-XXX-XXX ec2-user]# curl 'http://localhost:9115/probe?module=oauth2&debug=true&target=https://prometheus.my_domain.com/'
^C
[root@ip-XX-XXX-XXX-XXX ec2-user]#

No response from Blackbox service

What did you do that produced an error?

Set LimitNOFILE=1000 in the Blackbox service configuration and restart the service
Scrape 16 oauth2 probe targets per minute and wait 1 hour
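
As a rough back-of-the-envelope check on these steps, the time until the descriptor limit is reached can be estimated. This is only a sketch: it assumes (the report does not confirm this) that every probe leaks exactly one connection and that the process holds about 10 other descriptors. The function name and default values are hypothetical.

```python
def minutes_until_fd_exhaustion(fd_limit, probes_per_minute,
                                leaked_per_probe=1, baseline_fds=10):
    """Estimate how many minutes pass before leaked sockets use up the fd limit.

    leaked_per_probe and baseline_fds are assumptions for illustration,
    not values measured on the affected host.
    """
    leak_rate = probes_per_minute * leaked_per_probe
    if leak_rate <= 0:
        # Nothing leaks, so the limit is never reached.
        return float("inf")
    available = max(fd_limit - baseline_fds, 0)
    return available / leak_rate


# Values from the steps above: LimitNOFILE=1000, 16 probes per minute.
estimate = minutes_until_fd_exhaustion(fd_limit=1000, probes_per_minute=16)
print(f"limit reached after about {estimate:.0f} minutes")
```

Under those assumptions the limit is reached after roughly an hour, which is in line with the one-hour wait in the steps.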

What did you expect to see?

Output like the following, which is what we get right after restarting the Blackbox service

[root@ip-XX-XXX-XXX-XXX ec2-user]# curl 'http://localhost:9115/probe?module=oauth2&debug=true&target=https://prometheus.my_domain.com/'
Logs for the probe:
ts=2022-07-01T10:18:54.098569302Z caller=main.go:351 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Beginning probe" probe=http timeout_seconds=60
ts=2022-07-01T10:18:54.098676009Z caller=http.go:336 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Resolving target address" target=prometheus.my_domain.com ip_protocol=ip6
ts=2022-07-01T10:18:54.111697107Z caller=http.go:336 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Resolved target address" target=prometheus.my_domain.com ip=XX.XXX.XX.XX
ts=2022-07-01T10:18:54.111811482Z caller=client.go:252 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Making HTTP request" url=https://XX.XXX.XX.XX/ host=prometheus.my_domain.com
ts=2022-07-01T10:18:54.558533162Z caller=client.go:498 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Received redirect" location="https://my_domain.com/authorization.oauth2?approval_prompt=force&client_id=my_client_id&redirect_uri=https%3A%2F%2Fprometheus.my_domain.com%2Foauth2%2Fcallback&response_type=code&scope=openid+email+profile&state=PKnvKS6t75NMU8wQRrNDX5kAmOWxhNJadQQm7tjtTY0%3A%2F"
ts=2022-07-01T10:18:54.558646018Z caller=client.go:252 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Making HTTP request" url="https://my_domain.com/authorization.oauth2?approval_prompt=force&client_id=my_client_id&redirect_uri=https%3A%2F%2Fprometheus.my_domain.com%2Foauth2%2Fcallback&response_type=code&scope=openid+email+profile&state=PKnvKS6t75NMU8wQRrNDX5kAmOWxhNJadQQm7tjtTY0%3A%2F" host=
ts=2022-07-01T10:18:54.558675648Z caller=client.go:252 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Address does not match first address, not sending TLS ServerName" first=XX.XXX.XX.XX address=my_domain.com
ts=2022-07-01T10:18:55.228423457Z caller=main.go:144 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Received HTTP response" status_code=200
ts=2022-07-01T10:18:55.228539117Z caller=main.go:144 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Response timings for roundtrip" roundtrip=0 start=2022-07-01T10:18:54.500873437Z dnsDone=2022-07-01T10:18:54.500873437Z connectDone=2022-07-01T10:18:54.514487527Z gotConn=2022-07-01T10:18:54.543274184Z responseStart=2022-07-01T10:18:54.558467253Z tlsStart=2022-07-01T10:18:54.514551374Z tlsDone=2022-07-01T10:18:54.543225678Z end=0001-01-01T00:00:00Z
ts=2022-07-01T10:18:55.228576239Z caller=main.go:144 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Response timings for roundtrip" roundtrip=1 start=2022-07-01T10:18:54.880164122Z dnsDone=2022-07-01T10:18:54.881096846Z connectDone=2022-07-01T10:18:54.955550621Z gotConn=2022-07-01T10:18:55.1050576Z responseStart=2022-07-01T10:18:55.228317479Z tlsStart=2022-07-01T10:18:54.955618587Z tlsDone=2022-07-01T10:18:55.105037353Z end=2022-07-01T10:18:55.228506351Z
ts=2022-07-01T10:18:55.228659807Z caller=main.go:351 module=oauth2 target=https://prometheus.my_domain.com/ level=info msg="Probe succeeded" duration_seconds=1.130064076



Metrics that would have been returned:
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.013047818
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 1.130064076
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 10814
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.088067846
probe_http_duration_seconds{phase="processing"} 0.138452967
probe_http_duration_seconds{phase="resolve"} 0.013980550999999999
probe_http_duration_seconds{phase="tls"} 0.178093063
probe_http_duration_seconds{phase="transfer"} 0.000188865
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 1
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 1
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 10814
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 2.397761283e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_ssl_earliest_cert_expiry Returns earliest SSL cert expiry in unixtime
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.677531509e+09
# HELP probe_ssl_last_chain_expiry_timestamp_seconds Returns last SSL chain expiry in timestamp seconds
# TYPE probe_ssl_last_chain_expiry_timestamp_seconds gauge
probe_ssl_last_chain_expiry_timestamp_seconds 1.677531509e+09
# HELP probe_ssl_last_chain_info Contains SSL leaf certificate information
# TYPE probe_ssl_last_chain_info gauge
probe_ssl_last_chain_info{fingerprint_sha256="25fe13eacb5db65b46a604426a8ef04d8bc9d2b629d8fa4ec280d4ad19c04947"} 1
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
# HELP probe_tls_version_info Contains the TLS version used
# TYPE probe_tls_version_info gauge
probe_tls_version_info{version="TLS 1.2"} 1



Module configuration:
prober: http
timeout: 1m0s
http:
    ip_protocol_fallback: true
    oauth2:
        client_id: my_client_id
        client_secret: <secret>
        client_secret_file: ""
        scopes:
            - email
            - profile
            - openid
        token_url: https://my_domain.com/token.oauth2
        endpoint_params:
            grant_type: password
            password: my_password
            username: my_username
    follow_redirects: true
    enable_http2: true
tcp:
    ip_protocol_fallback: true
icmp:
    ip_protocol_fallback: true
dns:
    ip_protocol_fallback: true
    recursion_desired: true

What did you see instead?

We get no response from the Blackbox service

[root@ip-XX-XXX-XXX-XXX ec2-user]# curl 'http://localhost:9115/probe?module=oauth2&debug=true&target=https://prometheus.my_domain.com/'
^C
[root@ip-XX-XXX-XXX-XXX ec2-user]#

We see a lot of established connections to the SSO auth server

[root@ip-XX-XXX-XXX-XXX ec2-user]# netstat -tunp|grep blackbox|wc -l
993
[root@ip-XX-XXX-XXX-XXX ec2-user]# netstat -tunp|grep blackbox|grep <SSO_IP1>:443|wc -l
518
[root@ip-XX-XXX-XXX-XXX ec2-user]# netstat -tunp|grep blackbox|grep <SSO_IP2>:443|wc -l
472
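
To see which remote endpoints hold the connections without running one grep per IP, the netstat output can be grouped by remote address. A minimal sketch, assuming the usual `netstat -tunp` column layout (proto, recv-q, send-q, local address, remote address, state, PID/program); the function name is hypothetical.

```python
from collections import Counter


def count_by_remote(netstat_output, process="blackbox"):
    """Count ESTABLISHED TCP connections per remote address for one process.

    Unlike a plain `grep blackbox | wc -l`, this ignores sockets in other
    states (TIME_WAIT, CLOSE_WAIT, ...) and any UDP rows.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and rows that do not look like TCP connection rows.
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        state = fields[5]
        program = " ".join(fields[6:])  # e.g. "19860/blackbox_expo"
        if state != "ESTABLISHED" or process not in program:
            continue
        counts[fields[4]] += 1  # fields[4] is the remote address:port
    return counts
```

The function takes the text that `netstat -tunp` prints, for example read from standard input, and `count_by_remote(text).most_common(5)` lists the busiest remote endpoints.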

We see the following errors

[root@ip-XX-XXX-XXX-XXX ec2-user]# systemctl status blackbox_exporter.service
● blackbox_exporter.service - Blackbox Exporter
   Loaded: loaded (/usr/lib/systemd/system/blackbox_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2022-07-01 10:14:47 UTC; 1h 10min ago
 Main PID: 19860 (blackbox_export)
   Memory: 87.6M
   CGroup: /system.slice/blackbox_exporter.service
           └─19860 /usr/bin/blackbox_exporter-0.21.1.linux-amd64/blackbox_exporter --config.file /etc/prometheus/blackbox_exporter.yml

Jul 01 11:24:40 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:40 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:41 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:41 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:42 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:42 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:43 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:43 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:44 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:44 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:45 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:45 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:46 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:46 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:47 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:47 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:48 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:48 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
Jul 01 11:24:49 ip-XX-XXX-XXX-XXX.us-west-2.compute.internal blackbox_exporter[19860]: 2022/07/01 11:24:49 http: Accept error: accept tcp [::]:9115: accept4: too many open files; retrying in 1s
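
The accept4 errors above mean the process has hit its open-file limit. A minimal sketch for comparing the number of open descriptors with the soft limit, assuming a Linux /proc layout; the function name and the proc_root parameter are illustrative only.

```python
import os


def fd_usage(pid, proc_root="/proc"):
    """Return (open_fd_count, soft_limit) for a process, read from /proc.

    soft_limit is None if the "Max open files" row is missing or not numeric.
    """
    pid_dir = os.path.join(proc_root, str(pid))
    # Every entry under /proc/<pid>/fd is one open file descriptor.
    open_fds = len(os.listdir(os.path.join(pid_dir, "fd")))

    soft_limit = None
    with open(os.path.join(pid_dir, "limits")) as fh:
        for line in fh:
            if line.startswith("Max open files"):
                # Columns: "Max open files", soft limit, hard limit, units.
                value = line.split()[3]
                if value.isdigit():
                    soft_limit = int(value)
                break
    return open_fds, soft_limit
```

For the process in the status output, `fd_usage(19860)` would be called as root or as the user running the exporter, since reading another user's /proc/<pid>/fd needs matching permissions.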
@day4me

day4me commented Jul 5, 2022

I have the same issue. Would appreciate it if someone could take a look into it.

@sergeykucher

I ran into the same issue. One workaround is to restart the blackbox_exporter service periodically with crontab.

@roidelapluie
Member

Hello

I have worked on a fix here: prometheus/common#390

@1337andre

I think it was fixed in 0.22.0. It has been running for 2 days without any problems.

@electron0zero
Member

Closing this; it was fixed in https://github.com/prometheus/common/releases/tag/v0.36.0.

We are at github.com/prometheus/common:v0.37.0 in blackbox_exporter.
