prometheus is very slow for query and almost unavailable #13953

Open
Melody-zyy opened this issue Apr 18, 2024 · 3 comments

Comments

@Melody-zyy

Melody-zyy commented Apr 18, 2024

What did you do?

  • I used thanos-sidecar:v0.30.2 and prometheus:v2.42.0.
  • Without any changes on my side, Prometheus slowed down and became almost unavailable.

What did you expect to see?

  1. I want to know why this happens (I can provide the relevant metrics data).
  2. Which metrics can I use to observe this?
  3. How can I avoid this situation?

What did you see instead? Under which circumstances?

Without any changes, the following symptoms occur:

  • The Prometheus /metrics endpoint responds slowly, which indicates that Prometheus is almost completely unavailable.

  • CPU and memory usage show no spikes or sustained growth:
    image
    image

  • scrape_duration_seconds is high:

image

  • error logs:

ts=2024-04-18T14:09:50.169Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:50914: write: broken pipe"
ts=2024-04-18T14:10:21.205Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:59408: write: broken pipe"
ts=2024-04-18T14:10:22.237Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:43920: write: broken pipe"
ts=2024-04-18T14:11:48.881Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:57992: write: broken pipe"

  • The average value of prometheus_http_request_duration_seconds is high:

image
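
For reference, an average like the one plotted for prometheus_http_request_duration_seconds is usually derived from the histogram's `_sum` and `_count` series. Below is a minimal Python sketch of that arithmetic, assuming two cumulative readings taken some time apart. `HistogramSample` and `average_duration` are hypothetical helpers for illustration, not part of Prometheus, and the sample values are made up rather than taken from this issue.

```python
from typing import NamedTuple


class HistogramSample(NamedTuple):
    """One reading of a histogram's cumulative _sum and _count series."""
    total_seconds: float  # value of ..._sum
    requests: float       # value of ..._count


def average_duration(earlier: HistogramSample, later: HistogramSample) -> float:
    """Average seconds per request between two cumulative samples.

    Similar in spirit to the PromQL expression
      rate(prometheus_http_request_duration_seconds_sum[5m])
        / rate(prometheus_http_request_duration_seconds_count[5m])
    Counter resets are not handled in this sketch.
    """
    delta_requests = later.requests - earlier.requests
    if delta_requests <= 0:
        return float("nan")  # no requests completed in the window
    return (later.total_seconds - earlier.total_seconds) / delta_requests


# Illustrative values only (assumption), not data from the issue.
before = HistogramSample(total_seconds=120.0, requests=1000)
after = HistogramSample(total_seconds=420.0, requests=1100)
print(average_duration(before, after))  # 3.0
```

Because an average hides the distribution, a few very slow requests can dominate it, so comparing it per handler may help narrow down which endpoint is slow.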

System information

No response

Prometheus version

prometheus version:2.42.0
thanos-sidecar version:v0.30.2

Prometheus configuration file

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  scrape_timeout: 25s
kubernetes_sd_configs:
    - role: pod
scrape_configs:
- job_name: xxx
  scrape_interval: 30s
  scrape_timeout: 30s
  scheme: http
  metrics_path: /metrics
  ...

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

ts=2024-04-18T14:09:48.227Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:58962: write: broken pipe"
ts=2024-04-18T14:09:48.241Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:59676: write: broken pipe"
ts=2024-04-18T14:09:48.539Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:60052: write: broken pipe"
ts=2024-04-18T14:09:50.169Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:50914: write: broken pipe"
ts=2024-04-18T14:10:21.205Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:59408: write: broken pipe"
@Melody-zyy changed the title from "prometheus is very slow for query and scrape_duration_seconds metrics is very high,prometheus is unavailable" to "prometheus is very slow for query and prometheus is almost unavailable" on Apr 18, 2024
@Melody-zyy changed the title from "prometheus is very slow for query and prometheus is almost unavailable" to "prometheus is very slow for query and almost unavailable" on Apr 18, 2024

@GiedriusS
Contributor

Was head GC happening around that time? Do you see that in logs?

@Melody-zyy
Author

> Was head GC happening around that time? Do you see that in logs?

Head GC entries do appear in the logs, but their timing doesn't match the slowdown:
image
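
One way to check the timing is to bucket the "broken pipe" errors per minute and compare those minutes with the head GC log timestamps. The following is a rough Python sketch, assuming the logfmt layout of the log lines posted in this issue. `broken_pipe_per_minute` is a hypothetical helper, and the sample lines are copied from the Logs section.

```python
import re
from collections import Counter

# Captures the timestamp up to the minute, e.g. "2024-04-18T14:09".
TS_RE = re.compile(r"\bts=(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}")


def broken_pipe_per_minute(lines):
    """Count "broken pipe" log lines per minute of their ts= field."""
    counts = Counter()
    for line in lines:
        if "broken pipe" not in line:
            continue
        match = TS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    'ts=2024-04-18T14:09:48.227Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:58962: write: broken pipe"',
    'ts=2024-04-18T14:09:48.241Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:59676: write: broken pipe"',
    'ts=2024-04-18T14:09:48.539Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:60052: write: broken pipe"',
    'ts=2024-04-18T14:09:50.169Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:50914: write: broken pipe"',
    'ts=2024-04-18T14:10:21.205Z caller=api.go:1578 level=error component=web msg="error writing response" bytesWritten=0 err="write tcp 127.0.0.1:9090->127.0.0.1:59408: write: broken pipe"',
]

for minute, count in sorted(broken_pipe_per_minute(sample).items()):
    print(minute, count)
# 2024-04-18T14:09 4
# 2024-04-18T14:10 1
```

Running the same function over the full log and listing the head GC timestamps next to the result would show whether the two overlap.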

@Melody-zyy
Author

Here is the Prometheus pprof goroutine profile:
image

image
image
image
