
bug: high cpu usage after prometheus.lua report 'no memory' #10000

Closed
wklken opened this issue Aug 10, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@wklken

wklken commented Aug 10, 2023

Current Behavior

  apisixConfig:
    luaSharedDict:
      .......
      prometheus-metrics: 10m

We set the prometheus-metrics shared dict to 10m, and after about 7 days running online (1 deployment, 8 pods), the memory of each pod was exhausted one by one.
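
(For context: this Helm value presumably ends up rendered as an OpenResty shared dict of that size in nginx.conf, roughly the line below; the prometheus plugin stores all its counters in this dict, and 'no memory' means the dict is full.)

  lua_shared_dict prometheus-metrics 10m;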

We took a look at each pod that was killed because its CPU hit the resource limit, based on the Grafana dashboard and the error log:

(screenshot: Grafana dashboard)

When the metrics chart loses data points, it means /metrics did not respond within 30 seconds; maybe the response is too big?

The error log only contains 'no memory' errors:

[error] 76#76: *62508500 [lua] prometheus.lua:920: log_error(): Error while setting 'etcd_modify_indexes{key="x_etcd_index"}' to '98387': 'no memory', client: 1.1.1.1, server: , request: "GET /metrics HTTP/1.1", host: "0.0.0.0:6008"

After a few hours, the APISIX container hits the CPU limit and gets restarted.

(screenshot: container CPU usage)

Before the container hit the CPU limit, it had been reporting 'no memory' for a few hours.

We have many other environments; whenever an environment shows restarts we redeploy APISIX, and in every case there were no restarts before the prometheus plugin reported 'no memory'.

Expected Behavior

No high CPU usage even when prometheus.lua reports 'no memory'.

Error Logs

[error] 76#76: *62508500 [lua] prometheus.lua:920: log_error(): Error while setting 'etcd_modify_indexes{key="x_etcd_index"}' to '98387': 'no memory', client: 1.1.1.1, server: , request: "GET /metrics HTTP/1.1", host: "0.0.0.0:6008"

Steps to Reproduce

  1. set the prometheus-metrics shared dict to a limited size
  2. deploy it with a CPU resource limit
  3. add /metrics as a Prometheus scrape target with the scrape interval set to 1 second (a sample scrape config is sketched below)
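
A minimal Prometheus scrape config matching these steps could look like the sketch below (job name and target address are placeholders; port 6008 just mirrors the host in the error log above):

  scrape_configs:
    - job_name: "apisix"
      scrape_interval: 1s
      metrics_path: /metrics
      static_configs:
        - targets: ["apisix.example.svc.cluster.local:6008"]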

Environment

  • APISIX version (run apisix version): 3.2.0
  • Operating system (run uname -a):
  • OpenResty / Nginx version (run openresty -V or nginx -V):
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info): 3.5.4
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
@wklken
Author

wklken commented Aug 10, 2023

From the logs:

  1. the apisix.node_listen port was still serving requests before the restart (per access.log)
  2. prometheus.lua:920 reports 'no memory' in error.log
  3. Prometheus could not scrape /metrics from APISIX for a few hours

The ServiceMonitor config:

    interval: "30s"
    scrapeTimeout: "30s"

So if the response is too slow (maybe because the metrics data is too big?), the scrape times out and the line disappears from the chart.

If the prometheus privileged process keeps serving /metrics requests for hours and they all time out => high CPU usage => the livenessProbe/readinessProbe against APISIX is affected and also times out (context deadline exceeded (Client Timeout exceeded while awaiting headers)) => the container gets restarted.
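
(The probes in question are plain HTTP probes with a short timeout, so a few slow responses are enough to trigger a restart; the values below are only an illustration, not our exact manifest.)

  livenessProbe:
    httpGet:
      path: /healthz       # illustrative path/port
      port: 9080
    timeoutSeconds: 1
    periodSeconds: 10
    failureThreshold: 3    # a few consecutive timeouts and kubelet restarts the container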

@wklken
Author

wklken commented Aug 10, 2023

(screenshots: /metrics response size and container CPU usage)

When /metrics is about 15-20 MB, each Prometheus scrape drives the CPU to 100%.

@wklken
Author

wklken commented Aug 11, 2023

  • limit: 48000 lines, about 14 MB
  • series per metric: http_status = route; bandwidth = type * route; http_latency = type * route * (len(default_buckets) + 1)
  • route + 3 * route + 3 * route * (n + 1) <= 48000
  • (7 + 3n) * route <= 48000 (n >= 1)

  • version >= 3.4.x: default_buckets can be configured; with n = 1, route <= 4800
  • version <= 3.4: default_buckets can't be configured; with n = 15, route <= 923
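
A quick sketch of that arithmetic in Lua (assuming 3 label types for bandwidth and latency, as in the formula above):

  -- (7 + 3n) * routes <= limit, where n = number of default_buckets:
  -- http_status adds 1 series per route, bandwidth 3, http_latency 3 * (n + 1)
  local function max_routes(n, limit)
    return math.floor(limit / (7 + 3 * n))
  end

  print(max_routes(1, 48000))   -- 4800 routes (buckets trimmed to 1, >= 3.4.x)
  print(max_routes(15, 48000))  -- 923 routes (default 15 buckets, <= 3.4)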

@Revolyssup Revolyssup added the bug Something isn't working label Aug 11, 2023
@Revolyssup
Contributor

related etcd issue: #7353

@Revolyssup
Contributor

@wklken This was identified as an etcd issue and was fixed in etcd release 3.5.5 etcd-io/etcd#14138

@wklken
Author

wklken commented Sep 18, 2023

@Revolyssup
I use APISIX 3.2.1 (where prometheus runs as a privileged process), and I use HTTP to connect to etcd:

deployment:
  etcd:
    host:
      - "http://bk-apigateway-etcd:2379"

So I'm not sure whether the high CPU usage caused by the huge metrics is related to #7345.

@Revolyssup
Contributor

For the 'no memory' issue, can you increase the size of the shared dict? It keeps retrying when it is out of memory, and the CPU usage might be correlated with that.
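
For example, with the same Helm values as in the description, that would just mean bumping the size, e.g.:

  apisixConfig:
    luaSharedDict:
      prometheus-metrics: 50m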

@wklken
Author

wklken commented Sep 19, 2023

@Revolyssup

It's not just the memory issue. We changed the shared dict to prometheus-metrics: 50m, and with about 20000 registered routes, after curling each of them once, /metrics grows to about 15-20 MB. At that point, curling /metrics drives the container to 100% CPU usage, so it hits the limit and gets restarted.

You can reproduce it by creating 20000 routes with curl against the Admin API in a loop and then curling each of them once (see the sketch below).
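
A rough sketch of that reproduction (ports, the API key, and the upstream node are placeholders):

  # create 20000 minimal routes, then hit each of them once
  for i in $(seq 1 20000); do
    curl -s -X PUT "http://127.0.0.1:9180/apisix/admin/routes/$i" \
      -H "X-API-KEY: ${ADMIN_KEY}" \
      -d '{"uri": "/bench/'"$i"'", "upstream": {"type": "roundrobin", "nodes": {"127.0.0.1:1980": 1}}}'
    curl -s "http://127.0.0.1:9080/bench/$i" > /dev/null
  done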

Maybe it takes a lot of CPU for the prometheus plugin to dump (or compute) the data from shared memory into the response?

Currently, I have to increase the limits and patch in a switch to disable the official prometheus metrics.

@Revolyssup
Contributor

@wklken The performance itself is not something we can fix. But, as many users have asked, the built-in metrics need to be configurable (not hardcoded), since some metrics define labels whose values vary greatly in some cases, which makes the number of metric variants grow a lot. We do not delete outdated metrics, so every time Prometheus pulls, the CPU usage increases a lot. We always advise users to customize the metrics they need when they use the prometheus plugin. So in this case, the patch you have is the only solution. You can close this issue if this answers your question.

@wklken
Author

wklken commented Sep 19, 2023

OK, understood! #9673 makes it possible to reduce the amount of /metrics data, but it would be nice to have a config option to disable them all.

Closing, thanks for your response.

@wklken wklken closed this as completed Sep 19, 2023