Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/-/healthy should fail if metrics are inconsistent. #544

Open
tcolgate opened this issue Mar 24, 2023 · 2 comments
Open

/-/healthy should fail if metrics are inconsistent. #544

tcolgate opened this issue Mar 24, 2023 · 2 comments

Comments

@tcolgate
Copy link

tcolgate commented Mar 24, 2023

Feature request

We are seeing pushgateway occasionally get into a state where it will not accept metrics. The UI reports everything as "last push failed", and no new metrics are collected. This is rare, and only happens in prod. Killing the process fixes the issue (so it doesn't seem to be persisted state). It looks a lot like the situation with identical metrics across groups, but seems to impact everything. At this point the pod does seem to still serve /metrics (prom see samples from scrapes).

If client are submitting bad data, then one option would be to run with --push.disable-consistency-check. And then wait for /metrics scrapes to fail, and have the pod die. An even nicer approach would be that, once scrapes fail, /-/healthy should also fail (the process is literally unhealthy, and wont serve metrics), allowing orchestration to kill it.

What did you do?
Run

docker run -p 9091:9091  prom/pushgateway:v1.3.0 --push.disable-consistency-check

cat <<EOF | curl -X POST --data-binary @- http://127.0.0.1:9091/metrics/job/some_job/tag/val1
# TYPE some_metric counter
some_metric 1
EOF

cat <<EOF | curl -v -X POST --data-binary @- http://127.0.0.1:9091/metrics/job/some_job
# TYPE some_metric counter
some_metric{tag="val1"} 42
EOF

At this point the server is in an inconsistant state

$ curl -v  http://localhost:9091/metrics
...
< HTTP/1.1 500 Internal Server Error
...

What did you expect to see?

$ curl -v  http://localhost:9091/-/healthy  
...
...
< HTTP/1.1 200 OK

What did you see instead? Under which circumstances?
Ideally , once /metrics cannot be served, /-/healthy should return
an error code.

  • Pushgateway version:

v1.3.0

  • Pushgateway command line:

--push.disable-consistency-check

  • Logs:
    (I've no complaint about anything in the logs)
level=info ts=2023-03-24T10:21:11.546Z caller=main.go:83 msg="starting pushgateway" version="(version=1.3.0, branch=HEAD, revision=c28992c985ce6c4fcf4247ba9736b72a3d43882f)"
level=info ts=2023-03-24T10:21:11.546Z caller=main.go:84 build_context="(go=go1.15.2, user=root@cf69166ae53e, date=20201001-12:03:34)"
level=info ts=2023-03-24T10:21:11.547Z caller=main.go:137 listen_address=:9091
level=error ts=2023-03-24T10:21:26.999Z caller=main.go:56 msg="error gathering metrics: collected metric \"some_metric\" { label:<name:\"instance\" value:\"\" > label:<name:\"job\" value:\"some_job\" > label:<name:\"tag\" value:\"val1\" > counter:<value:42 > } was collected before with the same name and label values\n"
level=info ts=2023-03-24T10:25:06.323Z caller=main.go:249 msg="received SIGINT/SIGTERM; exiting gracefully..."
level=error ts=2023-03-24T10:25:06.323Z caller=main.go:197 msg="HTTP server stopped" err="accept tcp [::]:9091: use of closed network connection"
@beorn7
Copy link
Member

beorn7 commented Mar 27, 2023

Hmmm… first of all I would like to understand what's actually going wrong in the first case (where consistency checks are enabled IIUC, and still the PGW reaches an inconsistent state). This smells like an actual bug that we shouldn't gloss over.

I'm also surprised that this state isn't persisted. It makes things even weirder.

If persistence worked (which it should), restarting the PGW wouldn't really help. If persistence is switched off, a restart would help for the time being, but it would also wipe the state you might want to investigate to understand what inconsistent metric has been pushed. And finally, with consistency checks active, the PGW should never get into this state in the first place, and auto-restarts would just be a work-around for a bug we don't understand.

From that perspective, I'm not so sure if failing the health check in case of inconsistent metrics is a good idea. A middle ground could be to put that behavior behind a flag.

@tcolgate
Copy link
Author

I do agree, and some other aspects strongly suggest something else was going on in our case. We've seen this twice, but unforatunately over the course of 2 (possibly 3) years.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants