prometheus client library panics with counter cannot decrease in value
#108311
Comments
/sig instrumentation
cc @dgrisonnet
cc @benluddy
We are vendoring v1.12.0; should we update the vendored version to get the fix? I see it is in v1.12.1: https://github.com/prometheus/client_golang/releases/tag/v1.12.1
Yeah, I think we should update to v1.12.1. I was confused at first as to why we hadn't seen this bug before, since the Go collector has been used in Kubernetes for a while, but it seems its implementation was recently changed and the bug was introduced in v1.12.0.
/triage accepted
What happened?
We have unit test(s) that gather metrics for verification using `legacyregistry.DefaultGatherer`, for example `staging/src/k8s.io/apiserver/pkg/server/filters/priority-and-fairness_test.go`, lines 1288 to 1291 at commit 0ae6ef6.
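For context, a minimal sketch of that pattern (the helper name and the logging are illustrative, not the exact code at those lines):

```go
package filters

import (
	"testing"

	"k8s.io/component-base/metrics/legacyregistry"
)

// checkMetrics is an illustrative stand-in for the verification step in
// priority-and-fairness_test.go: it gathers every metric family from the
// shared default registry and inspects them.
func checkMetrics(t *testing.T) {
	mfs, err := legacyregistry.DefaultGatherer.Gather()
	if err != nil {
		t.Fatalf("failed to gather metrics: %v", err)
	}
	for _, mf := range mfs {
		t.Logf("gathered metric family %q", mf.GetName())
	}
}
```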
We have seen the test panic with the error "counter cannot decrease in value". The panic appears to originate from the prometheus client library; from the stack trace, we can see the offending code site:
https://github.com/kubernetes/kubernetes/blob/master/vendor/github.com/prometheus/client_golang/prometheus/counter.go#L109
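That check is the library's invariant that a counter can only move forward; a tiny standalone program (my own repro, not from the issue) trips it directly:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	c := prometheus.NewCounter(prometheus.CounterOpts{Name: "demo_total"})
	c.Add(1)  // fine: counters may only grow
	c.Add(-1) // panics: "counter cannot decrease in value"
}
```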
I checked the apf metrics package; we do not set a decreasing value on any counter, and it is clear from the stack trace that the `Add` call is being issued by the collector: https://github.com/kubernetes/kubernetes/blob/master/vendor/github.com/prometheus/client_golang/prometheus/go_collector_go117.go#L141-L149
So `m.Add(unwrapScalarRMValue(sample.Value) - m.get())` is producing a negative value here; it looks like the vendored version of the collector is not thread safe, and the prometheus folks have already added a lock for the Go 1.17 collector: prometheus/client_golang#975, prometheus/client_golang@648c419#diff-6096ed186872e4be5a35c1c6e8fbee95be0f2268753ad7b1d9685242e5dc681bR141-R150
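To illustrate the suspected race (a hand-rolled sketch, not the collector's code): the collector moves each counter to the latest runtime/metrics reading by adding the delta against the counter's current value, so if two unsynchronized gathers interleave and the later `Add` is computed from an older sample, the delta goes negative:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// setTo mimics the collector's read-then-add pattern; the name and the
// shadow "last" value are hypothetical, for illustration only.
func setTo(c prometheus.Counter, observed float64, last *float64) {
	c.Add(observed - *last) // panics if observed < *last
	*last = observed
}

func main() {
	c := prometheus.NewCounter(prometheus.CounterOpts{Name: "demo_total"})
	last := 0.0

	setTo(c, 10, &last) // gatherer A applies a newer sample first
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // counter cannot decrease in value
		}
	}()
	setTo(c, 8, &last) // gatherer B's older sample now yields Add(-2)
}
```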
The reason I see the test `TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter` fail with this panic is that its subtests (which gather metrics) run in parallel; a reduced sketch follows. I think we should re-vendor the prometheus client library; any thoughts?
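A hedged sketch of the triggering shape (illustrative names, not the actual subtests): parallel subtests sharing the default registry drive `Gather`, and therefore the Go collector's `Collect`, concurrently:

```go
package filters

import (
	"testing"

	"k8s.io/component-base/metrics/legacyregistry"
)

// TestGatherInParallel is an illustrative reduction of the failure mode:
// overlapping subtests gather from the shared default registry at once.
func TestGatherInParallel(t *testing.T) {
	for _, name := range []string{"a", "b"} {
		name := name // capture for the parallel closure
		t.Run(name, func(t *testing.T) {
			t.Parallel() // subtests overlap, so their Gather calls race
			if _, err := legacyregistry.DefaultGatherer.Gather(); err != nil {
				t.Fatalf("subtest %s: gather failed: %v", name, err)
			}
		})
	}
}
```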
What did you expect to happen?
No panic
How can we reproduce it (as minimally and precisely as possible)?
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/108013/pull-kubernetes-unit/1496512399061028864