
prometheus client library panics with counter cannot decrease in value #108311

Closed
tkashem opened this issue Feb 23, 2022 · 7 comments · Fixed by #108328

Labels
kind/bug Categorizes issue or PR as related to a bug.
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation.
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@tkashem (Contributor) commented Feb 23, 2022

What happened?

We have unit test(s) that gather metrics for verification using legacyregistry.DefaultGatherer, for example:

// gathers and checks the metrics.
func checkForExpectedMetrics(t *testing.T, expectedMetrics []string) {
	metricsFamily, err := legacyregistry.DefaultGatherer.Gather()
	if err != nil {
		t.Fatalf("unexpected error from Gather: %v", err)
	}
	// ... verify metricsFamily contains each of expectedMetrics ...
}

We have seen the test panic with the following error:

panic: counter cannot decrease in value

The panic seems to originate from the prometheus client library:

goroutine 6151 [running]:
k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus.(*counter).Add(0xc0003a41e0, 0xc067600000000000)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus/counter.go:109 +0x1a5
k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus.(*goCollector).Collect(0xc0003ac000, 0xc0001e4f50)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus/go_collector_go117.go:147 +0x6a9
k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus/registry.go:446 +0x134
created by k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus.(*Registry).Gather
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus/registry.go:538 +0x1138

Using the stack trace, we can see the offending code site:
https://github.com/kubernetes/kubernetes/blob/master/vendor/github.com/prometheus/client_golang/prometheus/counter.go#L109

func (c *counter) Add(v float64) {
	if v < 0 {
		panic(errors.New("counter cannot decrease in value"))
	}
...
}

I checked the APF metrics package; we don't set a decreasing value on any counter, and it's clear from the stack trace that the Add call is being issued by the Go collector.
https://github.com/kubernetes/kubernetes/blob/master/vendor/github.com/prometheus/client_golang/prometheus/go_collector_go117.go#L141-L149

		case *counter:
			// Guard against decreases. This should never happen, but a failure
			// to do so will result in a panic, which is a harsh consequence for
			// a metrics collection bug.
			v0, v1 := m.get(), unwrapScalarRMValue(sample.Value)
			if v1 > v0 {
				m.Add(unwrapScalarRMValue(sample.Value) - m.get())
			}
			m.Collect(ch)

So m.Add(unwrapScalarRMValue(sample.Value) - m.get()) is producing a negative value here: the current value is read once for the v1 > v0 check and again when computing the delta, so a concurrent Collect can advance the counter in between and make the difference negative. In other words, the vendored version of the collector is not thread safe; the Prometheus folks have already added a lock to the Go 1.17 collector (a sketch of the pattern follows the links below).
prometheus/client_golang#975

prometheus/client_golang@648c419#diff-6096ed186872e4be5a35c1c6e8fbee95be0f2268753ad7b1d9685242e5dc681bR141-R150
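
To make the race concrete, here is a minimal, self-contained sketch (my own illustration, not the actual client_golang code; racyCounter and catchUp are made-up names) of the read-check-Add pattern quoted above, and of how serializing it with a mutex, which is essentially what the upstream fix does, keeps the computed delta from going negative:

package collectorsketch

import "sync"

type racyCounter struct {
	mu  sync.Mutex
	val float64 // like a Prometheus counter, val must never decrease
}

// catchUp brings val up to the latest sample, mirroring the collector's
// `if v1 > v0 { m.Add(sample - m.get()) }`. The current value is read twice:
// once for the check and once when computing the delta. Without the lock, a
// concurrent call can advance val between those two reads, so (latest - val)
// can come out negative, which is exactly what makes counter.Add panic.
// Holding mu for the whole sequence makes the check and the update atomic
// with respect to other Gather calls.
func (c *racyCounter) catchUp(latest float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if latest > c.val {
		c.val += latest - c.val
	}
}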

The reason I see the test TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter fail with this panic is that its subtests (which gather metrics) run in parallel; a hypothetical example of that shape follows below.
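
For illustration only, a test shaped roughly like this (the test and subtest names are hypothetical, and I'm assuming the usual k8s.io/component-base/metrics/legacyregistry import) is enough to drive the Go collector from two goroutines at once, which is the condition that exposes the race in v1.12.0:

package apfmetricstest

import (
	"testing"

	"k8s.io/component-base/metrics/legacyregistry"
)

// TestGatherFromParallelSubtests gathers from the shared legacy registry in
// parallel subtests. Each Gather runs the registered collectors, including
// the Go runtime collector, so two subtests can execute its Collect method
// concurrently and hit the unguarded read-then-Add shown above.
func TestGatherFromParallelSubtests(t *testing.T) {
	for _, name := range []string{"first", "second"} {
		name := name // capture range variable for the parallel closure
		t.Run(name, func(t *testing.T) {
			t.Parallel()
			if _, err := legacyregistry.DefaultGatherer.Gather(); err != nil {
				t.Errorf("%s: unexpected error gathering metrics: %v", name, err)
			}
		})
	}
}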

I think we should re-vendor the prometheus client library. Any thoughts?

What did you expect to happen?

No panic

How can we reproduce it (as minimally and precisely as possible)?

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/108013/pull-kubernetes-unit/1496512399061028864

@tkashem (Contributor, Author) commented Feb 23, 2022

/sig instrumentation
/sig api-machinery (apf unit test flakes due to this issue, feel free to remove the sig label)

@k8s-ci-robot (Contributor) commented

@tkashem: The label(s) sig/(apf, sig/unit, sig/test, sig/flakes, sig/due, sig/to, sig/this, sig/issue,, sig/feel, sig/free, sig/to, sig/correct, sig/it) cannot be applied, because the repository doesn't have them.

In response to this:

/sig instrumentation
/sig api-machinery (apf unit test flakes due to this issue, feel free to correct it)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tkashem (Contributor, Author) commented Feb 23, 2022

cc @dgrisonnet

@tkashem (Contributor, Author) commented Feb 23, 2022

cc @benluddy

@aojea (Member) commented Feb 24, 2022

We are vendoring v1.12.0; should we update the version we are vendoring to pick up the fix?

I see the fix is in v1.12.1: https://github.com/prometheus/client_golang/releases/tag/v1.12.1

@dgrisonnet (Member) commented


Yeah, I think we should update to v1.12.1. I was confused at first as to why we hadn't seen this bug before, since the Go collector has been used in Kubernetes for a while, but it seems that its implementation was recently changed and the bug was introduced in v1.12.0.

@fedebongio (Contributor) commented


/triage accepted
