
prometheus client library panics with counter cannot decrease in value #108311

Closed
tkashem opened this issue Feb 23, 2022 · 7 comments · Fixed by #108328

Labels
kind/bug Categorizes issue or PR as related to a bug.
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation.
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@tkashem (Contributor) commented Feb 23, 2022

What happened?

We have unit test(s) that gather metrics for verification using legacyregistry.DefaultGatherer, for example:

// gathers and checks the metrics.
func checkForExpectedMetrics(t *testing.T, expectedMetrics []string) {
	metricsFamily, err := legacyregistry.DefaultGatherer.Gather()
	if err != nil {
		t.Fatalf("unexpected error from Gather: %v", err)
	}
	// ... verify metricsFamily contains each of expectedMetrics ...
}

We have seen the test panic with the following error:

panic: counter cannot decrease in value

The panic seems to originate from the prometheus client library:

goroutine 6151 [running]:
k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus.(*counter).Add(0xc0003a41e0, 0xc067600000000000)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus/counter.go:109 +0x1a5
k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus.(*goCollector).Collect(0xc0003ac000, 0xc0001e4f50)
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus/go_collector_go117.go:147 +0x6a9
k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus/registry.go:446 +0x134
created by k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus.(*Registry).Gather
	/home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus/registry.go:538 +0x1138

Using the stack trace, we can see the offending code site:
https://github.com/kubernetes/kubernetes/blob/master/vendor/github.com/prometheus/client_golang/prometheus/counter.go#L109

func (c *counter) Add(v float64) {
	if v < 0 {
		panic(errors.New("counter cannot decrease in value"))
	}
...
}

I checked the APF metrics package; we don't set a decreasing value on any counter, and it's clear from the stack trace that the Add call is being issued by the Go collector.
https://github.com/kubernetes/kubernetes/blob/master/vendor/github.com/prometheus/client_golang/prometheus/go_collector_go117.go#L141-L149

		case *counter:
			// Guard against decreases. This should never happen, but a failure
			// to do so will result in a panic, which is a harsh consequence for
			// a metrics collection bug.
			v0, v1 := m.get(), unwrapScalarRMValue(sample.Value)
			if v1 > v0 {
				m.Add(unwrapScalarRMValue(sample.Value) - m.get())
			}
			m.Collect(ch)

So m.Add(unwrapScalarRMValue(sample.Value) - m.get()) is producing a negative value here: the current value is read once for the v1 > v0 check and again when computing the delta, so a concurrent Collect can advance the counter in between and make the difference negative. In other words, the vendored version of the collector is not thread safe; the Prometheus folks have already added a lock to the Go 1.17 collector (a sketch of the pattern follows the links below).
prometheus/client_golang#975

prometheus/client_golang@648c419#diff-6096ed186872e4be5a35c1c6e8fbee95be0f2268753ad7b1d9685242e5dc681bR141-R150
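
To make the race concrete, here is a minimal, self-contained sketch (my own illustration, not the actual client_golang code; racyCounter and catchUp are made-up names) of the read-check-Add pattern quoted above, and of how serializing it with a mutex, which is essentially what the upstream fix does, keeps the computed delta from going negative:

package collectorsketch

import "sync"

type racyCounter struct {
	mu  sync.Mutex
	val float64 // like a Prometheus counter, val must never decrease
}

// catchUp brings val up to the latest sample, mirroring the collector's
// `if v1 > v0 { m.Add(sample - m.get()) }`. The current value is read twice:
// once for the check and once when computing the delta. Without the lock, a
// concurrent call can advance val between those two reads, so (latest - val)
// can come out negative, which is exactly what makes counter.Add panic.
// Holding mu for the whole sequence makes the check and the update atomic
// with respect to other Gather calls.
func (c *racyCounter) catchUp(latest float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if latest > c.val {
		c.val += latest - c.val
	}
}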

The reason I see the test TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter fail with this panic is that its subtests (which gather metrics) run in parallel; a hypothetical example of that shape follows below.
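
For illustration only, a test shaped roughly like this (the test and subtest names are hypothetical, and I'm assuming the usual k8s.io/component-base/metrics/legacyregistry import) is enough to drive the Go collector from two goroutines at once, which is the condition that exposes the race in v1.12.0:

package apfmetricstest

import (
	"testing"

	"k8s.io/component-base/metrics/legacyregistry"
)

// TestGatherFromParallelSubtests gathers from the shared legacy registry in
// parallel subtests. Each Gather runs the registered collectors, including
// the Go runtime collector, so two subtests can execute its Collect method
// concurrently and hit the unguarded read-then-Add shown above.
func TestGatherFromParallelSubtests(t *testing.T) {
	for _, name := range []string{"first", "second"} {
		name := name // capture range variable for the parallel closure
		t.Run(name, func(t *testing.T) {
			t.Parallel()
			if _, err := legacyregistry.DefaultGatherer.Gather(); err != nil {
				t.Errorf("%s: unexpected error gathering metrics: %v", name, err)
			}
		})
	}
}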

I think we should re-vendor the prometheus client library. Any thoughts?

What did you expect to happen?

No panic

How can we reproduce it (as minimally and precisely as possible)?

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/108013/pull-kubernetes-unit/1496512399061028864

@tkashem (Contributor, Author) commented Feb 23, 2022

/sig instrumentation
/sig api-machinery (apf unit test flakes due to this issue, feel free to remove the sig label)

@k8s-ci-robot (Contributor) commented

@tkashem: The label(s) sig/(apf, sig/unit, sig/test, sig/flakes, sig/due, sig/to, sig/this, sig/issue,, sig/feel, sig/free, sig/to, sig/correct, sig/it) cannot be applied, because the repository doesn't have them.

In response to this:

/sig instrumentation
/sig api-machinery (apf unit test flakes due to this issue, feel free to correct it)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tkashem (Contributor, Author) commented Feb 23, 2022

cc @dgrisonnet

@tkashem (Contributor, Author) commented Feb 23, 2022

cc @benluddy

@aojea (Member) commented Feb 24, 2022

We are vendoring v1.12.0; should we update the version we are vendoring to pick up the fix?

I see the fix is in v1.12.1: https://github.com/prometheus/client_golang/releases/tag/v1.12.1

@dgrisonnet (Member) commented


Yeah, I think we should update to v1.12.1. I was confused at first as to why we hadn't seen this bug before, since the Go collector has been used in Kubernetes for a while, but it seems that its implementation was recently changed and the bug was introduced in v1.12.0.

@fedebongio (Contributor) commented


/triage accepted
