v1.12.0 race condition #968

Closed
nicolaasuni opened this issue Jan 19, 2022 · 11 comments · Fixed by #969

Comments

@nicolaasuni

After upgrading to github.com/prometheus/client_golang v1.12.0, I am getting "WARNING: DATA RACE" while testing:
https://github.com/nexmoinc/gosrvlib/tree/main/pkg/metrics/prometheus

The race conditions do not appear in version 1.11.0 or any earlier version.

The race condition also does not appear when t.Parallel() is disabled in the unit tests (e.g. https://github.com/nexmoinc/gosrvlib/blob/main/pkg/metrics/prometheus/client_test.go#L153).
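A minimal sketch of the parallel-gather pattern that surfaces the warning (hypothetical test names, not the gosrvlib tests themselves), assuming each test registers the runtime/metrics-based Go collector that is new in v1.12.0; since races are timing-dependent, it may not trip the detector on every run:

```go
package prometheus_test

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
)

// gatherOnce builds a registry with the new Go collector and gathers from it,
// mirroring what testutil.GatherAndCount and the promhttp handler do in the
// failing tests.
func gatherOnce(t *testing.T) {
	t.Helper()
	reg := prometheus.NewRegistry()
	reg.MustRegister(collectors.NewGoCollector())
	if _, err := reg.Gather(); err != nil {
		t.Fatal(err)
	}
}

func TestGatherA(t *testing.T) {
	t.Parallel() // removing this makes the warning go away
	gatherOnce(t)
}

func TestGatherB(t *testing.T) {
	t.Parallel()
	gatherOnce(t)
}
```

Run with `go test -race` to see the report.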

The relevant errors are:

==================
WARNING: DATA RACE
Read at 0x00c000598660 by goroutine 59:
 runtime.mapaccess2_faststr()
     /usr/local/go/src/runtime/map_faststr.go:107 +0x0
 runtime/metrics.runtime_readMetrics()
     /usr/local/go/src/runtime/metrics.go:563 +0xb9
 github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:446 +0x133

Previous write at 0x00c000598660 by goroutine 70:
 runtime.mapassign_faststr()
     /usr/local/go/src/runtime/map_faststr.go:202 +0x0
 runtime.initMetrics()
     /usr/local/go/src/runtime/metrics.go:67 +0x1c6
 github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:446 +0x133

Goroutine 59 (running) created at:
 github.com/prometheus/client_golang/prometheus.(*Registry).Gather()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:538 +0x1137
 github.com/prometheus/client_golang/prometheus/testutil.GatherAndCount()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/testutil/testutil.go:139 +0x7a
 github.com/nexmoinc/gosrvlib/pkg/metrics/prometheus.TestIncLogLevelCounter()
     /home/user/go/src/github.com/nexmoinc/gosrvlib/pkg/metrics/prometheus/client_test.go:144 +0x157
 testing.tRunner()
     /usr/local/go/src/testing/testing.go:1259 +0x22f
 testing.(*T).Run·dwrap·21()
     /usr/local/go/src/testing/testing.go:1306 +0x47

Goroutine 70 (finished) created at:
 github.com/prometheus/client_golang/prometheus.(*Registry).Gather()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:538 +0x1137
 github.com/prometheus/client_golang/prometheus/promhttp.HandlerFor.func1()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/promhttp/http.go:126 +0x11a
 net/http.HandlerFunc.ServeHTTP()
     /usr/local/go/src/net/http/server.go:2047 +0x4d
 github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerInFlight.func1()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/promhttp/instrument_server.go:40 +0x118
 net/http.HandlerFunc.ServeHTTP()
     /usr/local/go/src/net/http/server.go:2047 +0x4d
 github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/promhttp/instrument_server.go:117 +0xe2
 net/http.HandlerFunc.ServeHTTP()
     /usr/local/go/src/net/http/server.go:2047 +0x4d
 net/http.Handler.ServeHTTP-fm()
     /usr/local/go/src/net/http/server.go:87 +0x75
 net/http.HandlerFunc.ServeHTTP()
     /usr/local/go/src/net/http/server.go:2047 +0x4d
 github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerRequestSize.func2()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/promhttp/instrument_server.go:204 +0xb3
 net/http.HandlerFunc.ServeHTTP()
     /usr/local/go/src/net/http/server.go:2047 +0x4d
 github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/promhttp/instrument_server.go:239 +0xec
 net/http.HandlerFunc.ServeHTTP()
     /usr/local/go/src/net/http/server.go:2047 +0x4d
 github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/promhttp/instrument_server.go:117 +0xe2
 net/http.HandlerFunc.ServeHTTP()
     /usr/local/go/src/net/http/server.go:2047 +0x4d
 github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/promhttp/instrument_server.go:84 +0xca
 net/http.HandlerFunc.ServeHTTP()
     /usr/local/go/src/net/http/server.go:2047 +0x4d
 github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerInFlight.func1()
     /home/user/go/pkg/mod/github.com/prometheus/client_golang@v1.12.0/prometheus/promhttp/instrument_server.go:40 +0x118
 net/http.HandlerFunc.ServeHTTP()
     /usr/local/go/src/net/http/server.go:2047 +0x4d
 github.com/nexmoinc/gosrvlib/pkg/metrics/prometheus.TestInstrumentHandler()
     /home/user/go/src/github.com/nexmoinc/gosrvlib/pkg/metrics/prometheus/client_test.go:89 +0x3a1
 testing.tRunner()
     /usr/local/go/src/testing/testing.go:1259 +0x22f
 testing.(*T).Run·dwrap·21()
     /usr/local/go/src/testing/testing.go:1306 +0x47
==================
=== CONT  TestInstrumentDB
   testing.go:1152: race detected during execution of test
--- FAIL: TestInstrumentDB (0.03s)
=== CONT  TestIncLogLevelCounter
   testing.go:1152: race detected during execution of test
--- FAIL: TestIncLogLevelCounter (0.04s)
=== CONT  TestInstrumentRoundTripper
   testing.go:1152: race detected during execution of test
--- FAIL: TestInstrumentRoundTripper (0.06s)
=== CONT  TestInstrumentHandler
   testing.go:1152: race detected during execution of test
--- FAIL: TestInstrumentHandler (0.07s)
FAIL
@kakkoyun
Member

Interesting 🤔 I do not recall any changes that could cause this. We'll look into it.

@kakkoyun
Member

Could you provide more details about your current setup? OS? Arch? And the test command that you use?

@nicolaasuni
Author

nicolaasuni commented Jan 20, 2022

@kakkoyun Thank you for looking into this.

The testing environment is:

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.3 LTS
Release:	20.04
Codename:	focal
Linux xxxx 5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

go version go1.17.6 linux/amd64

I have pushed the failing version on the "dependencies" branch:
https://github.com/nexmoinc/gosrvlib/tree/dependencies

You can test the prometheus package via the command:

gotest -race -failfast -v github.com/nexmoinc/gosrvlib/pkg/metrics/prometheus

@beorn7
Member

beorn7 commented Jan 20, 2022

I think the problem is that the newly implemented Go collector doesn't have a goroutine-safe Collect method. In particular, the call to metrics.Read needs to be protected; see

metrics.Read(c.rmSampleBuf)

/cc @mknyszek

@mknyszek
Contributor

Hm... metrics.Read is serialized internally (with respect to itself), though. There may be other parts of Collect that are not properly serialized.

@beorn7
Member

beorn7 commented Jan 20, 2022

But metrics.Read gets c.rmSampleBuf passed in. So even if metrics.Read itself is serialized, c.rmSampleBuf is later ranged over while another goroutine might write to it.
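A minimal sketch, in Go, of the hazard described here (the collector type and field names are assumptions mirroring the discussion, not the library's actual code): two concurrent scrapes share one sample buffer, so ranging over it in one goroutine overlaps with a Read issued by the other.

```go
package main

import "runtime/metrics"

// collector is a stand-in for the Go collector: it reuses one sample
// buffer across scrapes, like c.rmSampleBuf in the discussion.
type collector struct {
	buf []metrics.Sample
}

func (c *collector) collect(sink func(metrics.Sample)) {
	// metrics.Read is serialized internally...
	metrics.Read(c.buf)
	// ...but nothing stops a concurrent collect from calling Read into the
	// same buffer while this loop is still reading it.
	for _, s := range c.buf {
		sink(s)
	}
}

func main() {
	descs := metrics.All()
	buf := make([]metrics.Sample, len(descs))
	for i := range buf {
		buf[i].Name = descs[i].Name
	}
	c := &collector{buf: buf}

	// Two concurrent "scrapes" over the same collector exercise the hazard;
	// whether -race flags it on a given run is timing-dependent.
	done := make(chan struct{})
	go func() {
		c.collect(func(metrics.Sample) {})
		close(done)
	}()
	c.collect(func(metrics.Sample) {})
	<-done
}
```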

@mknyszek
Contributor

Right! Thanks. OK, that should be an easy fix. runtime/metrics has a critical section on the order of microseconds (even with all the histograms). Is it safe to assume that, on average, calls to Collect are spaced much further apart than that? If so, a simple mutex will likely suffice.
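For context, a minimal sketch of the simple-mutex approach being proposed (field and method names are assumptions, not the library's actual code): the lock is held for both the Read and the iteration, so a concurrent scrape cannot overwrite samples that are still being converted.

```go
package main

import (
	"runtime/metrics"
	"sync"
)

// goCollector here is a stand-in with assumed field names.
type goCollector struct {
	mu          sync.Mutex       // guards rmSampleBuf across concurrent scrapes
	rmSampleBuf []metrics.Sample // reused between scrapes
}

func (c *goCollector) collectRuntimeMetrics(emit func(metrics.Sample)) {
	c.mu.Lock()
	defer c.mu.Unlock()

	metrics.Read(c.rmSampleBuf)
	for _, s := range c.rmSampleBuf {
		emit(s)
	}
}

func main() {
	descs := metrics.All()
	buf := make([]metrics.Sample, len(descs))
	for i := range buf {
		buf[i].Name = descs[i].Name
	}
	c := &goCollector{rmSampleBuf: buf}
	c.collectRuntimeMetrics(func(metrics.Sample) {})
}
```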

@beorn7
Member

beorn7 commented Jan 20, 2022

I would probably go for a simple mutex for now. If there are smarter solutions, we can implement them later.

@beorn7
Member

beorn7 commented Jan 20, 2022

And yes, scraping can be assumed to be a relatively rare event (a few times a minute in usual cases, a few times a second in extreme cases).

@mknyszek
Contributor

Sent PR #969. Tests now pass in -race mode.

@nicolaasuni
Author

When are you planning to release and tag this fix?
Currently I can't update to v1.12.0 as it contains the race condition bug.
