profiler: TestStopLatency fails on nightly race detector tests #1294

Closed
nsrip-dd opened this issue May 14, 2022 · 9 comments · Fixed by #1297

@nsrip-dd
Contributor

Example failure:

=== RUN   TestStopLatency
    profiler_test.go:226: Stop() took longer than 200ms:
        goroutine 10284 [running]:
        gopkg.in/DataDog/dd-trace-go.v1/profiler.TestStopLatency(0xc000468180)
        	/home/circleci/dd-trace-go.v1/profiler/profiler_test.go:225 +0x41d
        testing.tRunner(0xc000468180, 0x9f1638)
        	/usr/local/go/src/testing/testing.go:1193 +0x203
        created by testing.(*T).Run
        	/usr/local/go/src/testing/testing.go:1238 +0x5d8
        
        goroutine 1 [chan receive]:
        testing.(*T).Run(0xc000102c00, 0x9d6ccc, 0xf, 0x9f1638, 0x1)
        	/usr/local/go/src/testing/testing.go:1239 +0x610
        testing.runTests.func1(0xc000102c00)
        	/usr/local/go/src/testing/testing.go:1511 +0xa7
        testing.tRunner(0xc000102c00, 0xc00019bbe8)
        	/usr/local/go/src/testing/testing.go:1193 +0x203
        testing.runTests(0xc000142180, 0xc812e0, 0x18, 0x18, 0xc097daf60aa1cf82, 0x8bb3491687, 0xc87380, 0x1)
        	/usr/local/go/src/testing/testing.go:1509 +0x613
        testing.(*M).Run(0xc00018eb80, 0x0)
        	/usr/local/go/src/testing/testing.go:1417 +0x3b4
        gopkg.in/DataDog/dd-trace-go.v1/profiler.TestMain(0xc00018eb80)
        	/home/circleci/dd-trace-go.v1/profiler/profiler_test.go:35 +0x55
        main.main()
        	_testmain.go:153 +0x372
        
        goroutine 10285 [semacquire]:
        sync.runtime_Semacquire(0xc000211548)
        	/usr/local/go/src/runtime/sema.go:56 +0x45
        sync.(*WaitGroup).Wait(0xc000211540)
        	/usr/local/go/src/sync/waitgroup.go:130 +0xe5
        gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect(0xc0001e7b20, 0xc000361440)
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:250 +0x288
        gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run.func1(0xc0001e7b20)
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:202 +0x1dd
        created by gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:197 +0x194
        
        goroutine 10306 [semacquire]:
        sync.runtime_Semacquire(0xc0001e7b54)
        	/usr/local/go/src/runtime/sema.go:56 +0x45
        sync.(*WaitGroup).Wait(0xc0001e7b54)
        	/usr/local/go/src/sync/waitgroup.go:130 +0xe5
        gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).stop(0xc0001e7b20)
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:360 +0xa8
        gopkg.in/DataDog/dd-trace-go.v1/profiler.TestStopLatency.func2(0xc0001e7b20, 0xc000024420)
        	/home/circleci/dd-trace-go.v1/profiler/profiler_test.go:212 +0x39
        created by gopkg.in/DataDog/dd-trace-go.v1/profiler.TestStopLatency
        	/home/circleci/dd-trace-go.v1/profiler/profiler_test.go:211 +0x2e9
        
        goroutine 10222 [runnable]:
        compress/flate.(*compressor).init(0xc0007ba000, 0xa63d00, 0xc000402000, 0x1, 0x1, 0xa)
        	/usr/local/go/src/compress/flate/deflate.go:578 +0x132b
        compress/flate.NewWriter(0xa63d00, 0xc000402000, 0x1, 0xa, 0xa, 0x0)
        	/usr/local/go/src/compress/flate/deflate.go:671 +0xa5
        compress/gzip.(*Writer).Write(0xc0003c00b0, 0xc0001f1500, 0x5ff, 0x700, 0x40, 0x0, 0xc0002f44f0)
        	/usr/local/go/src/compress/gzip/gzip.go:191 +0x6b6
        runtime/pprof.(*profileBuilder).build(0xc0003c2160)
        	/usr/local/go/src/runtime/pprof/proto.go:379 +0x6c5
        runtime/pprof.profileWriter(0xa63d00, 0xc000402000)
        	/usr/local/go/src/runtime/pprof/pprof.go:813 +0x11b
        created by runtime/pprof.StartCPUProfile
        	/usr/local/go/src/runtime/pprof/pprof.go:784 +0x18b
        
        goroutine 10219 [chan receive]:
        runtime/pprof.StopCPUProfile()
        	/usr/local/go/src/runtime/pprof/pprof.go:829 +0xde
        gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).stopCPUProfile(0xc0001e7b20)
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:99 +0xcc
        gopkg.in/DataDog/dd-trace-go.v1/profiler.glob..func1(0xc0001e7b20, 0x1, 0x9d2eae, 0x3, 0x9d47c1, 0x9)
        	/home/circleci/dd-trace-go.v1/profiler/profile.go:93 +0x1c5
        gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).runProfile(0xc0001e7b20, 0x1, 0xc000000004, 0x470b35, 0xc00044a678, 0x470bed, 0xc3c41c)
        	/home/circleci/dd-trace-go.v1/profiler/profile.go:258 +0x129
        gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect.func1(0xc000211540, 0xc0001e7b20, 0xc000211530, 0xc000143cc8, 0x1)
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:240 +0xc6
        created by gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:238 +0x23d
--- FAIL: TestStopLatency (2.30s)

The test timeout was bumped from 20ms to 200ms by #1282 after noticing this same failure, but even a 200ms timeout isn't enough. CPU profiling can take slightly over 200ms to stop for the following reason:

  • CPU profiling starts a goroutine that polls for CPU samples on a 100ms loop.
  • When profiling stops, one iteration of the loop happens to read the remaining samples.
  • The second iteration of the loop sleeps, then checks for samples and finds there aren't any, breaking out of the loop.

This can take 200ms + whatever time to serialize the CPU profile. I don't think #1282 broke the behavior so much as the test should have never passed in the first place.
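
For context, here's a minimal, self-contained sketch of that polling behavior. The names and channels are purely illustrative (this only mimics the shape of the runtime's profile-writer loop, not the actual runtime/pprof code), but it shows why stopping can take roughly two full 100ms iterations before the profile even gets serialized:

package main

import (
	"fmt"
	"time"
)

// pollSamples mimics the shape of the runtime's CPU profile writer loop: it
// sleeps 100ms per iteration and only exits once a post-stop iteration finds
// no samples left to read.
func pollSamples(stopped <-chan struct{}, drained chan<- struct{}) {
	remaining := 1 // pretend one batch of samples is still buffered when profiling stops
	for {
		time.Sleep(100 * time.Millisecond) // the runtime polls on a 100ms cadence
		select {
		case <-stopped:
			if remaining > 0 {
				remaining-- // first post-stop iteration reads the leftover samples
				continue
			}
			close(drained) // second post-stop iteration finds nothing and exits
			return
		default:
			// profiling is still running; keep polling
		}
	}
}

func main() {
	stopped := make(chan struct{})
	drained := make(chan struct{})
	go pollSamples(stopped, drained)

	start := time.Now()
	close(stopped) // analogous to calling pprof.StopCPUProfile
	<-drained
	fmt.Printf("stopping took %v (about two 100ms poll iterations)\n", time.Since(start))
}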

nsrip-dd added a commit that referenced this issue May 14, 2022
The test needs to be re-evaluated as it should never have passed in the
first place. See issue #1294.
@felixge
Member

felixge commented May 16, 2022

I don't think #1282 broke the behavior so much as the test should have never passed in the first place.

I'm actually having a hard time reproducing this test failure locally:

$ go test -run 'TestStopLatency' -count 100 -race ./profiler
ok  	gopkg.in/DataDog/dd-trace-go.v1/profiler	224.970s

Anyway, what about just increasing the timeout to e.g. 500ms? This test is mostly trying to prove that stopping a profiler that was just started takes (much) less time than the configured profiling period.

@nsrip-dd
Contributor Author

nsrip-dd commented May 16, 2022

I see a few more failures locally, and I even notice from your test output that the time per iteration is 224ms (nevermind, can't math in the morning). It's borderline for sure with a 200ms timeout. I'd be okay with increasing the timeout more as a fix. Maybe coupled with making the profiling period really really long? Like 1 hour or something. Then we know the stop latency is okay-ish if it's several orders of magnitude shorter than the profiling period.

nsrip-dd added a commit that referenced this issue May 16, 2022
nsrip-dd added a commit that referenced this issue May 16, 2022
The timeout for the test was not realistic given the constraints of the
Go runtime CPU profiler. Increase the timeout and add a second test case
meant to exercise the latency of both stopping the profiling and
stopping the profile upload.

Fixes #1294
felixge pushed a commit that referenced this issue May 24, 2022
@nsrip-dd
Contributor Author

Is there any reasonable bound we can place on how long stopping the upload should take? It surprises me in those failures that stopping takes up to a second.

@felixge
Member

felixge commented May 24, 2022

Is there any reasonable bound we can place on how long stopping the upload should take?

Maybe the upload test could be changed to not depend on timing at all? We could just test that the profiler.Stop() method returns before we send a response to the http request. That should be enough evidence that we're not waiting for the request to finish?
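
Something like the following, perhaps (a hypothetical test sketch, not the actual profiler test; the test name and the specific period/profile-type choices are made up for illustration, though the options used are the package's public API):

package profiler_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
	"time"

	"gopkg.in/DataDog/dd-trace-go.v1/profiler"
)

// Sketch: the fake agent never responds until the end of the test, so if
// Stop() returns at all, it cannot have been waiting for the upload to finish.
func TestStopDoesNotWaitForUpload(t *testing.T) {
	handlerDone := make(chan struct{})
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-handlerDone // hold every upload open until the test is over
	}))
	defer srv.Close()

	err := profiler.Start(
		profiler.WithAgentAddr(srv.Listener.Addr().String()),
		profiler.WithPeriod(10*time.Millisecond),
		profiler.WithProfileTypes(profiler.HeapProfile), // avoid the CPU profiler's own ~200ms stop latency
	)
	if err != nil {
		t.Fatal(err)
	}
	// Let at least one profiling period elapse so an upload is in flight.
	time.Sleep(50 * time.Millisecond)

	// If Stop waited for the in-flight request to complete, it would block
	// here until the test framework times out.
	profiler.Stop()

	close(handlerDone) // unblock the handler so srv.Close can return
}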

It surprises me in those failures that stopping takes up to a second.

Me too. I'm not even sure what the root cause is for CircleCI to behave so extremely. Are we running the tests very differently? Maybe every pkg gets tested concurrently, causing big CPU spikes? Or are we getting hit by noisy neighbors? Or both?

@nsrip-dd
Contributor Author

nsrip-dd commented May 24, 2022

Right, we could remove the time check for uploading and just rely on the channel that keeps the HTTP handler from returning until the end of the test. Then we know that if Stop returns, the upload was stopped. As for the slowness, it could be inconsistency from the CircleCI environment. I'm mainly hoping this isn't a regression caused by making the profile collection happen concurrently in #1282.

@nsrip-dd
Contributor Author

I do have a theory on the TestStopLatency/upload failure, which I actually see locally as well (mea culpa for not doing more test runs):

  • I notice the failing times are roughly multiples of 200ms, which makes me suspect the CPU profiler's involved
  • The test has a profiling period of 10ms. Pretty short.
  • profiler.collect runs a select statement in a loop. The two cases are the time.Ticker that fires every profiling period and the p.exit channel that signals cancellation.
  • The Go spec says that if multiple cases in a select statement are ready, one is chosen at random.
  • The ticker fires and gets chosen in the select statement. Profiling takes longer than 10ms, so the ticker is ready again.
  • p.exit is closed.
  • The select statement is hit again, and since the ticker is ready again, it can possibly be selected even though p.exit is also ready.
  • CPU profiling runs, gets immediately canceled since p.exit is closed, and takes ~200ms to stop.
  • Repeat a few times until p.exit wins the select statement (see the sketch below).
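
Here's a small, self-contained sketch of that pattern (illustrative names only, not the actual dd-trace-go code): when both a pending tick and the closed exit channel are ready, select picks one pseudo-randomly, so re-checking the exit channel before starting another round of work is what gives shutdown priority.

package main

import (
	"fmt"
	"time"
)

// collectLoop sketches a collection loop whose work outlasts its ticker
// period. Without the inner re-check of exit, a tick that was already
// pending when exit closed could win the outer select and trigger yet
// another slow collection.
func collectLoop(exit <-chan struct{}, work func()) {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-exit:
			return
		case <-ticker.C:
			// Give cancellation priority over a pending tick.
			select {
			case <-exit:
				return
			default:
			}
			work()
		}
	}
}

func main() {
	exit := make(chan struct{})
	done := make(chan struct{})
	go func() {
		collectLoop(exit, func() { time.Sleep(50 * time.Millisecond) }) // work takes longer than the 10ms period
		close(done)
	}()

	time.Sleep(30 * time.Millisecond) // let a collection start
	start := time.Now()
	close(exit)
	<-done
	fmt.Printf("shutdown took %v\n", time.Since(start))
}

Without that inner re-check, the loop can keep losing the race for several iterations after exit closes, which would line up with the failure times being rough multiples of 200ms.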

nsrip-dd added a commit that referenced this issue May 25, 2022
PR #1297 attempted to fix the flakiness noted in issue #1294 by creating
two separate tests: one which runs a long profile to test the latency of
stopping the profile, and another which runs short profiles and makes
uploading hang indefinitely. However, the upload test had such a short
profiling period that the inner select statement in (*profiler).collect
could take several iterations to actually cancel due to the Go runtime
randomly choosing select cases when multiple cases are ready.

In addition, the "stop-profiler" case didn't actually test what it was
intended to test since profiling doesn't actually run until one full
profiling period has elapsed. Since the period was set to an hour, Stop
was called without profiling having actually started.

Merge the two tests back into one. This brings us full-circle back to
the original test, but with a more generous window on how long stopping
should take and without relying on modifying internal functions to make
the test work.
@felixge
Member

felixge commented May 30, 2022

Great investigation. The channel selection theory makes sense to me.

nsrip-dd added a commit that referenced this issue May 31, 2022
@felixge
Member

felixge commented Jun 1, 2022

#1307 should have fixed this, closing.

felixge closed this as completed Jun 1, 2022