profiler: TestStopLatency fails on nightly race detector tests #1294

Closed
nsrip-dd opened this issue May 14, 2022 · 9 comments · Fixed by #1297

@nsrip-dd
Contributor

Example failure:

=== RUN   TestStopLatency
    profiler_test.go:226: Stop() took longer than 200ms:
        goroutine 10284 [running]:
        gopkg.in/DataDog/dd-trace-go.v1/profiler.TestStopLatency(0xc000468180)
        	/home/circleci/dd-trace-go.v1/profiler/profiler_test.go:225 +0x41d
        testing.tRunner(0xc000468180, 0x9f1638)
        	/usr/local/go/src/testing/testing.go:1193 +0x203
        created by testing.(*T).Run
        	/usr/local/go/src/testing/testing.go:1238 +0x5d8
        
        goroutine 1 [chan receive]:
        testing.(*T).Run(0xc000102c00, 0x9d6ccc, 0xf, 0x9f1638, 0x1)
        	/usr/local/go/src/testing/testing.go:1239 +0x610
        testing.runTests.func1(0xc000102c00)
        	/usr/local/go/src/testing/testing.go:1511 +0xa7
        testing.tRunner(0xc000102c00, 0xc00019bbe8)
        	/usr/local/go/src/testing/testing.go:1193 +0x203
        testing.runTests(0xc000142180, 0xc812e0, 0x18, 0x18, 0xc097daf60aa1cf82, 0x8bb3491687, 0xc87380, 0x1)
        	/usr/local/go/src/testing/testing.go:1509 +0x613
        testing.(*M).Run(0xc00018eb80, 0x0)
        	/usr/local/go/src/testing/testing.go:1417 +0x3b4
        gopkg.in/DataDog/dd-trace-go.v1/profiler.TestMain(0xc00018eb80)
        	/home/circleci/dd-trace-go.v1/profiler/profiler_test.go:35 +0x55
        main.main()
        	_testmain.go:153 +0x372
        
        goroutine 10285 [semacquire]:
        sync.runtime_Semacquire(0xc000211548)
        	/usr/local/go/src/runtime/sema.go:56 +0x45
        sync.(*WaitGroup).Wait(0xc000211540)
        	/usr/local/go/src/sync/waitgroup.go:130 +0xe5
        gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect(0xc0001e7b20, 0xc000361440)
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:250 +0x288
        gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run.func1(0xc0001e7b20)
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:202 +0x1dd
        created by gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).run
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:197 +0x194
        
        goroutine 10306 [semacquire]:
        sync.runtime_Semacquire(0xc0001e7b54)
        	/usr/local/go/src/runtime/sema.go:56 +0x45
        sync.(*WaitGroup).Wait(0xc0001e7b54)
        	/usr/local/go/src/sync/waitgroup.go:130 +0xe5
        gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).stop(0xc0001e7b20)
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:360 +0xa8
        gopkg.in/DataDog/dd-trace-go.v1/profiler.TestStopLatency.func2(0xc0001e7b20, 0xc000024420)
        	/home/circleci/dd-trace-go.v1/profiler/profiler_test.go:212 +0x39
        created by gopkg.in/DataDog/dd-trace-go.v1/profiler.TestStopLatency
        	/home/circleci/dd-trace-go.v1/profiler/profiler_test.go:211 +0x2e9
        
        goroutine 10222 [runnable]:
        compress/flate.(*compressor).init(0xc0007ba000, 0xa63d00, 0xc000402000, 0x1, 0x1, 0xa)
        	/usr/local/go/src/compress/flate/deflate.go:578 +0x132b
        compress/flate.NewWriter(0xa63d00, 0xc000402000, 0x1, 0xa, 0xa, 0x0)
        	/usr/local/go/src/compress/flate/deflate.go:671 +0xa5
        compress/gzip.(*Writer).Write(0xc0003c00b0, 0xc0001f1500, 0x5ff, 0x700, 0x40, 0x0, 0xc0002f44f0)
        	/usr/local/go/src/compress/gzip/gzip.go:191 +0x6b6
        runtime/pprof.(*profileBuilder).build(0xc0003c2160)
        	/usr/local/go/src/runtime/pprof/proto.go:379 +0x6c5
        runtime/pprof.profileWriter(0xa63d00, 0xc000402000)
        	/usr/local/go/src/runtime/pprof/pprof.go:813 +0x11b
        created by runtime/pprof.StartCPUProfile
        	/usr/local/go/src/runtime/pprof/pprof.go:784 +0x18b
        
        goroutine 10219 [chan receive]:
        runtime/pprof.StopCPUProfile()
        	/usr/local/go/src/runtime/pprof/pprof.go:829 +0xde
        gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).stopCPUProfile(0xc0001e7b20)
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:99 +0xcc
        gopkg.in/DataDog/dd-trace-go.v1/profiler.glob..func1(0xc0001e7b20, 0x1, 0x9d2eae, 0x3, 0x9d47c1, 0x9)
        	/home/circleci/dd-trace-go.v1/profiler/profile.go:93 +0x1c5
        gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).runProfile(0xc0001e7b20, 0x1, 0xc000000004, 0x470b35, 0xc00044a678, 0x470bed, 0xc3c41c)
        	/home/circleci/dd-trace-go.v1/profiler/profile.go:258 +0x129
        gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect.func1(0xc000211540, 0xc0001e7b20, 0xc000211530, 0xc000143cc8, 0x1)
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:240 +0xc6
        created by gopkg.in/DataDog/dd-trace-go.v1/profiler.(*profiler).collect
        	/home/circleci/dd-trace-go.v1/profiler/profiler.go:238 +0x23d
--- FAIL: TestStopLatency (2.30s)

The test timeout was bumped from 20ms to 200ms by #1282 after noticing this same failure, but even a 200ms timeout isn't enough. CPU profiling can take slightly over 200ms to stop for the following reason:

  • CPU profiling starts a goroutine that polls for CPU samples on a 100ms loop.
  • When profiling stops, one iteration of the loop happens to read the remaining samples.
  • The second iteration of the loop sleeps, then checks for samples and finds there aren't any, breaking out of the loop.

This can take 200ms + whatever time to serialize the CPU profile. I don't think #1282 broke the behavior so much as the test should have never passed in the first place.
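
For context, here's a minimal, self-contained sketch of that polling behavior. The names and channels are purely illustrative (this only mimics the shape of the runtime's profile-writer loop, not the actual runtime/pprof code), but it shows why stopping can take roughly two full 100ms iterations before the profile even gets serialized:

package main

import (
	"fmt"
	"time"
)

// pollSamples mimics the shape of the runtime's CPU profile writer loop: it
// sleeps 100ms per iteration and only exits once a post-stop iteration finds
// no samples left to read.
func pollSamples(stopped <-chan struct{}, drained chan<- struct{}) {
	remaining := 1 // pretend one batch of samples is still buffered when profiling stops
	for {
		time.Sleep(100 * time.Millisecond) // the runtime polls on a 100ms cadence
		select {
		case <-stopped:
			if remaining > 0 {
				remaining-- // first post-stop iteration reads the leftover samples
				continue
			}
			close(drained) // second post-stop iteration finds nothing and exits
			return
		default:
			// profiling is still running; keep polling
		}
	}
}

func main() {
	stopped := make(chan struct{})
	drained := make(chan struct{})
	go pollSamples(stopped, drained)

	start := time.Now()
	close(stopped) // analogous to calling pprof.StopCPUProfile
	<-drained
	fmt.Printf("stopping took %v (about two 100ms poll iterations)\n", time.Since(start))
}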

nsrip-dd added a commit that referenced this issue May 14, 2022
The test needs to be re-evaluated as it should never have passed in the
first place. See issue #1294.
@felixge
Member

felixge commented May 16, 2022

I don't think #1282 broke the behavior so much as the test should have never passed in the first place.

I'm actually having a hard time reproducing this test failure locally:

$ go test -run 'TestStopLatency' -count 100 -race ./profiler
ok  	gopkg.in/DataDog/dd-trace-go.v1/profiler	224.970s

Anyway, what about just increasing the timeout to e.g. 500ms? This test is mostly trying to prove that stopping a profiler that was just started takes (much) less time than the configured profiling period.

@nsrip-dd
Contributor Author

nsrip-dd commented May 16, 2022

I see a few more failures locally, and I even notice from your test output that the time per iteration is 224ms (nevermind, can't math in the morning). It's borderline for sure with a 200ms timeout. I'd be okay with increasing the timeout more as a fix. Maybe coupled with making the profiling period really really long? Like 1 hour or something. Then we know the stop latency is okay-ish if it's several orders of magnitude shorter than the profiling period.

nsrip-dd added a commit that referenced this issue May 16, 2022
nsrip-dd added a commit that referenced this issue May 16, 2022
The timeout for the test was not realistic given the constraints of the
Go runtime CPU profiler. Increase the timeout and add a second test case
meant to exercise the latency of both stopping the profiling and
stopping the profile upload.

Fixes #1294
felixge pushed a commit that referenced this issue May 24, 2022
@nsrip-dd
Contributor Author

Is there any reasonable bound we can place on how long stopping the upload should take? It surprises me in those failures that stopping takes up to a second.

@felixge
Member

felixge commented May 24, 2022

Is there any reasonable bound we can place on how long stopping the upload should take?

Maybe the upload test could be changed to not depend on timing at all? We could just test that the profiler.Stop() method returns before we send a response to the http request. That should be enough evidence that we're not waiting for the request to finish?
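
Something like the following, perhaps (a hypothetical test sketch, not the actual profiler test; the test name and the specific period/profile-type choices are made up for illustration, though the options used are the package's public API):

package profiler_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
	"time"

	"gopkg.in/DataDog/dd-trace-go.v1/profiler"
)

// Sketch: the fake agent never responds until the end of the test, so if
// Stop() returns at all, it cannot have been waiting for the upload to finish.
func TestStopDoesNotWaitForUpload(t *testing.T) {
	handlerDone := make(chan struct{})
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-handlerDone // hold every upload open until the test is over
	}))
	defer srv.Close()

	err := profiler.Start(
		profiler.WithAgentAddr(srv.Listener.Addr().String()),
		profiler.WithPeriod(10*time.Millisecond),
		profiler.WithProfileTypes(profiler.HeapProfile), // avoid the CPU profiler's own ~200ms stop latency
	)
	if err != nil {
		t.Fatal(err)
	}
	// Let at least one profiling period elapse so an upload is in flight.
	time.Sleep(50 * time.Millisecond)

	// If Stop waited for the in-flight request to complete, it would block
	// here until the test framework times out.
	profiler.Stop()

	close(handlerDone) // unblock the handler so srv.Close can return
}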

It surprises me in those failures that stopping takes up to a second.

Me too. I'm not even sure what the root cause is for CircleCI to behave so extremely. Are we running the tests very differently? Maybe every pkg gets tested concurrently, causing big CPU spikes? Or are we getting hit by noisy neighbors? Or both?

@nsrip-dd
Contributor Author

nsrip-dd commented May 24, 2022

Right, we could remove the time check for uploading and just rely on the channel that keeps the HTTP handler from returning until the end of the test. Then we know that if Stop returns, the upload was stopped. As for the slowness, it could be inconsistency from the CircleCI environment. I'm mainly hoping this isn't a regression caused by making the profile collection happen concurrently in #1282.

@nsrip-dd
Contributor Author

I do have a theory on the TestStopLatency/upload failure, which I actually see locally as well (mea culpa for not doing more test runs):

  • I notice the failing times are roughly multiples of 200ms, which makes me suspect the CPU profiler's involved
  • The test has a profiling period of 10ms. Pretty short.
  • profiler.collect runs a select statement in a loop. The two cases are the time.Ticker that fires every profiling period and the p.exit channel that signals cancellation.
  • The Go spec says that if multiple cases in a select statement are ready, one is chosen at random.
  • The ticker fires and gets chosen in the select statement. Profiling takes longer than 10ms, so the ticker is ready again.
  • p.exit is closed.
  • The select statement is hit again, and since the ticker is ready again, it can possibly be selected even though p.exit is also ready.
  • CPU profiling runs, gets immediately canceled since p.exit is closed, and takes ~200ms to stop.
  • Repeat a few times until p.exit wins the select statement (see the sketch below).
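
Here's a small, self-contained sketch of that pattern (illustrative names only, not the actual dd-trace-go code): when both a pending tick and the closed exit channel are ready, select picks one pseudo-randomly, so re-checking the exit channel before starting another round of work is what gives shutdown priority.

package main

import (
	"fmt"
	"time"
)

// collectLoop sketches a collection loop whose work outlasts its ticker
// period. Without the inner re-check of exit, a tick that was already
// pending when exit closed could win the outer select and trigger yet
// another slow collection.
func collectLoop(exit <-chan struct{}, work func()) {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-exit:
			return
		case <-ticker.C:
			// Give cancellation priority over a pending tick.
			select {
			case <-exit:
				return
			default:
			}
			work()
		}
	}
}

func main() {
	exit := make(chan struct{})
	done := make(chan struct{})
	go func() {
		collectLoop(exit, func() { time.Sleep(50 * time.Millisecond) }) // work takes longer than the 10ms period
		close(done)
	}()

	time.Sleep(30 * time.Millisecond) // let a collection start
	start := time.Now()
	close(exit)
	<-done
	fmt.Printf("shutdown took %v\n", time.Since(start))
}

Without that inner re-check, the loop can keep losing the race for several iterations after exit closes, which would line up with the failure times being rough multiples of 200ms.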

nsrip-dd added a commit that referenced this issue May 25, 2022
PR #1297 attempted to fix the flakiness noted in issue #1294 by creating
two separate tests: one which runs a long profile to test the latency of
stopping the profile, and another which runs short profiles and makes
uploading hang indefinitely. However, the upload test had such a short
profiling period that the inner select statement in (*profiler).collect
could take several iterations to actually cancel due to the Go runtime
randomly choosing select cases when multiple cases are ready.

In addition, the "stop-profiler" case didn't actually test what it was
intended to test since profiling doesn't actually run until one full
profiling period has elapsed. Since the period was set to an hour, Stop
was called without profiling having actually started.

Merge the two tests back into one. This brings us full-circle back to
the original test, but with a more generous window on how long stopping
should take and without relying on modifying internal functions to make
the test work.
@felixge
Member

felixge commented May 30, 2022

Great investigation. The channel selection theory makes sense to me.

nsrip-dd added a commit that referenced this issue May 31, 2022
@felixge
Member

felixge commented Jun 1, 2022

#1307 should have fixed this, closing.

felixge closed this as completed Jun 1, 2022