zapcore: Unflake TestSamplerConcurrent #1012

abhinav · 2021-09-10T20:16:12Z

The TestSamplerConcurrent test frequently fails with the following error
in CI:

--- FAIL: TestSamplerConcurrent (0.25s)
    sampler_test.go:198:
    	    Error Trace:	sampler_test.go:198
    	    Error:      	Max difference between 1250 and 1004 allowed is 125, but difference was 246
    	    Test:       	TestSamplerConcurrent
    	    Messages:   	Unexpected number of logs
FAIL

The test is intended to verify that
despite an onslaught of messages from multiple goroutines,
we only allow at most logsPerTick messages per tick.

This was accomplished by spin-looping 10 goroutines for numTicks,
each logging one of numMessages different messages,
and then verifying the final log count.

The flakiness came from non-determinism in
how many messages each goroutine managed to log within a tick,
and so in whether it logged enough for the sampler to engage.

In #948, we added a zapcore.Clock interface with a ticker and
a mock implementation.
Move that to ztest for use here.

To unflake the test, use the mock clock to control time and
for each goroutine, log logsPerTick*2 messages numTicks times.
This gives us,

for numGoroutines (10)
    for numTicks (25)
        log logsPerTick * 2 (20) messages

We end up attempting to log a total of,

(numGoroutines * numTicks * logsPerTick * 2) messages
= (10 * 25 * 20) messages
= 5000 messages

Of these, the following should be sampled:

numMessages * logsPerTick * numTicks
= 5 * 10 * 25
= 1250

Everything else should be dropped.

For extra confidence, use a SamplerHook (added in #813) to verify that
the numbers of sampled and dropped messages meet expectations.
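
The counting argument above can be checked with a small, stdlib-only sketch. This is not zap's sampler or the actual test: `toySampler`, `run`, and the explicit `Tick` call are hypothetical stand-ins, and `logsPerTick = 10` is an assumption inferred from 1250 = 5 * 10 * 25 with numTicks = 25. For simplicity it starts a fresh batch of goroutines per tick instead of keeping long-lived goroutines, and it advances "time" only after every goroutine has finished its burst, which is the property the mock clock is meant to provide.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	logsPerTick   = 10 // assumption: inferred from 1250 = 5 * 10 * 25
	numMessages   = 5
	numTicks      = 25
	numGoroutines = 10

	numLogAttempts  = numGoroutines * numTicks * logsPerTick * 2
	expectedSampled = numMessages * logsPerTick * numTicks
	expectedDropped = numLogAttempts - expectedSampled
)

// toySampler is a hypothetical stand-in, not zap's implementation.
// Per tick, it admits the first `first` entries of each distinct message
// and drops the rest, counting both outcomes much as a hook would.
type toySampler struct {
	mu      sync.Mutex
	first   int
	counts  map[string]int
	sampled int
	dropped int
}

func newToySampler(first int) *toySampler {
	return &toySampler{first: first, counts: make(map[string]int)}
}

// Log records one attempt and reports whether it was admitted.
func (s *toySampler) Log(msg string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[msg]++
	if s.counts[msg] <= s.first {
		s.sampled++
		return true
	}
	s.dropped++
	return false
}

// Tick starts a new sampling window. The caller controls time explicitly.
func (s *toySampler) Tick() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts = make(map[string]int)
}

// run drives the sampler and returns the sampled and dropped counts.
func run() (sampled, dropped int) {
	s := newToySampler(logsPerTick)
	for t := 0; t < numTicks; t++ {
		var wg sync.WaitGroup
		for g := 0; g < numGoroutines; g++ {
			wg.Add(1)
			go func(g int) {
				defer wg.Done()
				msg := fmt.Sprintf("test message %d", g%numMessages)
				for k := 0; k < logsPerTick*2; k++ {
					s.Log(msg)
				}
			}(g)
		}
		wg.Wait() // every goroutine has finished this tick's burst
		s.Tick()  // only now does "time" advance
	}
	return s.sampled, s.dropped
}

func main() {
	sampled, dropped := run()
	fmt.Println(sampled, dropped) // prints "1250 3750"
}
```

Because the window only changes after `wg.Wait()`, the counts do not depend on goroutine scheduling, so the result can be asserted exactly rather than within a tolerance.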

Refs GO-873

codecov · 2021-09-10T20:17:14Z

Codecov Report

Merging #1012 (f0f2f30) into master (a0e2380) will increase coverage by 0.10%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1012      +/-   ##
==========================================
+ Coverage   98.10%   98.20%   +0.10%     
==========================================
  Files          46       47       +1     
  Lines        2058     2062       +4     
==========================================
+ Hits         2019     2025       +6     
+ Misses         30       29       -1     
+ Partials        9        8       -1

Impacted Files	Coverage Δ
zapcore/clock.go	`100.00% <ø> (ø)`
internal/ztest/clock.go	`100.00% <100.00%> (ø)`
zapcore/sampler.go	`100.00% <0.00%> (+3.77%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

We need the controlled clock in other tests, so move it from clock_test to the ztest package and rename it to MockClock. To keep MockClock's interface clear, don't embed benbjohnson/clock; hold it as a field instead.

sywhang

👍 LGTM

abhinav added 2 commits September 10, 2021 13:19

abhinav force-pushed the abg/sampler-flaky branch from 77c90b4 to f0f2f30 Compare September 10, 2021 20:20

abhinav requested review from moisesvega, shirchen and sywhang September 10, 2021 20:26

sywhang approved these changes Sep 10, 2021

abhinav merged commit 10d89a7 into master Sep 10, 2021

abhinav deleted the abg/sampler-flaky branch September 10, 2021 22:49
