Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zapcore: Unflake TestSamplerConcurrent #1012

Merged
merged 2 commits into from Sep 10, 2021
Merged

zapcore: Unflake TestSamplerConcurrent #1012

merged 2 commits into from Sep 10, 2021

Conversation

abhinav
Copy link
Collaborator

@abhinav abhinav commented Sep 10, 2021

The TestSamplerConcurrent test frequently fails with the following error
in CI:

--- FAIL: TestSamplerConcurrent (0.25s)
    sampler_test.go:198:
    	    Error Trace:	sampler_test.go:198
    	    Error:      	Max difference between 1250 and 1004 allowed is 125, but difference was 246
    	    Test:       	TestSamplerConcurrent
    	    Messages:   	Unexpected number of logs
FAIL

The test is intended to verify that
despite an onsalught of messages from multiple goroutines,
we only allow at most logsPerTick messages per tick.

This was accompilshed by spin-looping 10 goroutines for numTicks,
each logging one of numMessages different messages,
and then verifying the final log count.

The source of flakiness here was the non-determinism in
how far a goroutine would get in logging enough messages such that
the sampler would be engaged.

In #948, we added a zapcore.Clock interface with a ticker and
a mock implementation.
Move that to ztest for use here.

To unflake the test, use the mock clock to control time and
for each goroutine, log logsPerTick*2 messages numTicks times.
This gives us,

for numGoroutines (10)
    for numTicks (25)
        log logsPerTick * 2 (50) messages

We end up attempting to log a total of,

(numGoroutines * numTicks * logsPerTick * 2) messages
= (10 * 25 * 50) messages
= 12500 messages

Of these, the following should be sampled:

numMessages * numTicks * logsPerTick
= 5 * 10 * 25
= 1250

Everything else should be dropped.

For extra confidence, use a SamplerHook (added in #813) to verify that
the number of sampled and dropped messages meet expectations.

Refs GO-873

@codecov
Copy link

codecov bot commented Sep 10, 2021

Codecov Report

Merging #1012 (f0f2f30) into master (a0e2380) will increase coverage by 0.10%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1012      +/-   ##
==========================================
+ Coverage   98.10%   98.20%   +0.10%     
==========================================
  Files          46       47       +1     
  Lines        2058     2062       +4     
==========================================
+ Hits         2019     2025       +6     
+ Misses         30       29       -1     
+ Partials        9        8       -1     
Impacted Files Coverage Δ
zapcore/clock.go 100.00% <ø> (ø)
internal/ztest/clock.go 100.00% <100.00%> (ø)
zapcore/sampler.go 100.00% <0.00%> (+3.77%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a0e2380...f0f2f30. Read the comment docs.

We need to be able to use the controlled clock for some other tests so
move it from clock_test to the ztest package and rename it to MockClock.

To keep the interface for the MockClock clear,
don't embed the benbjohnson/clock and instead,
use it as an attribute.
The TestSamplerConcurrent test frequently fails with the following error
in CI:

    --- FAIL: TestSamplerConcurrent (0.25s)
        sampler_test.go:198:
        	    Error Trace:	sampler_test.go:198
        	    Error:      	Max difference between 1250 and 1004 allowed is 125, but difference was 246
        	    Test:       	TestSamplerConcurrent
        	    Messages:   	Unexpected number of logs
    FAIL

The test is intended to verify that
despite an onsalught of messages from multiple goroutines,
we only allow at most `logsPerTick` messages per `tick`.

This was accompilshed by spin-looping 10 goroutines for `numTicks`,
each logging one of `numMessages` different messages,
and then verifying the final log count.

The source of flakiness here was the non-determinism in
how far a goroutine would get in logging enough messages such that
the sampler would be engaged.

In #948, we added a `zapcore.Clock` interface with a ticker and
a mock implementation.
Move that to `ztest` for use here.

To unflake the test, use the mock clock to control time and
for each goroutine, log `logsPerTick*2` messages `numTicks` times.
This gives us,

    for numGoroutines (10)
        for numTicks (25)
            log logsPerTick * 2 (50) messages

We end up attempting to log a total of,

    (numGoroutines * numTicks * logsPerTick * 2) messages
    = (10 * 25 * 50) messages
    = 12500 messages

Of these, the following should be sampled:

    numMessages * numTicks * logsPerTick
    = 5 * 10 * 25
    = 1250

Everything else should be dropped.

For extra confidence, use a SamplerHook (added in #813) to verify that
the number of sampled and dropped messages meet expectations.
Copy link
Contributor

@sywhang sywhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM

@abhinav abhinav merged commit 10d89a7 into master Sep 10, 2021
@abhinav abhinav deleted the abg/sampler-flaky branch September 10, 2021 22:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants