Flaky Test Tracker #9492

russjones · 2021-12-19T18:30:19Z

Investigating

Process

Start with the assumption the test is correct and is highlighting a bug in Teleport.
Run multiple parallel unit or integration tests to reproduce.
Attempt to fix the test.
Propose quarantine.

Unit Tests

Frequently fails. github.com/gravitational/teleport/lib/service.TestTeleportProcess_reconnectToAuth
Frequently fails. github.com/gravitational/teleport/lib/srv/regular.TestClientDisconnect
Frequently fails. github.com/gravitational/teleport/lib/cache.TestCache_Backoff
~~github.com/gravitational/teleport/lib/srv/regular.TestProxyReverseTunnel~~
~~github.com/gravitational/teleport/lib/auth.TestAPILockedOut~~
github.com/gravitational/teleport/lib/auth.TestAPI
Often fails locally. github.com/gravitational/teleport/lib/auth.TestTiming (also reported in Test flakes #4653)

Integration

Frequently fails. TestIntegrations/TwoClustersTunnel (see Tunnel auth clients appear to become stuck in bad state on restart #9655)
TestIntegrations/Disconnection
TestHSMDualAuthRotation
TestHSMMigrate
TestIntegrations/MultiplexingTrustedClusters
~~TestIntegrations/RotateTrustedClusters~~

Metrics

Trailing 7-day pass rate for unit and integration tests.

Week of January 10th. Unit 67%, Integration 56%

Proposed for Quarantine

This section is for tests that provide business value but are inherently flaky due to a dependence on time and an external resource (like CPU or network). For example, a test that waits for an event to occur and times out if the event does not occur after some time.

Quarantined tests will be triaged by @russjones weekly and potentially serialized and put into a retry loop.

Flakey password verification test #9491
lib/auth.PasswordSuite.TestTiming requires exists/not exists tests to be within 10% of each other

Fixed


fspmarshall · 2021-12-20T19:50:19Z

Found a race in TwoClustersTunnel which I believe is the cause of this error:

failed connecting to node localhost. remote cluster "site-A" is not found

I'm not sure if this is the only problem with TwoClustersTunnel, but I'll push a fix and we can watch to see if its flakiness goes down.

edit: See #9506

tcsc · 2021-12-20T22:35:28Z

> I'm not sure if this is the only problem with TwoClustersTunnel

I see these quite often, too. May be related to the above.

integration_test.go:1510: 
Error Trace:	integration_test.go:1510
                        integration_test.go:1388
Error:      	Received unexpected error:
       	            	connection error: desc = "transport: Error while dialing failed to dial: failed connecting to node . invalid format for proxy request: unknown cluster \"site-A\"\n"

integration_test.go:1528: 
Error Trace:	integration_test.go:1528
                        integration_test.go:1388
Error:      	Condition never satisfied
Test:       	TestIntegrations/TwoClustersTunnel/proxy
Messages:   	Timed out waiting for Site A to restart

zmb3 · 2021-12-21T16:37:11Z

Is this a duplicate of #4653? Should we combine them?

russjones · 2021-12-22T01:46:09Z

@zmb3 Yeah, I think we should close that one and merge things into this one.

zmb3 · 2021-12-29T19:32:21Z

TwoClustersTunnel often fails for me with:

failed connecting to node localhost. database is closed

Occasionally I get a similar error that says cache is closed instead of database.

Interestingly:

I can only reproduce when I run with the race detector enabled.
If I remove the section of the test that stops site A and restarts it, I see completely different errors

At some point in the tests, I start seeing tons of Uploader scan failed errors. I have a feeling the vast majority of this code is failing to clean up properly and is removing directories while they are still in use.

rosstimothy · 2022-01-04T16:24:22Z

Opened #9516 to address:

github.com/gravitational/teleport/lib/service.TestTeleportProcess_reconnectToAuth
github.com/gravitational/teleport/lib/service.TestResourceWatcher_Backoff
github.com/gravitational/teleport/lib/cache.TestCache_Backoff

russjones · 2022-01-05T19:03:12Z

For TestIntegrations/TwoClustersTunnel: #9655

russjones · 2022-01-31T17:47:06Z

A good way to reproduce an issue is to run the test in a while loop. You might also want to run this command in multiple terminal windows.

Example command:

$ while go test . -run TestAccessMongoDB -count=1 -race; do :; done

tcsc · 2022-01-31T23:30:35Z

I have a similar script I use called untilfail:

#!/bin/bash
# Re-run the given command until it fails, then report the total number of runs.

COUNT=1
while "$@"; do COUNT=$((COUNT + 1)); done

echo "Ran $COUNT times"

Then I can pass it an arbitrary command line:

$ untilfail go test ./integration -run TestSomethingOrOther -race

zmb3 · 2022-02-04T02:01:27Z

I've started seeing TestSSHConfigConnectWithOpenSSHClient failures. Not sure if it's just my environment or not, but figured I'd put a note here. Spent 30 minutes or so debugging and I don't know what's going on.

zmb@localhost: Permission denied (publickey).
ERROR: exit status 255

Edit: It appears that even though this test passes a config file with ssh -F, openssh is still looking at my local ~/.ssh/config file. This means that this test depends on local system state.

zmb3 · 2022-02-09T16:08:33Z

Here's a new one that I ran into on GCB. Can't seem to repro locally.

OUTPUT github.com/gravitational/teleport/lib/utils.TestFnCacheSanity
===================================================
=== RUN   TestFnCacheSanity
    fncache_test.go:145: 
        	Error Trace:	fncache_test.go:145
        	            				fncache_test.go:71
        	Error:      	Max difference between 10.461059975 and 9 allowed is 1, but difference was 1.4610599749999995
        	Test:       	TestFnCacheSanity
        	Messages:   	ttl=40ms, delay=0s, desc="non-blocking"
--- FAIL: TestFnCacheSanity (1.53s)
===================================================

Convert approxReads to an integer (by truncating) before comparing to actual reads. This should prevent failures where due to our approximation, we estimate a fractional number of reads that exceed our tolerance of 1. Sample error: Max difference between 10.461059975 and 9 allowed is 1, but difference was 1.4610599749999995 Updates #9492

Increase tolerance on expected reads. This should prevent failures where due to our approximation, we estimate a fractional number of reads that exceed our tolerance of 1. Sample error: Max difference between 10.461059975 and 9 allowed is 1, but difference was 1.4610599749999995 Updates #9492

zmb3 · 2022-03-14T20:57:58Z

@r0mant have you seen this one before: TestDatabaseResource?

2022-03-14T20:24:44Z ERRO [AUTH:2]    PID: 55274 Failed to bind to address 127.0.0.1:32963: listen tcp 127.0.0.1:32963: bind: address already in use, exiting. service/service.go:1330
    helpers_test.go:145: 
        	Error Trace:	helpers_test.go:145
        	            				resource_command_test.go:138
        	Error:      	Received unexpected error:
        	            	listen tcp 127.0.0.1:32963: bind: address already in use
        	Test:       	TestDatabaseResource
--- FAIL: TestDatabaseResource (3.35s)

zmb3 · 2022-03-16T18:36:31Z

Another one: TestDatabaseRootLeafIdleTimeout (cc @smallinsky)

I've seen both of the subtests fail with different errors.

    --- FAIL: TestDatabaseRootLeafIdleTimeout/leaf_role_with_idle_timeout
        db_integration_test.go:371: 
        	Error Trace:	db_integration_test.go:371
        	Error:      	An error is expected but got nil.
        	Test:       	TestDatabaseRootLeafIdleTimeout/leaf_role_with_idle_timeout

    --- FAIL: TestDatabaseRootLeafIdleTimeout/root_role_with_idle_timeout (1.21s)
        db_integration_test.go:613: event type "client.disconnect" not found after 1s
FAIL
FAIL    github.com/gravitational/teleport/integration   5.218s
FAIL

Took me 23 runs to reproduce locally, but it does eventually fail.

nklaassen · 2022-07-14T16:22:37Z

TestBot_Run_CARotation: #14471

nklaassen · 2022-07-15T19:16:48Z

TestFnCacheSanity: #14534

nklaassen · 2022-07-16T00:16:44Z

TestAgentStart: #14553

nklaassen · 2022-07-16T00:48:21Z

TestNormalOperation: #14554

nklaassen · 2022-07-16T01:29:45Z

TestFnCacheCancellation: #14556

ravicious · 2022-07-21T08:45:39Z

TestTokens: #14737 (tool/tctl/common/token_command_test.go)

zmb3 · 2022-07-25T15:24:54Z

TestSemaphoreLock: #14842

rosstimothy · 2022-07-28T04:05:51Z

Data Race in TestTokenGeneration: #14979

zmb3 · 2022-07-29T19:34:58Z

TestIntegrations/ListResourcesAcrossClusters: #15051

zmb3 · 2022-08-01T19:28:09Z

TestIntegrations/SessionRecordingModes/StrictMode: #15097

nklaassen · 2022-08-12T16:52:20Z

failed to parse existing kubeconfig tsh test flakiness #15490

ibeckermayer · 2022-08-14T21:00:25Z

TestMain (sshserver_test.go): #15520

GavinFrazar · 2022-08-18T17:41:52Z

TestAllowedUsers: #15656

ibeckermayer · 2022-08-24T19:05:53Z

TestRoleDeletionDrift: #15815

Joerger · 2022-08-24T19:44:12Z

TestWebSessionsRenewDoesNotBreakExistingTerminalSession: #15816

ibeckermayer · 2022-08-25T15:00:40Z

tbot race condition: #15843

espadolini · 2022-08-29T11:05:54Z

~~TestPingConnection~~: #15896

strideynet · 2022-09-01T09:03:34Z

TestTeleportClient_Login_local/OTP_device_login_with_hijack: #16037

GavinFrazar · 2022-09-03T10:44:27Z

TestHandlerConnectionUpgrade data race: #16128

ibeckermayer · 2022-09-05T20:45:28Z

TestGlobalRequestRecordingProxy: #16138

ibeckermayer · 2022-09-07T15:27:03Z

TestAppAccess/TestAppInvalidateAppSessionsOnLogout: #16200

nklaassen · 2022-09-30T19:10:43Z

TestRootScript: #16908

tigrato · 2022-10-10T17:19:34Z

TestIntegrations/AuditOn: #17224

nklaassen · 2022-10-18T17:28:29Z

TestDatabaseAccess/AgentState: #17543

nklaassen · 2022-10-28T16:38:39Z

TestWebAgentForward: #17918

nklaassen · 2022-10-28T17:17:18Z

TestIntegrations/TrustedTunnelNode: #17824

ibeckermayer · 2022-10-28T18:40:46Z

FTR, it's no longer expected that this issue be updated with each new flaky test. To easily see/search all the flaky tests, bookmark this url.

russjones added the bug label Dec 19, 2021

russjones self-assigned this Dec 19, 2021

russjones added flaky tests and removed bug labels Dec 19, 2021

russjones changed the title ~~Flakey Test Tracker~~ Flaky Test Tracker Dec 19, 2021

lxea mentioned this issue Dec 21, 2021

Use require.Eventually to avoid flakiness in TestAPILockedOut #9513

Merged

zmb3 mentioned this issue Dec 28, 2021

Flaky tests #4460

Closed

zmb3 mentioned this issue Feb 9, 2022

Attempt to deflake TestFnCacheSanity #10250

Merged

jimbishopp mentioned this issue Feb 15, 2022

Fix Flaky TestProcessKubeCSR #10355

Merged

jimbishopp mentioned this issue Feb 15, 2022

Add TestModules #10369

Merged

greedy52 mentioned this issue Mar 4, 2022

fix flaky integration test: TestDatabaseAccessMongoConnectionCount #10869

Merged

strideynet mentioned this issue Jul 29, 2022

Test flakes #4653

Closed

15 tasks

ibeckermayer mentioned this issue Sep 12, 2022

TestDatabaseRootLeafIdleTimeout/leaf_role_with_idle_timeout flakiness #16347

Closed

ibeckermayer closed this as completed Oct 28, 2022
