Flaky Test Tracker #9492

Closed
russjones opened this issue Dec 19, 2021 · 83 comments

@russjones
Contributor

russjones commented Dec 19, 2021

Investigating

Process

  1. Start with the assumption the test is correct and is highlighting a bug in Teleport.
  2. Run the unit or integration test many times, in parallel, to reproduce the failure.
  3. Attempt to fix the test.
  4. Propose quarantine.

Unit Tests

  • Frequently fails. github.com/gravitational/teleport/lib/service.TestTeleportProcess_reconnectToAuth
  • Frequently fails. github.com/gravitational/teleport/lib/srv/regular.TestClientDisconnect
  • Frequently fails. github.com/gravitational/teleport/lib/cache.TestCache_Backoff
  • github.com/gravitational/teleport/lib/srv/regular.TestProxyReverseTunnel
  • github.com/gravitational/teleport/lib/auth.TestAPILockedOut
  • github.com/gravitational/teleport/lib/auth.TestAPI
  • Often fails locally. github.com/gravitational/teleport/lib/auth.TestTiming (also reported in Test flakes #4653)

Integration

Metrics

Trailing 7-day pass rate for unit and integration tests.

  • Week of January 10th. Unit 67%, Integration 56%

Proposed for Quarantine

This section is for tests that provide business value but are inherently flaky because they depend on timing or on an external resource (like CPU or network). For example, a test that waits for an event to occur and times out if the event does not occur within some time limit.

Quarantined tests will be triaged by @russjones weekly and potentially serialized and put into a retry loop.

Fixed

@russjones russjones added the bug label Dec 19, 2021
@russjones russjones self-assigned this Dec 19, 2021
@russjones russjones added flaky tests and removed bug labels Dec 19, 2021
@russjones russjones changed the title from "Flakey Test Tracker" to "Flaky Test Tracker" Dec 19, 2021
@fspmarshall
Contributor

fspmarshall commented Dec 20, 2021

Found a race in TwoClustersTunnel which I believe is the cause of this error:

failed connecting to node localhost. remote cluster "site-A" is not found

I'm not sure if this is the only problem with TwoClustersTunnel, but I'll push a fix and we can watch to see if its flakiness goes down.

edit: See #9506

@tcsc
Contributor

tcsc commented Dec 20, 2021

I'm not sure if this is the only problem with TwoClustersTunnel

I see these quite often, too. May be related to the above.

integration_test.go:1510: 
Error Trace:	integration_test.go:1510
                        integration_test.go:1388
Error:      	Received unexpected error:
       	            	connection error: desc = "transport: Error while dialing failed to dial: failed connecting to node . invalid format for proxy request: unknown cluster \"site-A\"\n"
integration_test.go:1528: 
Error Trace:	integration_test.go:1528
                        integration_test.go:1388
Error:      	Condition never satisfied
Test:       	TestIntegrations/TwoClustersTunnel/proxy
Messages:   	Timed out waiting for Site A to restart

@zmb3
Collaborator

zmb3 commented Dec 21, 2021

Is this a duplicate of #4653? Should we combine them?

@russjones
Contributor Author

@zmb3 Yeah, I think we should close that one and merge things into this one.

@zmb3 zmb3 mentioned this issue Dec 28, 2021
@zmb3
Collaborator

zmb3 commented Dec 29, 2021

TwoClustersTunnel often fails for me with:

failed connecting to node localhost. database is closed

Occasionally I get a similar error that says cache is closed instead of database is closed.

Interestingly:

  • I can only reproduce when I run with the race detector enabled.
  • If I remove the section of the test that stops site A and restarts it, I see completely different errors

At some point in the tests, I start seeing tons of Uploader scan failed errors. I have a feeling the vast majority of this code is failing to clean up properly and is removing directories while they are still in use.

@rosstimothy
Contributor

Opened #9516 to address:

  • github.com/gravitational/teleport/lib/service.TestTeleportProcess_reconnectToAuth
  • github.com/gravitational/teleport/lib/service.TestResourceWatcher_Backoff
  • github.com/gravitational/teleport/lib/cache.TestCache_Backoff

@russjones
Contributor Author

For TestIntegrations/TwoClustersTunnel: #9655

@russjones
Contributor Author

russjones commented Jan 31, 2022

A good way to reproduce an issue is to run the test in a while loop until it fails. You might also want to run this command in multiple terminal windows at the same time.

Example command:

$ while go test . -run TestAccessMongoDB -count=1 -race; do :; done

@tcsc
Contributor

tcsc commented Jan 31, 2022

I have a similar script I use called untilfail:

#!/bin/bash

COUNT=1
while "$@"; do COUNT=$((COUNT + 1)); done

echo Ran $COUNT times

Then I can pass it an arbitrary command line:

$ untilfail go test ./integration -run TestSomethingOrOther -race

@zmb3
Collaborator

zmb3 commented Feb 4, 2022

I've started seeing TestSSHConfigConnectWithOpenSSHClient failures. Not sure if it's just my environment or not, but figured I'd put a note here. Spent 30 minutes or so debugging and I don't know what's going on.

zmb@localhost: Permission denied (publickey).
ERROR: exit status 255

Edit: It appears that even though this test passes a config file with ssh -F, openssh is still looking at my local ~/.ssh/config file. This means that this test depends on local system state.

@zmb3
Collaborator

zmb3 commented Feb 9, 2022

Here's a new one that I ran into on GCB. Can't seem to repro locally.

OUTPUT github.com/gravitational/teleport/lib/utils.TestFnCacheSanity
===================================================
=== RUN   TestFnCacheSanity
    fncache_test.go:145: 
        	Error Trace:	fncache_test.go:145
        	            				fncache_test.go:71
        	Error:      	Max difference between 10.461059975 and 9 allowed is 1, but difference was 1.4610599749999995
        	Test:       	TestFnCacheSanity
        	Messages:   	ttl=40ms, delay=0s, desc="non-blocking"
--- FAIL: TestFnCacheSanity (1.53s)
===================================================

zmb3 added a commit that referenced this issue Feb 9, 2022
Convert approxReads to an integer (by truncating) before comparing
to actual reads.

This should prevent failures where due to our approximation, we estimate
a fractional number of reads that exceed our tolerance of 1.

Sample error: Max difference between 10.461059975 and 9 allowed is 1, but difference was 1.4610599749999995

Updates #9492
zmb3 added a commit that referenced this issue Feb 15, 2022
Increase tolerance on expected reads.

This should prevent failures where due to our approximation, we estimate
a fractional number of reads that exceed our tolerance of 1.

Sample error: Max difference between 10.461059975 and 9 allowed is 1, but difference was 1.4610599749999995

Updates #9492
@zmb3
Collaborator

zmb3 commented Mar 14, 2022

@r0mant have you seen this TestDatabaseResource failure before?

2022-03-14T20:24:44Z ERRO [AUTH:2]    PID: 55274 Failed to bind to address 127.0.0.1:32963: listen tcp 127.0.0.1:32963: bind: address already in use, exiting. service/service.go:1330
    helpers_test.go:145: 
        	Error Trace:	helpers_test.go:145
        	            				resource_command_test.go:138
        	Error:      	Received unexpected error:
        	            	listen tcp 127.0.0.1:32963: bind: address already in use
        	Test:       	TestDatabaseResource
--- FAIL: TestDatabaseResource (3.35s)
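
An "address already in use" error like the one above is typical when a test picks a port number first and binds to it later, leaving a window in which another process can take the port. Whether that is the cause here is an assumption. One common way to avoid the window is to bind to port 0 and keep the listener open, as in this sketch (`newLocalListener` is a hypothetical helper, not the Teleport test harness):

```go
package main

import (
	"fmt"
	"net"
)

// newLocalListener binds to 127.0.0.1 on port 0, which asks the kernel to
// choose a free port. The returned listener already owns the port, so no
// other process can claim it between selection and use.
func newLocalListener() (net.Listener, error) {
	return net.Listen("tcp", "127.0.0.1:0")
}

func main() {
	l, err := newLocalListener()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer l.Close()

	// Pass l (not just its address) to the service under test.
	fmt.Println("listening on", l.Addr())
}
```

This only helps if the service can accept an existing listener; closing it and handing over just the address reopens the same race.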

@zmb3
Collaborator

zmb3 commented Mar 16, 2022

Another one: TestDatabaseRootLeafIdleTimeout (cc @smallinsky)

I've seen both of the subtests fail with different errors.

    --- FAIL: TestDatabaseRootLeafIdleTimeout/leaf_role_with_idle_timeout
        db_integration_test.go:371: 
        	Error Trace:	db_integration_test.go:371
        	Error:      	An error is expected but got nil.
        	Test:       	TestDatabaseRootLeafIdleTimeout/leaf_role_with_idle_timeout
    --- FAIL: TestDatabaseRootLeafIdleTimeout/root_role_with_idle_timeout (1.21s)
        db_integration_test.go:613: event type "client.disconnect" not found after 1s
FAIL
FAIL    github.com/gravitational/teleport/integration   5.218s
FAIL

Took me 23 runs to reproduce locally, but it does eventually fail.

@nklaassen
Contributor

TestBot_Run_CARotation: #14471

@nklaassen
Contributor

TestFnCacheSanity: #14534

@nklaassen
Contributor

TestAgentStart: #14553

@nklaassen
Contributor

TestNormalOperation: #14554

@nklaassen
Contributor

TestFnCacheCancellation: #14556

@ravicious
Member

TestTokens: #14737 (tool/tctl/common/token_command_test.go)

@zmb3
Collaborator

zmb3 commented Jul 25, 2022

TestSemaphoreLock: #14842

@rosstimothy
Contributor

Data Race in TestTokenGeneration: #14979

@strideynet strideynet mentioned this issue Jul 29, 2022
15 tasks
@zmb3
Collaborator

zmb3 commented Jul 29, 2022

TestIntegrations/ListResourcesAcrossClusters: #15051

@zmb3
Collaborator

zmb3 commented Aug 1, 2022

TestIntegrations/SessionRecordingModes/StrictMode: #15097

@nklaassen
Contributor

failed to parse existing kubeconfig tsh test flakiness #15490

@ibeckermayer
Contributor

TestMain (sshserver_test.go): #15520

@GavinFrazar
Contributor

TestAllowedUsers: #15656

@ibeckermayer
Contributor

TestRoleDeletionDrift: #15815

@Joerger
Contributor

Joerger commented Aug 24, 2022

TestWebSessionsRenewDoesNotBreakExistingTerminalSession: #15816

@ibeckermayer
Contributor

tbot race condition: #15843

@espadolini
Contributor

espadolini commented Aug 29, 2022

TestPingConnection: #15896

@strideynet
Contributor

TestTeleportClient_Login_local/OTP_device_login_with_hijack: #16037

@GavinFrazar
Contributor

TestHandlerConnectionUpgrade data race: #16128

@ibeckermayer
Contributor

TestGlobalRequestRecordingProxy: #16138

@ibeckermayer
Contributor

TestAppAccess/TestAppInvalidateAppSessionsOnLogout: #16200

@nklaassen
Contributor

TestRootScript: #16908

@tigrato
Contributor

tigrato commented Oct 10, 2022

TestIntegrations/AuditOn: #17224

@nklaassen
Contributor

TestDatabaseAccess/AgentState: #17543

@nklaassen
Contributor

TestWebAgentForward: #17918

@nklaassen
Contributor

TestIntegrations/TrustedTunnelNode: #17824

@ibeckermayer
Contributor

ibeckermayer commented Oct 28, 2022

FTR it's no longer expected that this issue be updated with each new flaky test. To easily see/search all the flaky tests, bookmark this URL.
