Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: don't fail test helpers if cleanup returns a 404 #527

Closed
wants to merge 1 commit into from

Conversation

sebasslash
Copy link
Contributor

Description

We've had a series of flaky tests where the test itself succeeds but the cleanup fails, most notably workspaces deleting their children before the test is able to make the API call to delete a given child resource. We want to reduce the possibility of a flake during cleanup.

The idea here is we shouldn't fail a test if the resource that it's trying to clean up is not found (ErrResourceNotFound). We don't use explicit synchronization when firing off API calls which could explain why parent resources are occasionally deleted first. Rather than to wrangle with explicit ordering, we can simply add a fail-safe to ignore deleting non existent resources.

It is worth mentioning, all remaining resources are destroyed when the tflocal instance is rebuilt nightly so these dangling resources don't sit around for too long. We don't however ignore all deletion errors since they can potentially raise an API level concern.

External links

@sebasslash sebasslash requested a review from a team as a code owner September 7, 2022 16:08
Copy link
Collaborator

@brandonc brandonc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that this targets 404 errors only, and I only have one reservation: 404 is also used when an authorization error occurs so that could potentially hide a bug that the test runner does not have authorization to delete something, which would be surprising.

I think I am leaning toward being against this change because proper ordering (deleting the thing the workspace owns before deleting the workspace, for instance) and using t.Cleanup so ordering is simplified eliminates the problem. I don't think this has the potential to resolve flakes that are otherwise successful tests, but I think it has the potential to obscure test structure bugs or authorization bugs.

@Uk1288
Copy link
Contributor

Uk1288 commented Sep 8, 2022

Good point, sounds like we might have to handle this on a case by case flake. Given that some of the clean up failures are difficult to reproduce, I am thinking we should monitor and fix them as they occur.

@sebasslash
Copy link
Contributor Author

Hmm, I don't think this obscures an authorization bug because we always test the delete method for a given interface -- which if in the case it was an auth bug, those tests would fail as well. I think if we're getting a 404 during the cleanup, but the related delete integration test succeeds, it is an actual 404.

@laurenolivia
Copy link
Contributor

@sebasslash The StateVersion flake failing your PR has been fixed 😸

@sebasslash
Copy link
Contributor Author

sebasslash commented Sep 13, 2022

Alrighty, there haven't been any more comments and it seems we're set on failing tests with 404s on cleanup. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants