
Hopefully fix flaky Consul fencing test #23280

Merged
merged 2 commits into main from fix-ce-fencing-test-flake on Sep 28, 2023
Conversation

banks (Member) commented Sep 26, 2023

I've seen this test fail in our Enterprise repo several times now with:

2023-09-25T15:10:52.9999552Z     consul_fencing_test.go:168:
2023-09-25T15:10:53.0000151Z         	Error Trace:	/home/runner/actions-runner/_work/vault-enterprise/vault-enterprise/vault/external_tests/consul_fencing_binary/consul_fencing_test.go:168
2023-09-25T15:10:53.0000414Z         	Error:      	Received unexpected error:
2023-09-25T15:10:53.0001126Z         	            	client 1 error: unable to perform patch: error performing merge patch to /test/data/data: Error making API request.
2023-09-25T15:10:53.0001258Z
2023-09-25T15:10:53.0001619Z         	            	URL: PATCH https://127.0.0.1:32875/v1/test/data/data
2023-09-25T15:10:53.0001841Z         	            	Code: 400. Errors:
2023-09-25T15:10:53.0001965Z
2023-09-25T15:10:53.0003027Z         	            	* Waiting for the primary to upgrade from non-versioned to versioned data. This backend will be unavailable for a brief period and will resume service when the primary is finished.
2023-09-25T15:10:53.0003275Z         	Test:       	TestConsulFencing_PartitionedLeaderCantWrite

This is due to the test relying on a KV-v2 mount, which runs an asynchronous upgrade procedure at mount time that must complete before writes are accepted.

This is true for Community Edition Vault too, but there the process is usually quick (I've not been able to reproduce the failure locally in either edition). It fails more frequently in CI in Enterprise because the non-active nodes are performance standbys: not only do they have to wait for the primary to complete the upgrade, they also have to notice that completion by polling some state, which increases the chance that the first write to one of them errors.

This fix should avoid that in either case by spending a few seconds ensuring that writes succeed through a non-active node before we start the main body of the test, where errors are treated more seriously.

Edit: Nick brought to my attention that we fixed mount upgrades to be synchronous for empty mounts last year, which explains why this test never flakes in CE or on the active node in Enterprise. In Enterprise, though, the performance standbys still check asynchronously whether the active node is done, which creates the race between that check and the first request.
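
For context, a minimal sketch of the pre-check described above, assuming the github.com/hashicorp/vault/api client. The helper name matches the waitForKVv2Upgrade helper referenced in the review below, but the body, path, timeout, and payload here are illustrative, not the PR's actual code:

	package consulfencing

	import (
		"context"
		"testing"
		"time"

		"github.com/hashicorp/vault/api"
	)

	// waitForKVv2Upgrade retries a harmless KV-v2 write against the given
	// (non-active) node until it succeeds or the deadline expires. Writes
	// return a 400 until the non-versioned -> versioned upgrade completes
	// on this node.
	func waitForKVv2Upgrade(t *testing.T, client *api.Client) {
		t.Helper()

		// Give the mount a generous window to finish its async upgrade.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()

		for {
			_, err := client.Logical().WriteWithContext(ctx, "test/data/data",
				map[string]interface{}{
					"data": map[string]interface{}{"probe": "ok"},
				})
			if err == nil {
				return
			}
			t.Logf("waitForKVv2Upgrade: write failed: %s", err)

			select {
			case <-ctx.Done():
				t.Fatalf("timed out waiting for KV-v2 upgrade, last error: %s", err)
			case <-time.After(500 * time.Millisecond):
			}
		}
	}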

github-actions bot added the hashicorp-contributed-pr label Sep 26, 2023
banks requested a review from a team September 26, 2023 11:02
github-actions bot commented Sep 26, 2023

Build Results:
All builds succeeded! ✅

github-actions bot
CI Results:
All Go tests succeeded! ✅

Review thread on the retry loop inside the new waitForKVv2Upgrade helper:

	if err == nil {
		return
	}
	t.Logf("waitForKVv2Upgrade: write faile: %s", err)
Contributor:

It might be worth checking the error to verify that it is an upgrade error, and erroring out if it's something else.
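
A sketch of what such a check could look like, keyed on the upgrade message from the failure log above (the helper name and the text-matching approach are assumptions, not code from this PR):

	// isKVv2UpgradeErr reports whether err looks like the transient KV-v2
	// upgrade error from the failure log; anything else should fail fast.
	// Uses the standard library "strings" package. Matching on error text
	// is brittle if the message ever changes, a concern raised below.
	func isKVv2UpgradeErr(err error) bool {
		return err != nil && strings.Contains(err.Error(),
			"Waiting for the primary to upgrade from non-versioned to versioned data")
	}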

Contributor:

Also, there's a typo in the word "faile".

Contributor:

Do we need this? Maybe we could just include an aggregate of all the error messages in the fatal message in the ctx.Done case?

banks (Member, Author) replied Sep 26, 2023
I originally thought I wouldn't sniff the error message, just to be more robust (both against other errors that are similarly transient during setup, and against us changing the specific error message in this case), but I guess it might be nicer to fail fast?

If we catch some other non-transient error here, we'll at least slow down on the retries and only output it a handful of times before we fail.

What do you think is preferable?

Contributor:

I don't feel strongly about it; I'm fine merging it as is.

banks enabled auto-merge (squash) September 27, 2023 15:11
banks merged commit 9fc67b6 into main Sep 28, 2023 (108 checks passed)
banks deleted the fix-ce-fencing-test-flake branch September 28, 2023 12:25