
Hopefully fix flaky Consul fencing test #23280

Merged
merged 2 commits into main from fix-ce-fencing-test-flake on Sep 28, 2023
Conversation

banks (Member) commented Sep 26, 2023

I've seen this test fail in our Enterprise repo several times now with:

2023-09-25T15:10:52.9999552Z     consul_fencing_test.go:168:
2023-09-25T15:10:53.0000151Z         	Error Trace:	/home/runner/actions-runner/_work/vault-enterprise/vault-enterprise/vault/external_tests/consul_fencing_binary/consul_fencing_test.go:168
2023-09-25T15:10:53.0000414Z         	Error:      	Received unexpected error:
2023-09-25T15:10:53.0001126Z         	            	client 1 error: unable to perform patch: error performing merge patch to /test/data/data: Error making API request.
2023-09-25T15:10:53.0001258Z
2023-09-25T15:10:53.0001619Z         	            	URL: PATCH https://127.0.0.1:32875/v1/test/data/data
2023-09-25T15:10:53.0001841Z         	            	Code: 400. Errors:
2023-09-25T15:10:53.0001965Z
2023-09-25T15:10:53.0003027Z         	            	* Waiting for the primary to upgrade from non-versioned to versioned data. This backend will be unavailable for a brief period and will resume service when the primary is finished.
2023-09-25T15:10:53.0003275Z         	Test:       	TestConsulFencing_PartitionedLeaderCantWrite

This is due to the test relying on a KV-v2 mount, which runs an asynchronous upgrade procedure at mount time that must complete before writes are accepted.

This is true for Community Edition Vault too, but there the process is usually quick (I've not been able to reproduce the failure locally in either edition). It fails more frequently in CI in Enterprise because the non-active nodes are performance standbys: not only do they have to wait for the primary to complete the upgrade, they also have to notice that completion by polling some state, which increases the chance that the first write to one of them errors.

This fix should avoid that in either case by spending a few seconds ensuring that writes succeed through a non-active node before we start the main body of the test, where errors are treated more seriously.

Edit: Nick brought to my attention that we fixed mount upgrades to be synchronous for empty mounts last year, which explains why this test never flakes in CE or on the active node in Enterprise. In Enterprise, though, the performance standbys still check asynchronously whether the active node is done, which creates the race between that check and the first request.
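
For context, a minimal sketch of the pre-check described above, assuming the github.com/hashicorp/vault/api client. The helper name matches the waitForKVv2Upgrade helper referenced in the review below, but the body, path, timeout, and payload here are illustrative, not the PR's actual code:

	package consulfencing

	import (
		"context"
		"testing"
		"time"

		"github.com/hashicorp/vault/api"
	)

	// waitForKVv2Upgrade retries a harmless KV-v2 write against the given
	// (non-active) node until it succeeds or the deadline expires. Writes
	// return a 400 until the non-versioned -> versioned upgrade completes
	// on this node.
	func waitForKVv2Upgrade(t *testing.T, client *api.Client) {
		t.Helper()

		// Give the mount a generous window to finish its async upgrade.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()

		for {
			_, err := client.Logical().WriteWithContext(ctx, "test/data/data",
				map[string]interface{}{
					"data": map[string]interface{}{"probe": "ok"},
				})
			if err == nil {
				return
			}
			t.Logf("waitForKVv2Upgrade: write failed: %s", err)

			select {
			case <-ctx.Done():
				t.Fatalf("timed out waiting for KV-v2 upgrade, last error: %s", err)
			case <-time.After(500 * time.Millisecond):
			}
		}
	}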

github-actions bot added the hashicorp-contributed-pr label Sep 26, 2023
banks requested a review from a team September 26, 2023 11:02
github-actions bot commented Sep 26, 2023

Build Results:
All builds succeeded! ✅

github-actions bot
CI Results:
All Go tests succeeded! ✅

Review thread on the retry loop inside the new waitForKVv2Upgrade helper:

	if err == nil {
		return
	}
	t.Logf("waitForKVv2Upgrade: write faile: %s", err)
Contributor:

It might be worth checking the error to verify that it is an upgrade error, and erroring out if it's something else.
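
A sketch of what such a check could look like, keyed on the upgrade message from the failure log above (the helper name and the text-matching approach are assumptions, not code from this PR):

	// isKVv2UpgradeErr reports whether err looks like the transient KV-v2
	// upgrade error from the failure log; anything else should fail fast.
	// Uses the standard library "strings" package. Matching on error text
	// is brittle if the message ever changes, a concern raised below.
	func isKVv2UpgradeErr(err error) bool {
		return err != nil && strings.Contains(err.Error(),
			"Waiting for the primary to upgrade from non-versioned to versioned data")
	}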

Contributor:

Also, there's a typo in the word "faile".

Contributor:

Do we need this? Maybe we could just include an aggregate of all the error messages in the fatal message in the ctx.Done case?

banks (Member, Author) replied Sep 26, 2023
I originally thought I wouldn't sniff the error message, just to be more robust (both against other errors that are similarly transient during setup, and against us changing the specific error message in this case), but I guess it might be nicer to fail fast?

If we catch some other non-transient error here, we'll at least slow down on the retries and only output it a handful of times before we fail.

What do you think is preferable?

Contributor:

I don't feel strongly about it; I'm fine merging it as is.

banks enabled auto-merge (squash) September 27, 2023 15:11
banks merged commit 9fc67b6 into main Sep 28, 2023 (108 checks passed)
banks deleted the fix-ce-fencing-test-flake branch September 28, 2023 12:25