Hopefully fix flaky Consul fencing test #23280
if err == nil {
	return
}
t.Logf("waitForKVv2Upgrade: write faile: %s", err)
it might be worth checking the error to verify that it is an upgrade error, and erroring out if it's something else.
also, typo for the word faile
Do we need this? Maybe we could just include an aggregate of all the error messages in the fatal message in the ctx.Done case?
I originally thought I wouldn't sniff the error message just to be more robust (both against other errors that are similarly transient during setup and against us changing the specific error message in this case), but I guess it might be nicer to fail fast?
If we catch some other non-transient error here we'll at least slow down on the retries and only output it a handful of times before we fail.
What do you think is preferable?
I don't feel strongly about it, I'm fine to merge it as is
vault/external_tests/consul_fencing_binary/consul_fencing_test.go
bf4fabd to a61ccfa
I've seen this test fail in our Enterprise repo several times now with:
This is because the test relies on a KV-v2 mount, which runs an asynchronous upgrade procedure on mount that must complete before writes are accepted.
This is true for Community Edition Vault too, but there the process is usually quick (I've not been able to recreate the failure locally in either version). It fails more often in Enterprise CI because the non-active nodes there are performance standbys. They not only have to wait for the primary to complete the upgrade, they also have to notice that it has completed by polling some state, which increases the chance that the first write to one of them will error.
This fix should avoid that in either case by spending a few seconds ensuring that writes are available through a non-active node before we start the main body of the test where we treat errors more seriously.
Edit: Nick brought to my attention that we made mount upgrades synchronous for empty mounts last year, which explains why this test never flakes in CE or on the active node in Ent. In Ent, though, the perf standbys still check asynchronously for when the active node is done, which leaves a race between that check and the first request.