-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(client): retry api version discovery for at least 30 seconds #5290
chore(client): retry api version discovery for at least 30 seconds #5290
Conversation
Needs rebase :( |
Will this fix #5286? |
I believe something else already fixed it in @joschisan's PR. But if it had been there, it would probably also prevented it. |
During debugging with joschisan, we've spotted a case where in a test the Federation was not yet fully up (peer api servers were down), yet the test Client was already attempting to join with a pre-prepared config. Normally in this case fetching the config would fail, but that was passed straight from the server fixture, so api discovery call was actually the first call. It queried all peers, all were down, the request logic would re-attempt connection but only once, which instantly failed as well, failing the whole initial api version discovery and disabling all client modules. While in this case it seems like a combination of test failing to setup Federation before attempting to run client code, it makes me think that in real life there might be cases where the client is joining new Federation, gets a config, and then e.g. switches from wifi to GSM, etc. and has a brief period of being offline. Again - this is only ever a problem on a first time client is joining. But to prevent such a hiccups, in this change, I'm making it so the client will attempt at least for a certain period to discover the version before giving it up, to lower the chances of such an unfortunate UX.
d80fc13
to
784b97c
Compare
@dpc I did not fix this in general, just prevented this by awaiting the api to come online. |
Yeah The client can't join federation if it's not online yet. Waiting for it in the test fixture is the right fix, IMO. |
let deadline = now() + Duration::from_secs(30); | ||
|
||
loop { | ||
let res = | ||
Self::discover_common_api_version_static_try(config, client_module_init, api, mode) | ||
.await; | ||
|
||
if res.is_ok() { | ||
return res; | ||
} | ||
|
||
if deadline < now() { | ||
return res; | ||
} | ||
debug!(target: LOG_CLIENT, "Retrying getting api version from Federation"); | ||
sleep(Duration::from_secs(1)).await; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: #5323
During debugging with joschisan, we've spotted a case where in a test the Federation was not yet fully up (peer api servers were down), yet the test Client was already attempting to join with a pre-prepared config.
Normally in this case fetching the config would fail, but that was passed straight from the server fixture, so api discovery call was actually the first call. It queried all peers, all were down, the request logic would re-attempt connection but only once, which instantly failed as well, failing the whole initial api version discovery and disabling all client modules.
While in this case it seems like a problem of test failing to setup Federation before attempting to run client code, it makes me think that in real life there might be cases where the client is joining new Federation, gets a config, and then e.g. switches from wifi to GSM, etc. and has a brief period of being offline.
Again - this is only ever a problem on a first time client is joining. But to prevent such a hiccups, in this change, I'm making it so the client will attempt at least for a certain period to discover the version before giving it up, to lower the chances of such an unfortunate UX.