Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry long-poll failures, with backoff #184

Closed
gnprice opened this issue Jun 14, 2023 · 1 comment · Fixed by #515
Closed

Retry long-poll failures, with backoff #184

gnprice opened this issue Jun 14, 2023 · 1 comment · Fixed by #515
Assignees
Labels
a-api Implementing specific parts of the Zulip server API beta feedback Things beta users have specifically asked for
Milestone

Comments

@gnprice
Copy link
Member

gnprice commented Jun 14, 2023

A Zulip client long-polls an event queue on the server to get events. In the current prototype, if that request fails for any reason, the long-poll loop just aborts:

  void poll() async {
    while (true) {
      final result = await getEvents(connection,
        queueId: queueId, lastEventId: lastEventId);
      // …

Instead, if the request fails with an error that we can expect to be transient, we should retry, with appropriate backoff.

split off as #514: [We should also ensure that if the request doesn't come back within an appropriate timeout, then we treat it as having failed and retry. The Zulip server should always respond to a long-poll within at most about 60 seconds, with a heartbeat event if there's no information to convey. So if it's been much longer than that and we haven't gotten a response, then we should assume none is coming — the request or response must have been lost somewhere. (This part might get split out as a separate issue.)]

For which types of errors we can expect to be transient and should therefore retry, see zulip-mobile at tryFetch:

            if (!(shouldRetry && (e instanceof Server5xxError || e instanceof NetworkError))) {
              throw e;

For how to control backoff, see zulip-mobile at BackoffMachine, background discussion at zulip/zulip-mobile#3841 (comment) , and the use of BackoffMachine in tryFetch. I think we can basically transcribe BackoffMachine from JS to Dart and reuse that.

@gnprice gnprice added the a-api Implementing specific parts of the Zulip server API label Jun 14, 2023
@gnprice gnprice added this to the Alpha milestone Jun 14, 2023
@gnprice gnprice modified the milestones: Alpha, Launch Sep 22, 2023
@gnprice gnprice added the beta feedback Things beta users have specifically asked for label Jan 26, 2024
@gnprice gnprice modified the milestones: Launch, Beta 2 Jan 26, 2024
@gnprice
Copy link
Member Author

gnprice commented Jan 26, 2024

This continues to happen sometimes, even after fixing #184 which covered the most common reason for failure. When it does, the symptoms are pretty confusing:
https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/.22mark.20as.20read.22.20issue.20.28Flutter.29/near/1726533

So we should fix this.

The main complexity in the code here will be in doing appropriate backoff. But hopefully that will be a small amount of work, through reusing the work we did in zulip-mobile:

I think we can basically transcribe BackoffMachine from JS to Dart and reuse that.

gnprice added a commit to gnprice/zulip-flutter that referenced this issue Feb 13, 2024
The new TODO comments are all things that we don't do today in
zulip-mobile, and haven't been aware of causing problems for people.
So there's no rush to do them here.

Fixes: zulip#184
@gnprice gnprice self-assigned this Feb 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a-api Implementing specific parts of the Zulip server API beta feedback Things beta users have specifically asked for
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

1 participant