Time out sending a message, and indicate failure #3829

Open
gnprice opened this issue Jan 21, 2020 · 8 comments
Labels
a-compose/send Compose box, autocomplete, camera/upload, outbox, sending

Comments

@gnprice
Member

gnprice commented Jan 21, 2020

Currently, when we try to send a message and can't, we keep the message around (as an Outbox message) and keep retrying indefinitely until it succeeds.

(Specifically, we do so in a loop with backoff, then at each subsequent message send and at each "initial fetch", so in particular at each launch of the app. See sendOutbox and its call sites.)

This isn't great, because for most messages, past a certain point it's better never to send them than to send them late. In particular, a short conversational reply like "sure" or "no, sorry" may have a completely different meaning if the discussion has moved on.

Instead, when the oldest outbox message reaches a certain age (perhaps 60 seconds? 120 seconds?) and we've been unable to send, we should stop trying to send it; give the user feedback that it failed; and give an option to retry the message or delete it. This is similar to the experience in an SMS client, or in SMS-style chat apps like FB Messenger.
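A rough sketch of this "give up after a deadline" behavior, using the tentative 60-second threshold from above (the `trySend`, `sleep`, and `now` parameters are hypothetical placeholders for illustration, not the actual zulip-mobile API):

```javascript
// Sketch of "stop retrying after a deadline"; all names here are
// hypothetical placeholders, not the app's real API.
const GIVE_UP_AFTER_MS = 60 * 1000; // tentative threshold from above

async function sendWithDeadline(message, { trySend, sleep, now = Date.now }) {
  const start = now();
  let delayMs = 100; // illustrative initial backoff
  for (;;) {
    try {
      await trySend(message);
      return 'sent';
    } catch (e) {
      // If the next sleep would carry us past the deadline, give up:
      // surface the failure in the UI and offer "Retry" / "Delete".
      if (now() - start + delayMs > GIVE_UP_AFTER_MS) {
        return 'failed';
      }
      await sleep(delayMs);
      delayMs = Math.min(delayMs * 2, 10 * 1000); // capped backoff
    }
  }
}
```

Injecting `sleep` and `now` keeps the deadline logic testable without waiting out real wall-clock time.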

The feedback is important (so that users can be confident that messages are delivered whenever they don't see the feedback), and a bit tricky UI-wise:

  • In the message list, the failed-outbox message should continue to show up, but instead of a spinner should have a "warning" or "problem" indicator. E.g., a red ⚠ (U+26A0 WARNING SIGN, ⚠️) or similar.
    • Perhaps also a bit of text right there with an error message like "Failed to send this message. Tap for options." (Or maybe long-press instead?)
    • On tap (or long-press), we'd show an action sheet with two options labelled like "Retry" and "Delete".
  • Here's the trickier part: what if the user hit send and then navigated away, and assumed the message got sent? We'll want to somehow call their attention to the fact it didn't, so they can decide what to do.
    • I think what an SMS-style app would do here is make the situation visible in the conversation list.
    • For us: maybe treat the failed-send message as unread?
gnprice added the a-compose/send label Jan 21, 2020
@gnprice
Member Author

gnprice commented Jan 21, 2020

See also #3731.

@rk-for-zulip
Contributor

  • On tap (or long-press), we'd show an action sheet with two options labelled like "Retry" and "Delete".

Also "Edit message", I think.

  • For us: maybe treat the failed-send message as unread?

That would probably be better than nothing, but ideally we should have a distinction (visible and otherwise) between the two. Notably, if the user sends a message and then mutes the relevant topic, they'll probably still want to be informed that the message couldn't be sent.

@chrisbobbe
Contributor

chrisbobbe commented Jan 22, 2020

This may warrant a separate issue, but I wonder if progressiveTimeout, which is used here, a) might be improved with the option to make the returned promise reject after a certain number of tries (or on a duration-based condition), and b) is doing what it's currently intended to do.

a) is more related to this issue. We could handle message sending failures locally, by breaking out of the loop and not calling progressiveTimeout after a certain condition. Or, we could build into progressiveTimeout an option to reject after a threshold is reached (maybe based on the number of tries, or most recent or cumulative duration), which makes sense conceptually as something we may want to do at other sites.

Before doing this, I think there are some implications of the global nature of progressiveTimeout that would become more urgent to tease out. On the one hand, the fact that it's global is nice because, as long as progressiveTimeout is used at all appropriate call sites, it throttles the volume of all requests, not just those from a particular site, so it reduces general network activity and pressure on the server. However, different kinds of requests may have transient failures for different reasons, right, and I'm not sure we always want request B's timeout to start at, say, 40 seconds, just because request A hasn't succeeded yet. Thinking about the global nature of progressiveTimeout becomes more urgent if we decide to reject the promise, say, when the most recent timeout from any call to progressiveTimeout was 60 seconds (or at some other threshold).

For b), it appears that progressiveTimeout's current intention is, when run in a loop, to do an exponential backoff up to and including a delay that exceeds 60s, and then reset to zero and repeat. This resetting is explained at the end of the JSDoc, but there may be an argument for making that note more prominent, as I didn't assume it would reset to zero:

/**
 * Sleep for a timeout that progressively grows in duration.
 *
 * Starts at 0 on the first call.  Grows initially by a small increment,
 * then exponentially, up to a maximum.
 *
 * If the last call was over 60s ago, starts over at 0.
 */

In any case, from reading the code and as confirmed empirically, this resetting never actually happens. The resetPeriod of 60 seconds would trigger the reset, but the duration never reaches that 60-second threshold because the maxDuration is 10 seconds. After a brief period of proper backoff, progressiveTimeout just becomes a 10-second timer. (EDIT: N.B., since it's global, as mentioned earlier, this looks to be a 10-second timer on all requests that use progressiveTimeout, every time, after that initial backoff.)
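To make the arithmetic concrete, here's a small sketch of a capped doubling backoff using the constants discussed above (the 10-second maxDuration and 60-second resetPeriod; the 100ms initial increment is an assumed value, not necessarily the app's actual one). The per-call duration plateaus at the cap and so never approaches the reset threshold:

```javascript
// Constants from the discussion above; firstDuration is an assumed value.
const firstDuration = 100;     // ms; illustrative initial increment
const maxDuration = 10 * 1000; // the 10s cap
const resetPeriod = 60 * 1000; // the 60s reset threshold

// The first n sleep durations of a capped exponential backoff.
function backoffDurations(n) {
  const durations = [];
  for (let i = 0, d = firstDuration; i < n; i++, d *= 2) {
    durations.push(Math.min(d, maxDuration));
  }
  return durations;
}

const ds = backoffDurations(20);
console.log(ds.slice(0, 10));
// → [100, 200, 400, 800, 1600, 3200, 6400, 10000, 10000, 10000]
// The sequence pins at the 10s cap, so no duration ever reaches
// resetPeriod, and the "start over at 0" branch can't fire mid-loop.
console.log(Math.max(...ds) < resetPeriod); // → true
```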

@chrisbobbe
Contributor

chrisbobbe commented Jan 22, 2020

However, different kinds of requests may have transient failures for different reasons, right, and I'm not sure we always want request B's timeout to start at, say, 40 seconds, just because request A hasn't succeeded yet.

Hang on. Even if request A succeeds after several increasing progressiveTimeouts, request B's first timeout will not be zero, right — it'll be longer than A's last one!
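A toy model of the shared-state behavior being described here (not the real progressiveTimeout code; the doubling and constants are illustrative): two unrelated retry loops drawing from one module-global duration.

```javascript
// Toy model (not the real progressiveTimeout): one module-global
// duration shared by every caller.
let globalDuration = 0;

function nextGlobalTimeout() {
  const d = globalDuration;
  globalDuration = globalDuration === 0 ? 100 : Math.min(globalDuration * 2, 10 * 1000);
  return d;
}

// Request A retries a few times...
const aTimeouts = [nextGlobalTimeout(), nextGlobalTimeout(), nextGlobalTimeout()];
// ...then request B starts its own, unrelated retry loop.
const bFirst = nextGlobalTimeout();

console.log(aTimeouts); // → [0, 100, 200]
console.log(bFirst);    // → 400: not 0, B inherits A's backed-off state
```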

@gnprice
Member Author

gnprice commented Jan 24, 2020

The resetPeriod of 60 seconds would trigger the reset, but the duration never reaches that 60-second threshold because the maxDuration is 10 seconds. After a brief period of proper backoff, progressiveTimeout just becomes a 10-second timer.

The intention is that it backs off up to a maximum interval (10s), then stays there as long as we're in a continuous retry loop; and then in the future, after that retry loop is over and things might be all working again, starts the backoff sequence over from zero.

The 60s thing isn't super principled, but I think it can basically be seen as an attempt to implement "that retry loop finished, one way or another, and this is a new retry loop where we should start by assuming the best."

Before doing this, I think there are some implications of the global nature of progressiveTimeout that would become more urgent to tease out. On the one hand, the fact that it's global is nice because, as long as progressiveTimeout is used at all appropriate call sites, it throttles the volume of all requests, not just those from a particular site, so it reduces general network activity and pressure on the server. However, different kinds of requests may have transient failures for different reasons, right, and I'm not sure we always want request B's timeout to start at, say, 40 seconds, just because request A hasn't succeeded yet.

Yeah, I agree. I think it'd be better for it to track its state for one retry loop at a time, rather than globally.

A good way to structure that might be to stick the state in a closure, something like

export const makeBackoffMachine = () => {
  let lastTimeoutTime = 0;
  let lastDuration = 0;

  return {
    wait(): Promise<void> {
      // ... the body of the existing `progressiveTimeout` ...
    }
  };
};

and then callers would look like

const backoffMachine = makeBackoffMachine();
while (true) {
  // ...

  await backoffMachine.wait();
}
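For concreteness, the sketch above might be filled in along these lines (a hypothetical body with illustrative constants, matching the JSDoc quoted earlier in that the first call sleeps 0; the real progressiveTimeout internals may differ):

```javascript
// Hypothetical fleshed-out makeBackoffMachine; constants are illustrative.
const makeBackoffMachine = ({ firstDuration = 100, maxDuration = 10 * 1000 } = {}) => {
  let numSleepsCompleted = 0;

  return {
    // The next duration is a pure function of how many sleeps this
    // particular machine has completed: no cross-request state.
    nextDuration() {
      return numSleepsCompleted === 0
        ? 0
        : Math.min(firstDuration * 2 ** (numSleepsCompleted - 1), maxDuration);
    },
    wait() {
      const duration = this.nextDuration();
      numSleepsCompleted++;
      return new Promise(resolve => setTimeout(resolve, duration));
    },
  };
};
```

Each retry loop makes its own machine, so a fresh loop always starts from zero, and the 60s heuristic isn't needed to tell loops apart.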

Hmm, yeah, and this should probably be a separate issue thread. 😉

@chrisbobbe
Contributor

chrisbobbe commented Jan 24, 2020

OK, thanks for the clarification and suggestion! I'm working on an implementation of this; I'll post in a new issue soon. I think I understand the 60s much better now. I believe, with the suggested implementation, that "a new retry loop" in

"that retry loop finished, one way or another, and this is a new retry loop where we should start by assuming the best"

will be determined much more easily and accurately (i.e., when a new backoff machine has been made), so the 60s heuristic won't actually be needed, and it'll all be clearer and more deterministic without it.

Opened as #3840.

chrisbobbe pushed a commit to chrisbobbe/zulip-mobile that referenced this issue Jan 25, 2020
Replace progressiveTimeout with makeBackoffMachine, which is called
above a loop, and .wait() can be called inside the loop on the
returned backoffMachine. Give access to numSleepsCompleted and
timeElapsedSoFar so "give up" logic (e.g., on too many network
requests with transient failures) can be handled near .wait() call
sites. Prepare for work on zulip#3829 (a permanent timeout on sending
outbox messages).

Previously, the state logic in progressiveTimeout was global, which
meant a potentially desirable general throttle on all network
requests that used it, but the drawbacks were greater. A particular
call to progressiveTimeout was nondeterministic, because other calls
may have been made recently, affecting the resulting delay duration.
A 60-second threshold was used as a heuristic to distinguish request
retry loops from each other, but this is much more effectively done
with the new interface.
rk-for-zulip pushed a commit that referenced this issue Apr 10, 2020
Improve state logic for sleep durations in API request retry loops.

progressiveTimeout uses a global state, with the effect that one
network request's retry attempts affect the sleep durations used for
any other request. This has the benefit of enabling a general
throttle on all request retry loops that used progressiveTimeout.
But different requests may have transient failures for different
reasons, so it's not obvious that we want this general throttle. And
hard-to-trace bugs can creep in when the behavior of
progressiveTimeout can't be determined from a particular call site.

Also, progressiveTimeout uses a 60-second threshold as a heuristic
to distinguish request retry loops from each other, with the
simplifying assumption that different types of requests will not
occur within 60 seconds of each other. This distinction is more
effectively done by managing the state per-loop in the first place,
and doing so eliminates more of those hard-to-trace bugs mentioned
in the previous paragraph.

So, introduce the ability to handle state locally. Capped
exponential backoff still mitigates high request traffic, but
per-request.

Preparation for #3829, to be completed in this series of commits.
@chrisbobbe
Contributor

#3841 has been merged, clearing the way for work on this issue. Anyone interested in claiming this?

rk-for-zulip self-assigned this Apr 16, 2020
@rk-for-zulip
Copy link
Contributor

Let's say me. (I'm already working on #3881, which has a fair range of intersection, and I think if someone else claimed it we'd be stepping on each other's toes a lot.)
