Time out sending a message, and indicate failure #3829

Open
gnprice opened this issue Jan 21, 2020 · 8 comments
Labels
a-compose/send Compose box, autocomplete, camera/upload, outbox, sending

Comments

@gnprice
Member

gnprice commented Jan 21, 2020

Currently, when we try to send a message and can't, we keep the message around (as an Outbox message) and keep retrying indefinitely until it succeeds.

(Specifically, we do so in a loop with backoff, then at each subsequent message send and at each "initial fetch", so in particular at each launch of the app. See sendOutbox and its call sites.)

This isn't great, because for most messages, past a certain point it's better never to send them than to send them late. In particular, a short conversational reply like "sure" or "no, sorry" may have a completely different meaning if the discussion has moved on.

Instead, when the oldest outbox message reaches a certain age (perhaps 60 seconds? 120 seconds?) and we've been unable to send, we should stop trying to send it; give the user feedback that it failed; and give an option to retry the message or delete it. This is similar to the experience in an SMS client, or in SMS-style chat apps like FB Messenger.
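A rough sketch of this "give up after a deadline" behavior, using the tentative 60-second threshold from above (the `trySend`, `sleep`, and `now` parameters are hypothetical placeholders for illustration, not the actual zulip-mobile API):

```javascript
// Sketch of "stop retrying after a deadline"; all names here are
// hypothetical placeholders, not the app's real API.
const GIVE_UP_AFTER_MS = 60 * 1000; // tentative threshold from above

async function sendWithDeadline(message, { trySend, sleep, now = Date.now }) {
  const start = now();
  let delayMs = 100; // illustrative initial backoff
  for (;;) {
    try {
      await trySend(message);
      return 'sent';
    } catch (e) {
      // If the next sleep would carry us past the deadline, give up:
      // surface the failure in the UI and offer "Retry" / "Delete".
      if (now() - start + delayMs > GIVE_UP_AFTER_MS) {
        return 'failed';
      }
      await sleep(delayMs);
      delayMs = Math.min(delayMs * 2, 10 * 1000); // capped backoff
    }
  }
}
```

Injecting `sleep` and `now` keeps the deadline logic testable without waiting out real wall-clock time.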

The feedback is important (so that users can be confident that messages are delivered whenever they don't see the feedback), and a bit tricky UI-wise:

  • In the message list, the failed-outbox message should continue to show up, but instead of a spinner should have a "warning" or "problem" indicator. E.g., a red ⚠ (U+26A0 WARNING SIGN, ⚠️) or similar.
    • Perhaps also a bit of text right there with an error message like "Failed to send this message. Tap for options." (Or maybe long-press instead?)
    • On tap (or long-press), we'd show an action sheet with two options labelled like "Retry" and "Delete".
  • Here's the trickier part: what if the user hit send and then navigated away, and assumed the message got sent? We'll want to somehow call their attention to the fact it didn't, so they can decide what to do.
    • I think what an SMS-style app would do here is make the situation visible in the conversation list.
    • For us: maybe treat the failed-send message as unread?
gnprice added the a-compose/send label Jan 21, 2020
@gnprice
Member Author

gnprice commented Jan 21, 2020

See also #3731.

@rk-for-zulip
Contributor

  • On tap (or long-press), we'd show an action sheet with two options labelled like "Retry" and "Delete".

Also "Edit message", I think.

  • For us: maybe treat the failed-send message as unread?

That would probably be better than nothing, but ideally we should have a distinction (visible and otherwise) between the two. Notably, if the user sends a message and then mutes the relevant topic, they'll probably still want to be informed that the message couldn't be sent.

@chrisbobbe
Contributor

chrisbobbe commented Jan 22, 2020

This may warrant a separate issue, but I wonder if progressiveTimeout, which is used here, a) might be improved with the option to make the returned promise reject after a certain number of tries (or on a duration-based condition), and b) is doing what it's currently intended to do.

a) is more related to this issue. We could handle message sending failures locally, by breaking out of the loop and not calling progressiveTimeout after a certain condition. Or, we could build into progressiveTimeout an option to reject after a threshold is reached (maybe based on the number of tries, or most recent or cumulative duration), which makes sense conceptually as something we may want to do at other sites.

Before doing this, I think there are some implications of the global nature of progressiveTimeout that would become more urgent to tease out. On the one hand, the fact that it's global is nice because, as long as progressiveTimeout is used at all appropriate call sites, it throttles the volume of all requests, not just those from a particular site, so it reduces general network activity and pressure on the server. However, different kinds of requests may have transient failures for different reasons, right, and I'm not sure we always want request B's timeout to start at, say, 40 seconds, just because request A hasn't succeeded yet. Thinking about the global nature of progressiveTimeout becomes more urgent if we decide to reject the promise, say, when the most recent timeout from any call to progressiveTimeout was 60 seconds (or at some other threshold).

For b), it appears that progressiveTimeout's current intention is, when run in a loop, to do an exponential backoff up to and including a delay that exceeds 60s, and then reset to zero and repeat. This resetting is explained at the end of the JSDoc, but there may be an argument for making that note more prominent, as I didn't assume it would reset to zero:

/**
 * Sleep for a timeout that progressively grows in duration.
 *
 * Starts at 0 on the first call.  Grows initially by a small increment,
 * then exponentially, up to a maximum.
 *
 * If the last call was over 60s ago, starts over at 0.
 */

In any case, from reading the code and as confirmed empirically, this resetting never actually happens. The resetPeriod of 60 seconds would trigger the reset, but the duration never reaches that 60-second threshold because the maxDuration is 10 seconds. After a brief period of proper backoff, progressiveTimeout just becomes a 10-second timer. (EDIT: N.B., since it's global, as mentioned earlier, this looks to be a 10-second timer on all requests that use progressiveTimeout, every time, after that initial backoff.)
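To make the arithmetic concrete, here's a small sketch of a capped doubling backoff using the constants discussed above (the 10-second maxDuration and 60-second resetPeriod; the 100ms initial increment is an assumed value, not necessarily the app's actual one). The per-call duration plateaus at the cap and so never approaches the reset threshold:

```javascript
// Constants from the discussion above; firstDuration is an assumed value.
const firstDuration = 100;     // ms; illustrative initial increment
const maxDuration = 10 * 1000; // the 10s cap
const resetPeriod = 60 * 1000; // the 60s reset threshold

// The first n sleep durations of a capped exponential backoff.
function backoffDurations(n) {
  const durations = [];
  for (let i = 0, d = firstDuration; i < n; i++, d *= 2) {
    durations.push(Math.min(d, maxDuration));
  }
  return durations;
}

const ds = backoffDurations(20);
console.log(ds.slice(0, 10));
// → [100, 200, 400, 800, 1600, 3200, 6400, 10000, 10000, 10000]
// The sequence pins at the 10s cap, so no duration ever reaches
// resetPeriod, and the "start over at 0" branch can't fire mid-loop.
console.log(Math.max(...ds) < resetPeriod); // → true
```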

@chrisbobbe
Contributor

chrisbobbe commented Jan 22, 2020

However, different kinds of requests may have transient failures for different reasons, right, and I'm not sure we always want request B's timeout to start at, say, 40 seconds, just because request A hasn't succeeded yet.

Hang on. Even if request A succeeds after several increasing progressiveTimeouts, request B's first timeout will not be zero, right — it'll be longer than A's last one!
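A toy model of the shared-state behavior being described here (not the real progressiveTimeout code; the doubling and constants are illustrative): two unrelated retry loops drawing from one module-global duration.

```javascript
// Toy model (not the real progressiveTimeout): one module-global
// duration shared by every caller.
let globalDuration = 0;

function nextGlobalTimeout() {
  const d = globalDuration;
  globalDuration = globalDuration === 0 ? 100 : Math.min(globalDuration * 2, 10 * 1000);
  return d;
}

// Request A retries a few times...
const aTimeouts = [nextGlobalTimeout(), nextGlobalTimeout(), nextGlobalTimeout()];
// ...then request B starts its own, unrelated retry loop.
const bFirst = nextGlobalTimeout();

console.log(aTimeouts); // → [0, 100, 200]
console.log(bFirst);    // → 400: not 0, B inherits A's backed-off state
```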

@gnprice
Member Author

gnprice commented Jan 24, 2020

The resetPeriod of 60 seconds would trigger the reset, but the duration never reaches that 60-second threshold because the maxDuration is 10 seconds. After a brief period of proper backoff, progressiveTimeout just becomes a 10-second timer.

The intention is that it backs off up to a maximum interval (10s), then stays there as long as we're in a continuous retry loop; and then in the future, after that retry loop is over and things might be all working again, starts the backoff sequence over from zero.

The 60s thing isn't super principled, but I think it can basically be seen as an attempt to implement "that retry loop finished, one way or another, and this is a new retry loop where we should start by assuming the best."

Before doing this, I think there are some implications of the global nature of progressiveTimeout that would become more urgent to tease out. On the one hand, the fact that it's global is nice because, as long as progressiveTimeout is used at all appropriate call sites, it throttles the volume of all requests, not just those from a particular site, so it reduces general network activity and pressure on the server. However, different kinds of requests may have transient failures for different reasons, right, and I'm not sure we always want request B's timeout to start at, say, 40 seconds, just because request A hasn't succeeded yet.

Yeah, I agree. I think it'd be better for it to track its state for one retry loop at a time, rather than globally.

A good way to structure that might be to stick the state in a closure, something like

export const makeBackoffMachine = () => {
  let lastTimeoutTime = 0;
  let lastDuration = 0;

  return {
    wait(): Promise<void> {
      // ... the body of the existing `progressiveTimeout` ...
    }
  };
};

and then callers would look like

const backoffMachine = makeBackoffMachine();
while (true) {
  // ...

  await backoffMachine.wait();
}
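For concreteness, the sketch above might be filled in along these lines (a hypothetical body with illustrative constants, matching the JSDoc quoted earlier in that the first call sleeps 0; the real progressiveTimeout internals may differ):

```javascript
// Hypothetical fleshed-out makeBackoffMachine; constants are illustrative.
const makeBackoffMachine = ({ firstDuration = 100, maxDuration = 10 * 1000 } = {}) => {
  let numSleepsCompleted = 0;

  return {
    // The next duration is a pure function of how many sleeps this
    // particular machine has completed: no cross-request state.
    nextDuration() {
      return numSleepsCompleted === 0
        ? 0
        : Math.min(firstDuration * 2 ** (numSleepsCompleted - 1), maxDuration);
    },
    wait() {
      const duration = this.nextDuration();
      numSleepsCompleted++;
      return new Promise(resolve => setTimeout(resolve, duration));
    },
  };
};
```

Each retry loop makes its own machine, so a fresh loop always starts from zero, and the 60s heuristic isn't needed to tell loops apart.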

Hmm, yeah, and this should probably be a separate issue thread. 😉

@chrisbobbe
Contributor

chrisbobbe commented Jan 24, 2020

OK, thanks for the clarification and suggestion! I'm working on an implementation of this; I'll post in a new issue soon. I think I understand the 60s much better now. I believe, with the suggested implementation, that "a new retry loop" in

"that retry loop finished, one way or another, and this is a new retry loop where we should start by assuming the best"

will be determined much more easily and accurately (i.e., when a new backoff machine has been made), so the 60s heuristic won't actually be needed, and it'll all be clearer and more deterministic without it.

Opened as #3840.

chrisbobbe pushed a commit to chrisbobbe/zulip-mobile that referenced this issue Jan 25, 2020
Replace progressiveTimeout with makeBackoffMachine, which is called
above a loop, and .wait() can be called inside the loop on the
returned backoffMachine. Give access to numSleepsCompleted and
timeElapsedSoFar so "give up" logic (e.g., on too many network
requests with transient failures) can be handled near .wait() call
sites. Prepare for work on zulip#3829 (a permanent timeout on sending
outbox messages).

Previously, the state logic in progressiveTimeout was global, which
meant a potentially desirable general throttle on all network
requests that used it, but the drawbacks were greater. A particular
call to progressiveTimeout was nondeterministic, because other calls
may have been made recently, affecting the resulting delay duration.
A 60-second threshold was used as a heuristic to distinguish request
retry loops from each other, but this is much more effectively done
with the new interface.
rk-for-zulip pushed a commit that referenced this issue Apr 10, 2020
Improve state logic for sleep durations in API request retry loops.

progressiveTimeout uses a global state, with the effect that one
network request's retry attempts affect the sleep durations used for
any other request. This has the benefit of enabling a general
throttle on all request retry loops that used progressiveTimeout.
But different requests may have transient failures for different
reasons, so it's not obvious that we want this general throttle. And
hard-to-trace bugs can creep in when the behavior of
progressiveTimeout can't be determined from a particular call site.

Also, progressiveTimeout uses a 60-second threshold as a heuristic
to distinguish request retry loops from each other, with the
simplifying assumption that different types of requests will not
occur within 60 seconds of each other. This distinction is more
effectively done by managing the state per-loop in the first place,
and doing so eliminates more of those hard-to-trace bugs mentioned
in the previous paragraph.

So, introduce the ability to handle state locally. Capped
exponential backoff still mitigates high request traffic, but
per-request.

Preparation for #3829, to be completed in this series of commits.
@chrisbobbe
Contributor

#3841 has been merged, clearing the way for work on this issue. Anyone interested in claiming this?

rk-for-zulip self-assigned this Apr 16, 2020
@rk-for-zulip
Copy link
Contributor

Let's say me. (I'm already working on #3881, which has a fair range of intersection, and I think if someone else claimed it we'd be stepping on each other's toes a lot.)
