
makeBackoffMachine: Improve backoff sleep logic by making it local. #3841

Merged
6 commits merged on Apr 10, 2020

Conversation

chrisbobbe
Contributor

@chrisbobbe chrisbobbe commented Jan 25, 2020

Fixes: #3840.

Replace progressiveTimeout with makeBackoffMachine: call
makeBackoffMachine above a retry loop, and call .wait() inside the
loop on the returned backoffMachine. Give access to
numSleepsCompleted and timeElapsedSoFar so that "give up" logic
(e.g., after too many network requests with transient failures) can
be handled near the .wait() call sites. Prepare for work on #3829
(a permanent timeout on sending outbox messages).

Previously, the state in progressiveTimeout was global. That gave a
potentially desirable general throttle on all network requests that
used it, but the drawbacks were greater: a particular call to
progressiveTimeout was nondeterministic, because other calls might
have been made recently, affecting the resulting delay duration. A
60-second threshold was used as a heuristic to distinguish request
retry loops from each other, but the new interface does that much
more effectively.

@chrisbobbe chrisbobbe force-pushed the improve-progressive-timeout branch 2 times, most recently from c126fb7 to ed75f0f on January 27, 2020 21:52
Contributor

@rk-for-zulip rk-for-zulip left a comment


The rename commit feels like a misstep. makeBackoffMachine has a different API and no shared implementation details; it would be more straightforward to:

  • create makeBackoffMachine directly in async.js, with new tests;
  • replace the various uses of progressiveTimeout; and
  • remove progressiveTimeout.

@chrisbobbe
Contributor Author

chrisbobbe commented Jan 31, 2020

The rename commit feels like a misstep. makeBackoffMachine has a different API and no shared implementation details; it would be more straightforward to:

  • create makeBackoffMachine directly in async.js, with new tests;
  • replace the various uses of progressiveTimeout; and
  • remove progressiveTimeout.

Mm, that approach makes a lot more sense than what I did! The rename commit does preserve functionality and keep the diff pretty minimal, but at the cost of being semantically incorrect, as noted in the commit message. Thanks for proposing this strategy, which avoids that awkward hiccup.

@chrisbobbe
Contributor Author

This would definitely be safer with Lolex. (Ref: #3727.)

There should probably also be tests for the backoff machine's auxiliary functions here.

OK! I'll await #3727 before switching to Lolex and adding those additional tests, since it looks like there's some question about what file the Lolex shim will eventually be in.

@chrisbobbe chrisbobbe added the "blocked on other work" label (to come back to after another related PR, or some other task) on Jan 31, 2020
@chrisbobbe chrisbobbe force-pushed the improve-progressive-timeout branch 2 times, most recently from 24d8fc8 to 212d261 on January 31, 2020 19:34
@chrisbobbe
Contributor Author

chrisbobbe commented Jan 31, 2020

I just reorganized my commits, but you may want to postpone another review until I've added the Lolex changes (pending resolution of that PR) and those additional tests you mentioned; up to you. Also, the bug you spotted with Date.now() - undefined is now resolved, on Greg's recommendation, by simply removing the timeElapsedSinceFirstWait getter, since it isn't used anywhere.

@chrisbobbe
Contributor Author

Removed the commit repealing the no-undef-init rule, and fixed the startTime declaration to comply, as discussed yesterday with Ray.

@rk-for-zulip
Contributor

To minimize blockage, the Lolex timer work has been separated from the presence-heartbeat work into a new PR, #3886.

@chrisbobbe
Contributor Author

Following discussion about private fields on classes, resolved in https://chat.zulip.org/#narrow/stream/243-mobile-team/topic/private.20class.20fields/near/815229, I'm happy to turn this into a class BackoffMachine if that's desired.

@gnprice
Member

gnprice commented Feb 20, 2020

The other day I happened to learn about some science! :atom: that shows there's a way to choose how long to wait before retrying which is much better than the pure exponential version we've been using:
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

The takeaway is it's important to add "jitter", and specifically a quite radical amount of jitter, like this:

sleep = random_between(0, min(cap, base * 2 ** attempt))

Analysis in the linked post, and it makes all kinds of sense in retrospect.

I think this will be a pretty easy tweak to the code in this branch 🙂 when we come back to it -- the bulk of the work here is basically orthogonal to the particular formula used.
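As a minimal sketch of that formula in JavaScript (the base and cap constants here are illustrative, not the values used in this branch):

const BASE = 100;       // ms -- illustrative only
const CAP = 10 * 1000;  // ms -- illustrative only

// "Full jitter": choose uniformly at random between 0 and the capped
// exponential value for this attempt.
function jitteredDuration(numSleepsCompleted) {
  return Math.random() * Math.min(CAP, BASE * 2 ** numSleepsCompleted);
}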

@chrisbobbe
Contributor Author

chrisbobbe commented Feb 20, 2020

That’s great!! I’d seen jitter being used in some examples I found while working on this, but didn’t see the performance data or the different possible ways to use jitter.

@chrisbobbe chrisbobbe force-pushed the improve-progressive-timeout branch 2 times, most recently from 36ad8b1 to a94febf on February 20, 2020 20:07
@chrisbobbe
Contributor Author

OK, I've pushed a commit that adds jitter. I've also (hopefully) clarified the commit message for the commit that introduces makeBackoffMachine.

@gnprice
Member

gnprice commented Feb 25, 2020

Lolex is now merged, via #3886 -- so this is now unblocked to use that.

@gnprice gnprice removed the "blocked on other work" label on Feb 25, 2020
@chrisbobbe
Contributor Author

chrisbobbe commented Feb 25, 2020

> Lolex is now merged, via #3886 -- so this is now unblocked to use that.

I'll pause in case there's cleanup work for 6d1d1df#r37482253. (This was a false alarm; I'm OK to proceed.)

@chrisbobbe
Contributor Author

And, now with Lolex! Thanks, @ray-kraesig and @gnprice!

@@ -32,6 +32,8 @@ rules:
# Formatting and naming
camelcase: off

no-await-in-loop: off # Promise.all is not always desirable
Contributor


I'm not sure how turning this off will prevent – or even make less likely – the failure state you describe in the commit message. Our existing improper uses of array.forEach seem more likely to have been induced by AirBnB's anti-loop position, which we've since thoughtfully revoked. (See the bottom of this file.)

Which is not to say that I'm against removing this lint! I suspect we almost invariably do want sequential await whenever we have multiple Awaitables, simply because we rarely – if ever – want to fire off a hundred network requests at once.

Contributor Author

@chrisbobbe chrisbobbe Mar 3, 2020


I'm not sure how turning this off will prevent – or even make less likely – the failure state you describe in the commit message.

Well, with this rule, if you want to await sequentially in a loop, you have to choose between an eslint-disable and the failure of using a forEach. I think it's reasonable to assume that someone will choose wrong because they trust our choice of rules and they're not looking closely at the implementation of forEach. I'll see if I can make this clearer in the commit message.

Our existing improper uses of array.forEach seem more likely to have been induced by AirBnB's anti-loop position, which we've since thoughtfully revoked. (See the bottom of this file.)

That may be true, but it seems anecdotal; you're talking about the outbox issue, right? I didn't know about it when I committed this, and in fact my reasoning doesn't depend on any improper uses I saw. It sounds like you're not opposed to repealing no-await-in-loop anyway, but I hope the reason I'd like to do so is clear.

Comment on lines 47 to 48
expect(minFromAllTrials).toBeLessThan(expectedMax * 0.25);
expect(maxFromAllTrials).toBeGreaterThan(expectedMax * 0.75);
Contributor


My first reaction on seeing this is "wow, that's likely to flake someday, isn't it?"

Which is of course wrong. But perhaps this could use a comment noting that the chances of this flaking are on the order of 2 * (1-(1-(0.25**100))**11) ≈ 1 / 7.3e58?

Contributor


Which is of course wrong. But perhaps this could use a comment noting that the chances of this flaking are on the order of 2 * (1-(1-(0.25**100))**11) ≈ 1 / 7.3e58?

I have no idea what I was thinking here with 0.25**100 rather than 0.75**100. (The rest is due to considering the aggregate flake chance rather than the failure chance of an individual assertion. I'm skeptical of the 2 *, in retrospect, but it's probably at least close to that.)

... anyway, the comment you've added in the current head is 👍.

Contributor Author


... anyway, the comment you've added in the current head is 👍.

Many thanks to @gnprice for this, I can't take the credit! 😄

The docs at https://eslint.org/docs/rules/no-await-in-loop say,
"Usually, the code should be refactored to create all the promises
then get access to the results using `Promise.all()`. Otherwise, each
successive operation will not start until the previous one has
completed."

But sometimes invoking one asynchronous task depends on the
resolution of the previous task, or it's important to have only one
task in progress at a time. If either of these is important, the
programmer must use an `eslint-disable` within their `for` or
`while` loop.

The only other choice that seems reasonable is to use
Array.prototype.forEach, but this will introduce a bug: A promise
returned by the callback passed to forEach is not `await`ed before
the next invocation of the callback.

['a', 'b', 'c'].forEach(async letter => {
  await new Promise(resolve => setTimeout(resolve, 1000));
  console.log(letter);
});

The desired result is

// (one second passes)
// 'a'
// (one second passes)
// 'b'
// (one second passes)
// 'c'

but the actual result is

// (one second passes)
// 'a'
// 'b'
// 'c'

With this rule, if you want to `await` sequentially in a loop, you
have to choose between an `eslint-disable` and the failure of using
a forEach. It's reasonable to assume that someone will choose wrong
because they trust our choice of rules and they're not looking
closely at the implementation of forEach. So, we're better without
it.
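For contrast, the desired sequential behavior is what a plain for...of loop inside an async function gives, which is exactly the pattern that repealing the rule allows without an eslint-disable (logSlowly is just an illustrative name):

async function logSlowly() {
  for (const letter of ['a', 'b', 'c']) {
    // Each iteration finishes its await before the next one starts:
    // 'a', 'b', 'c' are logged one second apart.
    await new Promise(resolve => setTimeout(resolve, 1000));
    console.log(letter);
  }
}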
@chrisbobbe
Contributor Author

chrisbobbe commented Mar 4, 2020

I've made the requested changes, and an additional change that Greg and I discussed (making it into a class BackoffMachine for readability). I also moved the BackoffMachine tests into async-test.js, since BackoffMachine is and has been in async.js.

This isn't ready to be merged in its current state. There seems to be Lolex-related interference between the existing tests in async-test and the ones I added; I'm seeing failures on both the existing sleep (ideal) › waits for exactly the right number of milliseconds test and my added BackoffMachine › timeouts are random from zero to 100ms, 200ms, 400ms, 800ms... test, with this message:

Timeout - Async callback was not invoked within the 5000ms timeout specified by jest.setTimeout.

The failures in the newly added tests go away when the old tests are commented out, and vice versa.

@rk-for-zulip
Contributor

I also moved the BackoffMachine tests into async-test.js, since BackoffMachine is and has been in async.js.

And that's what triggered it. Jest is running both Lolex-based tests in the same environment in parallel. 😞

One would hope this would be resolved out of the box in Jest 26; it might even be safe in Jest 25.1.0 with a custom environment. In the meantime, I think placing the BackoffMachine tests in their own file (with a note describing why) should suffice as a workaround for now.

@chrisbobbe
Contributor Author

Argh, Jest!! OK, I'll put the tests back in their own file.

Or, argh, JavaScript? On the face of it, I wouldn't expect any interference, because the two tests running in parallel each use their own Lolex object (const lolex: Lolex = new Lolex()). I'd like that to be sufficient to ensure that they don't interfere, but I suspect that if I dug deeply enough, I'd find that those two objects are sharing the use of some bit of state. Do you think I would encounter the same unfulfilled expectation in most other programming languages, or is this just another lovely JavaScript quirk?

Chris Bobbe added 5 commits March 4, 2020 10:59
Improve state logic for sleep durations in API request retry loops.

progressiveTimeout uses global state, with the effect that one
network request's retry attempts affect the sleep durations used for
any other request. This has the benefit of enabling a general
throttle on all request retry loops that used progressiveTimeout.
But different requests may have transient failures for different
reasons, so it's not obvious that we want this general throttle. And
hard-to-trace bugs can creep in when the behavior of
progressiveTimeout can't be determined from a particular call site.

Also, progressiveTimeout uses a 60-second threshold as a heuristic
to distinguish request retry loops from each other, with the
simplifying assumption that different types of requests will not
occur within 60 seconds of each other. This distinction is more
effectively done by managing the state per-loop in the first place,
and doing so eliminates more of those hard-to-trace bugs mentioned
in the previous paragraph.

So, introduce the ability to handle state locally. Capped
exponential backoff still mitigates high request traffic, but
per-request.

Preparation for zulip#3829, to be completed in this series of commits.
The benefits of "jitter" are large, and the implementation is not
complex. With plain capped exponential backoff, if there is a burst
of client requests, they will all back off together, resulting in
more bursts, with gaps in between. Introducing randomness,
particularly a lot of randomness, smooths out those bursts.

Greg refers to this article:
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
and he discusses the advantages of "jitter" in more detail at
https://chat.zulip.org/#narrow/stream/92-learning/topic/exponential.20backoff.20--.20with.20jitter.
There, he claims that the work to be done with "full jitter" is O(N log N),
a great improvement over O(N^2) without any jitter.
Enable the use of makeBackoffMachine in tryUntilSuccessful. We need
to call `new BackoffMachine()` in the body of tryUntilSuccessful,
but above the loop; that isn't possible while tryUntilSuccessful is
recursive, so rewrite it with a loop.

Now we can call `new BackoffMachine()` inside the function but above
the loop.
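A rough sketch of the resulting shape, under the assumption that tryUntilSuccessful retries on any error (the real code in src/utils/async.js may differ in its details):

export async function tryUntilSuccessful(func) {
  const backoffMachine = new BackoffMachine();
  while (true) {
    try {
      return await func();
    } catch (e) {
      // Sleep with capped, jittered exponential backoff, then retry.
      await backoffMachine.wait();
    }
  }
}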
@chrisbobbe
Contributor Author

chrisbobbe commented Mar 4, 2020

OK, I've put them back into backoffMachine-test.js.

EDIT: Um, oops? Does Command+Enter trigger "Close Pull Request" instead of "Comment"??

@chrisbobbe chrisbobbe closed this Mar 4, 2020
@chrisbobbe chrisbobbe reopened this Mar 4, 2020
@rk-for-zulip
Contributor

Argh, Jest!! OK, I'll put the tests back in their own file.

Or, argh, JavaScript?

I could make a decent argument that it's "Argh, Ray!" I didn't write the core of our Lolex shim, but I did write most of the shell.

On the face of it, I wouldn't expect any interference, because the two tests running in parallel each use their own Lolex object (const lolex: Lolex = new Lolex()). I'd like that to be sufficient to ensure that they don't interfere, but I suspect that if I dug deeply enough, I'd find that those two objects are sharing the use of some bit of state.

You don't need to dig too deep. For starters, they're both (presumably?) modifying the same global Date object.

This problem is actually noted in the Lolex shim. I wasn't expecting this use case to fail either, though; I hadn't considered that Jest might run different test-groups in the same environment in parallel.

I don't know how Jest v26 is supposed to handle this – but then, I don't know how Jest handles mocking the setTimeout global binding in some tests but not others today. (I can think of at least two ways, and both of them are extensible to handle Date.now().)
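A tiny illustration of the shared-global point, using the upstream lolex API directly rather than our shim (the shim's surface may differ):

import lolex from 'lolex';

const realDateNow = Date.now;
const clock = lolex.install();          // replaces the global Date, setTimeout, etc.
console.log(Date.now === realDateNow);  // false: every test running in this
                                        // environment now sees the faked Date
clock.uninstall();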

Do you think I would encounter the same unfulfilled expectation in most other programming languages, or is this just another lovely JavaScript quirk?

Depends on exactly what the unfulfilled expectation in question is, I think. It's certainly possible to see similar behavior in any dynamic language with a mutable global environment.

@rk-for-zulip
Contributor

Sorry about the delay! Not sure how this one fell off my radar.

Looks good. Merged!

@chrisbobbe
Contributor Author

chrisbobbe commented Apr 10, 2020

No problem! Thanks!! 🙂

chrisbobbe referenced this pull request Apr 11, 2020
We disabled this rule for the whole tree in 4fcfb76; so now
we get to delete these lines that disabled it in specific places
it had popped up.
@chrisbobbe chrisbobbe deleted the improve-progressive-timeout branch November 6, 2020 03:39