
fix: rate limiter improvement #2580

Closed
wants to merge 1 commit

Conversation

@andreas andreas commented Apr 14, 2023

As described in #2235, the current rate limiter will "lag behind" on processing of jobs once the rate limit has been exceeded. This is because the current implementation throttles jobs purely based on whether there are any delayed jobs, and attempts to "space out" jobs according to the configured max and duration. However, this strategy means that if new jobs come in while jobs that were delayed due to rate limiting are still being processed, the new jobs will be rate limited as well, even if they would not cause the rate limit to be exceeded.

The diagram below tries to depict this scenario:

The proposed solution instead stacks up delayed jobs at the beginning of the next rate limiter "window".

The current tests do not expose this behaviour, as the bug does not surface with max = 1. I've included tests with combinations of max, duration and job counts (numJobs), which fail with the current implementation but pass with the proposed solution. They rely on the fact that processing numJobs jobs for a given max and duration should complete within the interval from (Math.ceil(numJobs / max) - 1) * duration to Math.ceil(numJobs / max) * duration, as sketched below.
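To illustrate the bound the tests rely on (the helper name and signature here are mine, purely for illustration, not code from this PR):

```ts
// Hypothetical helper mirroring the timing bound described above; the names
// expectedCompletionWindow, numJobs, max and duration are illustrative only.
function expectedCompletionWindow(numJobs: number, max: number, duration: number) {
  const windows = Math.ceil(numJobs / max); // rate limiter windows needed
  return {
    earliestMs: (windows - 1) * duration, // all but the last window must elapse
    latestMs: windows * duration,         // the last window may be used in full
  };
}

// Example: 7 jobs with max = 2 per 1000 ms should finish between 3000 ms and 4000 ms.
console.log(expectedCompletionWindow(7, 2, 1000)); // { earliestMs: 3000, latestMs: 4000 }
```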

@manast
Member

manast commented Apr 15, 2023

I appreciate this PR; however, I am struggling to understand what it is solving. 🤔

@andreas
Author

andreas commented Apr 16, 2023

@manast, consider the following two qualities of a rate limiter:

  1. A rate limiter should make full use of the specified rate limit, i.e. if there are more than max jobs enqueued, then max jobs should be processed every duration (assuming processing time is not a bottleneck).
  2. A rate limiter should not delay the processing of jobs unnecessarily. A rate limited job should be processed as quickly as possible.

The current rate limiter does not fully have these qualities. Specifically, it does not make full use of the specified rate limit, and jobs are delayed unnecessarily. This is what's described in #2235, and we experience the same thing in production for our workload. Essentially, jobs are moved into the delayed state even though the queue is processing jobs below the rate limit.

Below is a scenario that shows the problem (admittedly designed to make matters as bad as possible). It ignores the 1.1 drift compensation factor to keep the time calculations simpler.

  1. Consider a rate limiter with max=3 and duration=60*60*1000 (1 hour), where each job takes 1 minute and jobs are processed serially (concurrency 1). Let's call the current time t0.
  2. t0: Enqueue 5 jobs. Let's call them j1 through j5.
  3. t0 + 0h1m: j1 completes.
  4. t0 + 0h2m: j2 completes.
  5. t0 + 0h3m: j3 completes. The rate limit is now exceeded, so j4 is delayed until t0 + 1h0m and j5 is delayed until t0 + 1h20m.
  6. t0 + 1h1m: j4 completes. Now, enqueue job j6, which will be delayed until t0 + 2h0m because j5 is delayed.

In conclusion, j5 is processed at t0 + 1h20m but could be processed at t0 + 1h1m, so its processing is delayed 19 minutes more than necessary. Further, j6 is processed at t0 + 2h0m but could be processed at t0 + 1h2m, so its processing is delayed 58 minutes more than necessary 😨
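For what it's worth, here is a back-of-the-envelope sketch of the two behaviours in this scenario (purely illustrative; the constants and names are mine, not code from this PR):

```ts
// Scenario from above: max = 3 per hour, serial processing, 1 minute per job,
// 1.1 drift factor ignored. All values are offsets from t0.
const max = 3;
const hour = 60 * 60 * 1000;
const minute = 60 * 1000;

// Current behaviour: the n-th rate-limited job (n = 0, 1, ...) is pushed into the
// next window and spaced out by duration / max.
const delayedUntilCurrent = (n: number) => hour + n * (hour / max);
console.log(delayedUntilCurrent(0) / minute); // j4: 60 -> t0 + 1h0m
console.log(delayedUntilCurrent(1) / minute); // j5: 80 -> t0 + 1h20m

// Proposed behaviour: a delayed job waits only for the start of the next window
// and then runs as soon as a worker is free, so j5 can start right after j4.
const j5StartProposed = hour + 1 * minute;
console.log(j5StartProposed / minute); // 61 -> t0 + 1h1m
```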

@manast
Member

manast commented Apr 16, 2023

Thanks for the explanation. OK, so the issue is due to an optimization performed several years ago to prevent jobs from being delayed too little and ending up in a loop where they are rate limited, delayed, moved back to wait, and then rate limited again, and so on, which in high-load scenarios would make things really bad. So your fix reverts to the old implementation... and I am not sure this is good. The main issue here is that, due to features like group keys for rate-limited jobs, it is not possible to actually "pause" the queue when it is rate limited, which would be the optimal way to solve this issue: as soon as the queue is rate limited, no jobs are processed at all, not even moved to the delayed set; instead they are kept waiting until the rate limit expires. This is how the rate limiter works in BullMQ now, but we had to sacrifice group keys, which is now a feature of the Pro version but properly implemented with virtual queues.

So all in all, I am not sure this can be merged even if it solves your particular case; you may well end up suffering from the problem I described above if you hit high loads in the future.

@andreas
Author

andreas commented Apr 16, 2023

@manast, thanks for the additional context. I could be mistaken, but I think there are key differences between the proposed implementation and the previous versions:

  1. The version of the rate limiter prior to fix: better delay for rate limited jobs #1212 appears to only delay jobs to the next limiter window when jobCounter >= maxJobs. Here jobCounter refers to the number of jobs processed in the current window. This implementation exhibits the problems you mention about rate limited jobs being rate limited over and over again (issue Heap out of memory error on large number of queued DELAYED jobs #1110).
  2. After applying fix: better delay for rate limited jobs #1212, rate limited jobs are "spaced out" based on the number of jobs processed in the current time window, i.e. (jobCounter - maxJobs) * duration / max. I assume this mitigates the re-processing problem of Heap out of memory error on large number of queued DELAYED jobs #1110.
  3. After applying feat: better rate limiter #1816, rate limited jobs are "spaced out" based on the total number of rate limited jobs and it adds the drift compensation factor, i.e. 1.1 * (numLimitedJobs - maxJobs) * duration / max.

While it may appear that the proposed solution is similar to "version 1", note that it uses the total number of rate limited jobs for delaying, not just the jobs processed in the current time window. This is a key difference, which I believe avoids the re-processing problem of #1110: jobs are still "spaced out" as in version 3, they are just delayed to the beginning of a limiter window rather than being "distributed" within the window. This has the benefit of avoiding unnecessary delays and better utilizing the full rate limit. A side-by-side sketch of the three delay calculations is below.
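To make the comparison concrete, here is how I read the three delay calculations (a sketch under my own assumptions; the identifiers are illustrative and not the ones used in the actual Lua scripts):

```ts
const max = 3;
const duration = 60 * 60 * 1000; // 1 hour

// "Version 2" (#1212): space out based on the number of jobs processed in the
// current window.
const delayAfter1212 = (jobCounter: number) =>
  (jobCounter - max) * duration / max;

// "Version 3" (#1816): space out based on the total number of rate-limited jobs,
// with a 1.1 drift compensation factor.
const delayAfter1816 = (numLimitedJobs: number) =>
  1.1 * (numLimitedJobs - max) * duration / max;

// Proposed (my reading of this PR): push the n-th rate-limited job to the start
// of the rate limiter window it fits into, rather than distributing it within
// the window.
const delayProposed = (numLimitedJobs: number) =>
  Math.ceil(numLimitedJobs / max) * duration;
```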

In summary, I believe the proposed solution has the qualities I mentioned in my above comment and it avoids the problem of processing rate limited jobs over and over again. That being said, you're the expert, so I may very well have missed something 🙂

@manast
Member

manast commented Apr 18, 2023

A couple of tests related to this change are failing.

@andreas
Author

andreas commented Apr 18, 2023

Thanks, I'll take a look!

@stale

stale bot commented Jun 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jun 17, 2023
@stale stale bot closed this Jun 24, 2023