fix: rate limiter improvement #2580
Conversation
I appreciate this PR; however, I am struggling to understand what it is solving. 🤔
@manast, consider the following two qualities of a rate limiter:

1. It should make full use of the specified rate limit.
2. It should not delay jobs unnecessarily.

The current rate limiter does not fully have these qualities. Specifically, it does not make full use of the specified rate limit, and jobs are delayed unnecessarily. This is what's described in #2235, and we experience the same thing in production for our workload. Essentially, jobs are moved into the delayed state despite the queue processing jobs below the rate limit. Below is a scenario that shows the problem (admittedly designed to make matters worse; see the sketch after this paragraph). It ignores the …
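To make the symptom concrete, here is a minimal sketch of how I'd reproduce it (queue name, Redis URL, counts and timings are all illustrative, not taken from our production setup):

```ts
// Minimal reproduction sketch: with max = 10 jobs per 1000 ms, a burst of 30
// jobs should drain in ~3 windows. With the current implementation, jobs
// added mid-drain are pushed to the delayed set even when the current window
// still has unused capacity.
import Queue from 'bull';

const queue = new Queue('repro', 'redis://127.0.0.1:6379', {
  limiter: { max: 10, duration: 1000 },
});

queue.process(async () => {
  // No-op processor; only the scheduling behaviour matters here.
});

for (let i = 0; i < 30; i++) {
  await queue.add({ i });
}

// While the backlog drains, add a few more jobs and check the delayed count:
// they end up delayed even though fewer than `max` jobs ran in this window.
setTimeout(async () => {
  for (let i = 30; i < 33; i++) {
    await queue.add({ i });
  }
  console.log('delayed jobs:', await queue.getDelayedCount());
}, 1500);
```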
In conclusion, …
Thanks for the explanation. OK, so the issue is due to an optimization performed several years ago to minimize jobs being delayed too little and ending up in a loop where they are rate limited, delayed, go back to wait, then get rate limited again, and so on, which in high-load scenarios would make things really bad. So your fix reverts to the old implementation... and I am not sure this is good. The main issue here is that, due to features like group keys for rate-limited jobs, it is not possible to actually "pause" the queue when it is rate limited, which would be the optimal way to solve this issue: as soon as the queue is rate limited, no jobs are processed at all, not even moved to the delay set; instead, they are kept waiting until the rate limiter expires. This is how the rate limiter works in BullMQ now, but we had to sacrifice group keys, which is now a feature of the Pro version, properly implemented with virtual queues. So all in all, I am not sure this could be merged even if it solves your particular case; you may even suffer from what I described above if you get high loads in the future.
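To illustrate the "pause the whole queue" behaviour described above, here is a rough sketch of how such a worker loop could behave; the interface below is made up for illustration and is not Bull's or BullMQ's actual API:

```ts
// Sketch: when the limiter trips, the worker stops fetching entirely instead
// of moving waiting jobs to the delayed set. Interface names are illustrative.
interface IllustrativeWorker {
  msUntilRateLimitExpires(): Promise<number>; // 0 when the limiter is open
  fetchNextJob(): Promise<object | null>;
  process(job: object): Promise<void>;
}

async function fetchLoop(worker: IllustrativeWorker): Promise<void> {
  for (;;) {
    const ttl = await worker.msUntilRateLimitExpires();
    if (ttl > 0) {
      // Rate limited: process nothing and move nothing to "delayed";
      // simply wait out the current window, then try again.
      await new Promise((resolve) => setTimeout(resolve, ttl));
      continue;
    }
    const job = await worker.fetchNextJob();
    if (job === null) return; // queue drained (simplified for this sketch)
    await worker.process(job);
  }
}
```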
@manast, thanks for the additional context. I could be mistaken, but I think there are key differences between the proposed implementation and the previous versions.
While it may appear that the proposed solution is similar to "version 1", note that it uses the total number of rate-limited jobs for delaying, not just the jobs processed in the current time window. This is a key difference, which I believe avoids the re-processing problem of #1110: jobs will be "spaced out" as for version 3, they are just delayed to the beginning of a limiter window rather than being "distributed" within the windows. This has the benefit of avoiding unnecessary delays and better utilizing the full rate limit. In summary, I believe the proposed solution has the qualities I mentioned in my comment above, and it avoids the problem of processing rate-limited jobs over and over again. That being said, you're the expert, so I may very well have missed something 🙂
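To put my own numbers on that: with `max = 3` and `duration = 1000`, 7 jobs hitting the limiter would be delayed so that jobs 1–3 land at the start of the next window, jobs 4–6 at the start of the one after, and job 7 at the third, filling each window up to `max`. A "distributed" scheme instead spreads them out inside the windows, which delays jobs without increasing throughput.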
A couple of tests related to this change are failing.
Thanks, I'll take a look!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
As described in #2235, the current rate limiter will "lag behind" on the processing of jobs once the rate limit has been exceeded. This is because the current implementation throttles jobs purely based on whether there are any delayed jobs, and attempts to "space out" jobs according to the configured `max` and `duration`. However, this strategy means that if new jobs come in while jobs that were delayed due to rate limiting are being processed, these new jobs will be rate limited as well, even if they would not cause the rate limit to be exceeded. The diagram below tries to depict this scenario:

The proposed solution instead stacks up delayed jobs at the beginning of the next rate limiter "window".
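To make the idea concrete, here is a sketch of the delay calculation in plain TypeScript. This is my illustration of the approach rather than the actual implementation, and all names are made up:

```ts
// Illustrative sketch (not the PR's actual code): delay a job that hit the
// rate limit to the start of the first window with spare capacity, so delayed
// jobs stack up at window boundaries and each window holds up to `max` jobs.
function rateLimitDelay(
  now: number,             // current timestamp (ms)
  windowStart: number,     // start of the current limiter window (ms)
  duration: number,        // window length (ms)
  max: number,             // max jobs per window
  rateLimitedCount: number // total jobs already delayed by the limiter
): number {
  // Windows already fully booked by previously rate-limited jobs.
  const fullWindows = Math.floor(rateLimitedCount / max);
  // Target the start of the first window that still has capacity.
  const target = windowStart + (fullWindows + 1) * duration;
  return Math.max(0, target - now);
}

// Example: window started at t=0, duration 1000 ms, max 10, 25 jobs already
// delayed -> the next job is delayed until t=3000 (the fourth window).
console.log(rateLimitDelay(400, 0, 1000, 10, 25)); // 2600
```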
The current tests do not expose this behaviour, as the bug does not surface with `max = 1`. I've included tests with combinations of `max`, `duration` and job counts (`numJobs`) which fail with the current implementation but pass with the proposed solution. They rely on the fact that processing `numJobs` jobs for a given `max` and `duration` should take between `(Math.ceil(numJobs / max) - 1) * duration` and `Math.ceil(numJobs / max) * duration`.
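For illustration, here is a sketch of how that bound could be computed and asserted in a test (the helper and the numbers are my own, not lifted from this PR):

```ts
// Sketch of the timing bound the tests rely on: numJobs jobs at a limit of
// `max` per `duration` ms should take at least `lowerMs` and at most `upperMs`
// (plus some slack for processing overhead in a real test).
function expectedProcessingBounds(numJobs: number, max: number, duration: number) {
  const windows = Math.ceil(numJobs / max); // windows needed to fit all jobs
  return {
    lowerMs: (windows - 1) * duration, // the last batch starts in the final window
    upperMs: windows * duration,       // all jobs complete within `windows` windows
  };
}

// Example: 25 jobs with max = 10 per 1000 ms -> ceil(25/10) = 3 windows,
// so the total processing time should fall between 2000 ms and 3000 ms.
const { lowerMs, upperMs } = expectedProcessingBounds(25, 10, 1000);
console.log(lowerMs, upperMs); // 2000 3000
```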