Random queues get ignored a short while after the Sidekiq process starts. #5031
Comments
I should mention we have other services using Sidekiq 6.2.x that are apparently unaffected by this. We've been trying to compare and contrast them with this service (newman), but so far nothing seems to really be different. The issue only showed up about two weeks ago, coinciding with when I upgraded the Sidekiq version.
Are you using super_fetch? If not, any chance you can bisect to narrow down the commit?
Neither project (the one that works and the one that doesn't) is using super_fetch.
If you can reliably reproduce it, bisecting would allow me to fix it quickly. Or if you can determine the first version it appears in, that would narrow things down a lot.
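For reference, a bisect between the versions discussed in this thread might look like the sketch below; the tags and the per-step test are illustrative, not taken from this issue:

```sh
# Mark the last known-good and first known-bad releases, then let git
# walk the commit range between them.
git bisect start
git bisect bad v6.2.2      # first version showing the stuck queues
git bisect good v6.0.7     # last version known to work
# At each checkout git selects, run the reproduction, then record:
git bisect good            # or: git bisect bad
# When git names the first bad commit, clean up:
git bisect reset
```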
Sorry for the delay. I've narrowed it down to fce05c9. I can reliably reproduce the error locally with that commit, but after four attempts I have been unable to reproduce it with the commit just before.
Right, that was a major refactor. Are you using any 3rd party Sidekiq plugins which might be incompatible and need updating too?
In the project where I'm consistently noticing the behavior, we were using several third-party Sidekiq plugins. However, I was able to remove all those gems, including Pro & Enterprise, and just run regular ol' Sidekiq and get the same behavior on that commit.
I was able to replicate this issue with a fresh Rails app and JRuby 9.2.6.0: https://github.com/mogman1/idle_queue_demo.
Thank you, I will take a look today.
I can't get 9.2.6.0 or 9.2.9.0 working with this app. Here's the 9.2.9.0 bug:
I will try to reproduce it on MRI 2.7.
Huh. For what it's worth, we run OpenJDK Java 8 for JRuby.
FWIW I ran 500 jobs via your steps many times on several different Sidekiq processes using MRI 2.7.2 and didn't see any lockups. Here's my JRuby:
I will upgrade JRuby and see if the latest works.
Ok, got it running and reproduced it. I believe switching to strict priority fixes it, so that's an option:
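For reference, strict priority is what you get when sidekiq.yml lists bare queue names, while the bracketed form gives the weighted (randomized) mode under discussion. The queue names below are placeholders, not taken from this thread:

```yaml
# sidekiq.yml — strict priority: queues are always checked in this order.
:queues:
  - critical
  - default
  - low

# Weighted priority (the mode affected here) would instead look like:
# :queues:
#   - [critical, 3]
#   - [default, 2]
#   - [low, 1]
```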
I believe the issue is a thread race condition in the way the weighted queues are scrambled on each fetch. I suspect the bug only happens on JRuby because of its true parallelism.
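As a hypothetical sketch of that class of race (not Sidekiq's actual fetch code): several fetcher threads share one queue list, each shuffles it in place before reading it, and a snapshot taken while another thread is mid-shuffle can contain duplicates, meaning one queue silently drops out of the BRPOP argument list:

```ruby
# Shared, mutable queue list shuffled concurrently by many threads.
queues = %w[queue:a queue:b queue:c].freeze
shared = queues.dup

threads = 8.times.map do
  Thread.new do
    corrupted = 0
    5_000.times do
      shared.shuffle!         # unsynchronized write
      snapshot = shared.dup   # unsynchronized read, as each fetch would do
      corrupted += 1 unless snapshot.sort == queues.sort
    end
    corrupted
  end
end

# On JRuby (true parallelism) corrupted snapshots show up far more
# readily than on MRI, matching the behavior reported in this issue.
puts "corrupted snapshots: #{threads.sum(&:value)}"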
Can you test main now and see if it fixes your issue?
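One way to test an unreleased fix is to point the Gemfile at the repository's main branch; the repo path below is the one this issue lives under, as a sketch:

```ruby
# Gemfile — run Sidekiq from the main branch instead of a release.
gem "sidekiq", github: "mperham/sidekiq", branch: "main"
```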
Pulling down the latest, I can no longer reproduce the issue in my test script nor in our internal service, so that seems to have done it. What I find odd, though, is that based on the change you made, I would think 6.0.7 would still have been impacted by this (that same code existed then, too). So I'm kind of at a loss as to why.
Great news. It’s possible JRuby made changes which made this race more likely.
Environment(s)
Ruby version: JRuby 9.2.6.0
Rails version: 6.0.4.1
Sidekiq / Pro / Enterprise version(s):
- Sidekiq 6.1.3 / Pro 5.2.1 / Ent 2.2.2
- Sidekiq 6.2.1 / Pro 5.2.4 / Ent 2.2.3
- Sidekiq 6.2.2 / Pro 5.2.4 / Ent 2.2.3
Description
Two weeks ago I upgraded from 6.0.7 to 6.2.2 and began noticing that sometimes one or more of my queues would stop getting processed. I have three different queues, and prior to the upgrade they would get pulled from just fine, but now it seems like they get "stuck". If I restart the Sidekiq process, workers will pull jobs from all queues just fine until they zero out; then one or two queues will just get ignored while work builds up in them, and the remaining queue will still have jobs pulled from it. It isn't consistent which queues get stuck and which get work pulled from them.
In our deployed environments, we have multiple servers running Sidekiq and pulling jobs off the queues. From watching the logs, it appears that certain processes latch onto certain queues and will pull jobs off those queues while seemingly ignoring the others, despite having identical startup commands.
When I rolled back to 6.0.7, this problem no longer occurred and all queues were processed as expected. We also tried changing the redis gem version, but nothing changed with 4.3.1, 4.4.0, or 4.5.1. When a process seemed stuck processing only one or two queues, if I started a Rails console and manually pulled from Redis using BRPOP, I could successfully pull a job from any queue. I also tried a simple loop in the console that continuously called BRPOP, and I never saw it get stuck the way Sidekiq seems to.
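A rough reconstruction of that console check; the queue names are placeholders, since the actual names weren't posted:

```ruby
require "redis"

# Continuously BRPOP across all queues, much like Sidekiq's basic
# fetch does; each call blocks up to 2 seconds waiting for a job.
redis = Redis.new
loop do
  queue, job = redis.brpop("queue:default", "queue:low", "queue:critical", timeout: 2)
  puts "#{queue}: #{job}" if job
end
```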
When looking at the stack traces after sending TTIN, all workers appear to be waiting on a socket read inside the BRPOP command.
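For reference, the TTIN signal that produces those dumps can be sent like this; the PID lookup is illustrative:

```sh
# Ask a running Sidekiq process to print all thread backtraces to its log.
kill -TTIN "$(pgrep -f sidekiq | head -n 1)"
```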
#5028 sounds suspiciously like what I'm experiencing, though I don't know for sure. The process is still running just fine, and whenever a job shows up in the queue the process happens to be watching (even though it should be watching all of them), it will still pull it off, so I don't think #5029 is related.
TTIN stacktrace output