New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Job getting lost in case of failed retry #4138
Comments
Can you supply the error and backtrace you are seeing?
… On Apr 3, 2019, at 09:19, Yuri Smirnov ***@***.***> wrote:
Ruby 2.6.2 with Sidekiq Pro 4.0.5.
I inspected the code and it turns out that in case of exception in Sidekiq::JobRetry#attempt_retry we are still acknowledging the job here which leads to the job being removed from Redis. If we didn't ack the job it would eventually get restored by the SuperFetch as I understand.
Maybe it would be better not to ack the job in case of all unhandled exceptions in Processor#process?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
There actually were 2 errors:
Then it tried to ack the job (perform LREM) and it succeeded since Redis was already started up. I know that probably in this particular case the reliable_push feature could have helped, but we are not using it in our app because in other cases it's much better to get an error than to lose a job because of Redis config being wrong for some reason. |
The fetch/ack process is distinct from the retry process so if Redis or the network blips at the right moment, the behavior can be unpredictable. The right solution would be to make those two actions one atomic action; that would require a major architectural overhaul. One possibly easy patch would be to retry the retry process N times if it fails with a RedisError; that would narrow the window further and might handle your situation where Redis disappears for a few seconds. |
OK, and maybe you can clarify why we should ack the job in case of unhandled exceptions in |
How do we distinguish between handled and unhandled? Once we get to |
But as far as I can see, exceptions that were handled by the retry block are not reraised into |
Yes, they are reraised here. Once we pass this point it is assumed the retry block inside |
Oh, now I see, thanks. Maybe we can wrap all handled exceptions in similar way as we do in |
Under just the right conditions, we could lose a job: - Job raises an error - Retry subsystem catches error and tries to create a retry in Redis but this raises a "Redis down" exception - Processor catches Redis exception and thinks a retry was created - Redis comes back online just in time for the job to be acknowledged and lost That's a very specific and rare set of steps but it can happen. Instead have the Retry subsystem raise a specific error signaling that it created a retry. There will be three common cases: 1. Job is successful: job is acknowledged. 2. Job fails, retry is created, Processor rescues specific error: job is acknowledged. 3. Sidekiq::Shutdown is raised: job is not acknowledged Now there is another case: 4. Job fails, retry fails, Processor rescues Exception: job is NOT acknowledged. Sidekiq Pro's super_fetch will rescue the orphaned job at some point in the future.
Moving to #4141 |
(#4141) Under just the right conditions, we could lose a job: - Job raises an error - Retry subsystem catches error and tries to create a retry in Redis but this raises a "Redis down" exception - Processor catches Redis exception and thinks a retry was created - Redis comes back online just in time for the job to be acknowledged and lost That's a very specific and rare set of steps but it can happen. Instead have the Retry subsystem raise a specific error signaling that it created a retry. There will be three common cases: 1. Job is successful: job is acknowledged. 2. Job fails, retry is created, Processor rescues specific error: job is acknowledged. 3. Sidekiq::Shutdown is raised: job is not acknowledged Now there is another case: 4. Job fails, retry fails, Processor rescues Exception: job is NOT acknowledged. Sidekiq Pro's super_fetch will rescue the orphaned job at some point in the future.
Ruby 2.6.2 with Sidekiq Pro 4.0.5.
I inspected the code and it turns out that in case of exception in
Sidekiq::JobRetry#attempt_retry
we are still acknowledging the job here which leads to the job being removed from Redis. If we didn't ack the job it would eventually get restored by the SuperFetch as I understand.Maybe it would be better not to ack the job in case of all unhandled exceptions in
Processor#process
?The text was updated successfully, but these errors were encountered: