Redis errors in acknowledge for super_fetch cause them to retry only after restart #4495
Sidekiq does automatically handle failover events: https://github.com/mperham/sidekiq/blob/6bd4eaffdce0b85ea387ef60782ffb7d7a2efeae/lib/sidekiq.rb#L98 It does not retry failed commands because that can lead to duplicate job pushes, and thus duplicate job execution. If you want to scan for orphans more often, I'd recommend you run …
Thanks for the reply and link to the failover handling; we get …

Regarding checking for orphans — I don't believe that would fix our problem. Please correct me if I'm wrong, but it seems like these do not create new "super processes" when the processor dies, so the jobs get stuck inside the still-live super process queue, which means that …
Opinion: I think synchronous replication is overkill for Sidekiq; job data is typically near-ephemeral — the common case is that it only exists for a few ms while the job executes. Sync replication adds a lot of overhead to that common case. But you know your use case better than I do, and I would be happy to see that edge case covered by that block. However, if there aren't enough replicas to allow writes, how can Sidekiq do anything of use?
This situation is temporary during the failover, so retrying after the replica is back would let Sidekiq handle those cases more gracefully.
What does that mean for the code? There are N threads, all working. Do they literally just …
Send me a PR that solves your case and we can discuss ramifications? |
In our application, we use https://github.com/ooyala/retries, and we do occasionally see it take 2 retries. I think it would need some delay — I wouldn't want it to loop forever; probably only 1 or perhaps 2 retries. Do you have any ideas about …
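The bounded retry-with-delay idea above can be sketched as follows. This is a minimal, hypothetical helper (not the ooyala/retries API, though that gem offers similar behavior), and `TransientError` stands in for the Redis connection-error class so the sketch is self-contained:

```ruby
# Stand-in for Redis::BaseConnectionError so this sketch runs anywhere.
class TransientError < StandardError; end

# Retry the block at most `max_retries` times on transient errors,
# sleeping a little longer before each attempt, then give up and re-raise.
def with_retries(max_retries: 2, delay: 0.1)
  attempts = 0
  begin
    yield
  rescue TransientError
    attempts += 1
    raise if attempts > max_retries   # bounded: don't loop forever
    sleep(delay * attempts)           # small, growing backoff
    retry
  end
end

# Usage (illustrative): wrap a single acknowledge-style Redis call, e.g.
#   with_retries { conn.lrem(private_queue, -1, job_payload) }
```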
The redis gem itself internally has a …
This might be vague hand-waving, but: SuperFetch is meant to prevent job loss. It does not guarantee a time window for recovering jobs, because guarantees often don't hold in a failing system. If the LREM fails, I can see how that might cause problems: those jobs won't be recovered until the process is shut down, and running check_for_orphans won't fix that. Can you give me a little background on why you are seeing failovers so often? We ran Redis for three years without any downtime or failures; it proved very reliable for us.
We have an organization at my company that manages our database infrastructure, including Redis, and they frequently need to make changes: moving things around, performing upgrades, etc.
I looked into …
Let me push back just a little bit and ask: can your org implement failovers in a more standard fashion, so that the existing handler works for you too? |
Would it be enough to add NOREPLICAS to that existing handler regexp? |
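For illustration, widening the existing handler's pattern could be as simple as extending the regexp it matches error messages against. This is a sketch of the idea, not the actual Sidekiq source; the real handler disconnects and retries once when the pattern matches:

```ruby
# Sketch: a failover-detection pattern that treats both READONLY errors
# (primary demoted to replica) and NOREPLICAS errors (min-replicas-to-write
# unsatisfied) as transient failover conditions worth a reconnect-and-retry.
FAILOVER_PATTERN = /READONLY|NOREPLICAS/

def failover_error?(message)
  !!(message =~ FAILOVER_PATTERN)
end
```

In the real handler, a match would trigger `conn.disconnect!` followed by a single `retry`, so the client reopens its socket against the new primary.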
The READONLY error appears to be how the major Redis SaaSes implement failover: the existing primary becomes readonly to existing client connections. That's why I call it "more standard". |
I definitely agree that the way we handle failovers is non-standard and as a result may not make sense to handle in the gem. It's not feasible for me to change how we do failovers, however. Perhaps there's a way to allow for custom error handling instead? Also, just to summarize the abandoned-jobs issue: it sounds like there's no better way to identify those than looping through the private queues and checking for old jobs?
Also I suspect adding …
I would be ok with PRs that make it easier for you to implement your own custom failover handling. I'm not sure how to handle your super_fetch issue without specific logic in the acknowledge method. I would actually be ok with that if you want to email me private diffs with code that handles your situation. |
These errors can occur during Sidekiq's long-running job fetching command, which uses Redis's blocking BRPOP primitive. During a failover in a cluster setup, these commands are interrupted by the server. The error causes the worker threads to be restarted, but as it bubbles up to the top it causes a lot of spam in our error-logging systems. Since related errors from other commands are already handled this way (see sidekiq#2550 and sidekiq#4495), it seems sensible to also handle this one.
Ruby version: 2.6.5
Sidekiq / Pro / Enterprise version(s): sidekiq 5.2.7 / sidekiq-pro 4.0.5
Using an old version, but I don't see a fix listed in the sidekiq pro changelog.
Problem
We use super_fetch, and (because of failovers) it sometimes fails to `lrem` the job from the "local_queue" when acknowledging the job's success (and to make matters worse, the error is swallowed by `processor_died`). When this happens, the job sits there until the Sidekiq worker is shut down/rebooted, at which point it is re-enqueued by `bulk_requeue`. We make sure that our jobs are idempotent, so it is fine for them to be re-enqueued, but the fact that the job doesn't run again until we restart is somewhat problematic for us, because some of the services we interact with are only idempotent for a period of time (one example is 3 days, after which a retry will cause an additional side effect); so if we go too long without restarting our application after a failover, we can run into duplicate side effects.

We can also detect these by inspecting all of the local/private queues and seeing if any jobs have been enqueued longer than we expect our jobs to take — is there any more reliable way to determine that nothing is actually running those jobs anymore, and to re-enqueue them more aggressively?
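The "inspect the private queues for old jobs" workaround could be sketched as follows. The queue data here is passed in directly as a hash; in practice a helper would gather it from Redis (e.g. via SCAN and LRANGE over super_fetch's private queues), and the queue names shown are illustrative assumptions, not Sidekiq Pro's actual naming:

```ruby
# Given { private_queue_name => [enqueued_at_epoch_seconds, ...] },
# return the names of queues holding any job older than max_age_seconds.
# Such jobs are candidates for being stuck after a failed LREM acknowledge.
def stale_private_queues(queues, max_age_seconds:, now: Time.now.to_f)
  queues.select do |_name, enqueued_ats|
    enqueued_ats.any? { |t| now - t > max_age_seconds }
  end.keys
end
```

A monitoring job could run this periodically and alert (or re-enqueue, given idempotent jobs) well before the external services' idempotency windows expire.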
I'm also curious: is it intentional that none of the Redis calls have retries for connection errors? We have failovers regularly, and if Sidekiq retried some of these commands on connection errors it would make those much smoother for us; not sure if I'm missing some technical reason why that would be dangerous.