Redis errors in acknowledge for super_fetch cause them to retry only after restart #4495

Closed
mattbooks opened this issue Mar 18, 2020 · 17 comments

Comments

@mattbooks

Ruby version: 2.6.5
Sidekiq / Pro / Enterprise version(s): sidekiq 5.2.7 / sidekiq-pro 4.0.5

Using an old version, but I don't see a fix listed in the sidekiq pro changelog.

Problem
We use super_fetch, and (because of failovers) it sometimes fails to LREM the job from the private "local_queue" when acknowledging the job's success; to make matters worse, the error is swallowed by processor_died. When this happens, the job sits in the private queue until the Sidekiq worker is shut down or rebooted, at which point it is re-enqueued by bulk_requeue. We make sure our jobs are idempotent, so it is fine for them to be re-enqueued, but the fact that the job doesn't run again until we restart is problematic for us: some of the services we interact with are only idempotent for a limited period (one example is 3 days, after which a retry causes an additional side effect), so if we go too long without restarting the application after a failover, we can end up with duplicate side effects.

We can currently detect these by inspecting all of the local/private queues and checking whether any job has been enqueued longer than we expect our jobs to take (see the sketch below). Is there a more reliable way to determine that nothing is actually running those jobs anymore, so that we can re-enqueue them more aggressively?
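For reference, a minimal sketch of that check. It assumes the super_fetch private queues can be found with a Redis key pattern; the `queue:*_private*` pattern and the one-hour threshold are purely illustrative, since the real private-queue key format is internal to sidekiq-pro and varies by version.

```ruby
require "sidekiq"
require "json"

# Illustrative only: scan for super_fetch private queues and report jobs that
# have been sitting there longer than we expect any job to run.
MAX_EXPECTED_RUNTIME = 3600 # seconds; adjust to your workload

Sidekiq.redis do |conn|
  # Hypothetical key pattern; check your sidekiq-pro version for the real one.
  conn.scan_each(match: "queue:*_private*") do |key|
    conn.lrange(key, 0, -1).each do |payload|
      job = JSON.parse(payload)
      age = Time.now.to_f - job["enqueued_at"].to_f
      puts "possibly abandoned job #{job['jid']} (#{age.round}s old) in #{key}" if age > MAX_EXPECTED_RUNTIME
    end
  end
end
```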

I'm also curious: is it intentional that none of the Redis calls retry on connection errors? We have failovers regularly, and if Sidekiq retried some of these commands on connection errors it would make those much smoother for us; I'm not sure if I'm missing some technical reason why that would be dangerous.

@mperham
Collaborator

mperham commented Mar 18, 2020

Sidekiq does automatically handle failover events: https://github.com/mperham/sidekiq/blob/6bd4eaffdce0b85ea387ef60782ffb7d7a2efeae/lib/sidekiq.rb#L98

It does not retry failed commands because that can lead to duplicate job pushes, leading to duplicate job execution.

If you want to scan for orphans more often, I'd recommend you run Sidekiq::Pro::SuperFetch.check_for_orphans as an hourly periodic job. I think it would be a good idea for Sidekiq Enterprise's cron to automatically do this.
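A rough sketch of wiring that up with Sidekiq Enterprise's periodic job API (the exact check_for_orphans invocation may differ between Pro versions, so verify against your installed gem):

```ruby
# Register an hourly job that runs the orphan check. Illustrative sketch only.
Sidekiq.configure_server do |config|
  config.periodic do |mgr|
    mgr.register("0 * * * *", "SuperFetchOrphanCheckJob")
  end
end

class SuperFetchOrphanCheckJob
  include Sidekiq::Worker

  def perform
    # Call signature is assumed here; consult the sidekiq-pro source/docs.
    Sidekiq::Pro::SuperFetch.check_for_orphans
  end
end
```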

@mattbooks
Author

Thanks for the reply and the link to the failover handling. We get "NOREPLICAS Not enough good slaves to write." errors that don't get caught by that handler; would that be something we could add?
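For context, that error comes from Redis's minimum-replica write guard, configured with something along these lines; while replicas are unreachable during a failover, writes on the primary fail with NOREPLICAS.

```
# redis.conf: refuse writes unless at least 1 replica is connected and lagging
# by no more than 10 seconds (Redis 5+ also accepts the min-replicas-*
# spellings of these directives).
min-slaves-to-write 1
min-slaves-max-lag 10
```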

Regarding checking for orphans: I don't believe that would fix our problem. Please correct me if I'm wrong, but it seems like these failed acknowledgements do not create new "super processes" when the processor dies, so the jobs get stuck inside the still-live super process's queue, which means check_for_orphans does not pick them up (they live in a super process that is still registered under super_processes in Redis).

@mperham
Collaborator

mperham commented Mar 18, 2020

Opinion: I think synchronous replication is overkill for Sidekiq. Job data is typically near-ephemeral: in the common case it only exists for a few ms while the job executes. Sync replication adds a lot of overhead to that common case.

But you know your use case better than I do, and I would be happy to see that edge case covered by that block. However, if there aren't enough replicas to allow writes, how can Sidekiq do anything of use?

@mattbooks
Author

However, if there aren't enough replicas to allow writes, how can Sidekiq do anything of use?

This situation is temporary during the failover, so retrying after the replica is back can make it handle those more gracefully.

@mperham
Collaborator

mperham commented Mar 18, 2020

What does that mean for the code? There are N threads, all working. Do they literally just loop ... sleep forever inside the redis method?

@mperham
Collaborator

mperham commented Mar 18, 2020

Send me a PR that solves your case and we can discuss ramifications?

@mattbooks
Author

In our application we use https://github.com/ooyala/retries, and we do occasionally see it take 2 retries. I think it would need some delay; I wouldn't want it to loop forever, probably only 1 or perhaps 2 retries (sketch below).
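At the application level this looks roughly like the following. It is only a sketch: `private_queue` and `job_payload` are placeholder values, and rescuing Redis::CommandError for NOREPLICAS is an assumption about the client in use.

```ruby
require "retries" # https://github.com/ooyala/retries
require "sidekiq"

private_queue = "queue:example_private" # placeholder name
job_payload   = "{}"                    # placeholder payload

# Retry the write a couple of times with a short backoff instead of looping
# forever; a NOREPLICAS failure surfaces as a Redis::CommandError in redis-rb.
with_retries(max_tries: 3, base_sleep_seconds: 0.5, max_sleep_seconds: 2.0,
             rescue: Redis::CommandError) do
  Sidekiq.redis { |conn| conn.lrem(private_queue, -1, job_payload) }
end
```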

Do you have any ideas about check_for_orphans not handling this case?

@mperham
Collaborator

mperham commented Mar 18, 2020

The redis gem itself internally has a reconnect_attempts option. Take a look at it; will that work?
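For reference, that option can be passed straight through Sidekiq's Redis configuration (a sketch; the value shown is arbitrary):

```ruby
# reconnect_attempts is handed to the underlying redis-rb client, which will
# transparently retry re-establishing a dropped connection up to that many
# times before raising.
Sidekiq.configure_server do |config|
  config.redis = { url: ENV["REDIS_URL"], reconnect_attempts: 3 }
end

Sidekiq.configure_client do |config|
  config.redis = { url: ENV["REDIS_URL"], reconnect_attempts: 3 }
end
```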

@mperham
Collaborator

mperham commented Mar 18, 2020

This might be vague hand-waving, but: SuperFetch is meant to prevent job loss. It does not guarantee a time window for recovering jobs, because guarantees often don't work in a failing system. If the LREM fails, I can see how that might cause problems. Those jobs won't be recovered until the process is shut down; running check_for_orphans won't fix that.

Can you give me a little background on why you are seeing failovers so often? We ran Redis for three years without any downtime or failures; it proved to be very reliable for us.

@mattbooks
Author

We have an organization at my company that manages our database infrastructure, including Redis, and they frequently need to make changes: moving things around, performing upgrades, etc.

@mattbooks
Author

I looked into reconnect_attempts; it won't work, since the client connects just fine but can't write. The fix would need to wait until the primary becomes writable again.

@mperham
Collaborator

mperham commented Mar 18, 2020

Let me push back just a little bit and ask: can your org implement failovers in a more standard fashion, so that the existing handler works for you too?

@mperham
Collaborator

mperham commented Mar 18, 2020

Would it be enough to add NOREPLICAS to that existing handler regexp?
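Roughly, the kind of handling being discussed, as a standalone sketch (not the actual Sidekiq.redis implementation; the real handler also disconnects the socket so the client reconnects to the new primary):

```ruby
require "sidekiq"

# Wrap a Redis call and, on a failover-style command error (READONLY today,
# NOREPLICAS as proposed), retry exactly once rather than looping.
def redis_with_failover_retry
  retried = false
  begin
    Sidekiq.redis { |conn| yield conn }
  rescue Redis::CommandError => ex
    raise if retried || ex.message !~ /READONLY|NOREPLICAS/
    retried = true
    retry
  end
end

redis_with_failover_retry { |conn| conn.ping }
```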

@mperham
Collaborator

mperham commented Mar 18, 2020

The READONLY error appears to be how the major Redis SaaSes implement failover: the existing primary becomes readonly to existing client connections. That's why I call it "more standard".

@mattbooks
Author

I definitely agree that the way we handle failovers is non-standard and, as a result, may not make sense to handle in the gem. It's not feasible for me to change how we do failovers, however.

Perhaps there's a way to allow for custom error handling instead?

Also, just to summarize the abandoned-jobs issue: it sounds like there's no better way to identify those than looping through the private queues and checking for old jobs?

@mattbooks
Author

Also, I suspect adding NOREPLICAS to the existing handler would have a limited effect. If I'm understanding correctly, the reconnect itself isn't what would help in our case; only the delay it introduces would matter, and since a reconnect is probably fast, it likely would not fix most cases.

@mperham
Collaborator

mperham commented Mar 18, 2020

I would be ok with PRs that make it easier for you to implement your own custom failover handling. I'm not sure how to handle your super_fetch issue without specific logic in the acknowledge method. I would actually be ok with that if you want to email me private diffs with code that handles your situation.

franzliedke added a commit to franzliedke/sidekiq that referenced this issue Sep 6, 2021
These errors can occur during Sidekiq's long-running job fetching
command. This uses Redis' blocking BRPOP primitive. On failover in a
cluster setup, these commands are interrupted by the server.

This error causes the worker threads to be restarted, but as they are
bubbled up to the top, they cause a lot of spam in our error logging
systems. As related errors from other commands are being handled this
way (see sidekiq#2550 and sidekiq#4495), it seems sensible to also handle this one.
franzliedke added a commit to franzliedke/sidekiq that referenced this issue Sep 6, 2021