Reconnect on known errors after failover when pushing jobs to Redis #5159
Put a |
In the discussion thread where we talked about this, we also touched on extracting something re-usable here. This may become more important if further configuration is added (e.g. the number of retries and a delay). What do you think?
I think it's worthwhile to extract the block, but I don't want to extract configuration knobs because there is no right default value. I think it is ok as is unless someone can show a good reason otherwise.
Thanks for the hint. I added comments in both places, so you don't miss updating both in the future when it needs changes.
In a Redis cluster setup, failovers will happen. In these cases a `Redis::CommandError` can be raised for different reasons, for example when the server becomes a replica, when there is a "Not enough replicas" error from the primary, or when a blocking command is force-unblocked. These errors can occur when pushing a job to Redis, so it needs to reconnect to the current master node and retry. Otherwise, these jobs are lost. The retry logic is similar to the implementation for `Sidekiq.redis`.
In our app (@Dome-GER is a colleague 😁), a way to increase the retry count (and add a delay in between) would already help. Our master is currently identified via DNS, which takes a bit longer to update, so the first retry often fails again. As for defaults, one retry and no delay (just like now) should work? (This puts the harder work on those who, like us, have slightly atypical setups.)
Hmm, that's fair. If the "error" is a known edge case (e.g. migration failover) requiring specific settings, I think it's ok to specialize the retry. Want to submit a PR which we can discuss?
Sure. I was hoping you would say so... that's a good place to introduce a re-usable util then, too.
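A rough sketch of what such a re-usable util might look like, with one retry and no delay as the defaults discussed above. Everything here is an assumption for illustration and not an existing Sidekiq API: `with_failover_retry`, its `retries:` and `delay:` options, and the `FlakyConnection` stand-in are hypothetical. The sketch rescues `StandardError` so that it runs without the redis gem; real code would presumably rescue `Redis::CommandError` instead.

```ruby
# Error message prefixes that Redis replies with in the failover cases
# (server became a replica, not enough replicas, blocking command unblocked).
FAILOVER_ERRORS = /READONLY|NOREPLICAS|UNBLOCKED/

# Hypothetical helper: yields the connection and, on a known failover error,
# disconnects, optionally sleeps, and tries again up to `retries` times.
def with_failover_retry(conn, retries: 1, delay: 0)
  attempts = 0
  begin
    yield conn
  rescue StandardError => ex
    # Anything unknown, or a failure after the last allowed retry, is re-raised.
    raise unless attempts < retries && ex.message.match?(FAILOVER_ERRORS)
    attempts += 1
    conn.disconnect!              # force a fresh connection on the next attempt
    sleep(delay) if delay > 0     # give a DNS-based failover time to settle
    retry
  end
end

# Hypothetical stand-in for a Redis connection that fails the first
# `failures` pushes with a READONLY error, as during a failover.
class FlakyConnection
  attr_reader :calls, :disconnects

  def initialize(failures)
    @failures = failures
    @calls = 0
    @disconnects = 0
  end

  def disconnect!
    @disconnects += 1
  end

  def push(payload)
    @calls += 1
    raise "READONLY You can't write against a read only replica." if @calls <= @failures
    payload
  end
end

conn = FlakyConnection.new(2)
result = with_failover_retry(conn, retries: 3, delay: 0.01) { |c| c.push("job") }
```

With the defaults (`retries: 1`, `delay: 0`) the same connection would still raise on its second failure, which matches the current behaviour; only setups that opt in to more retries or a delay would see a difference.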
This PR targets the discussion in #4990.
In a Redis cluster setup, failovers will happen. In these cases, a `Redis::CommandError` can be raised for different reasons, for example, when the server becomes a replica, when there is a "Not enough replicas" error from the primary, or when a blocking command is force-unblocked.
Sample stacktrace
These errors can occur when pushing a job to Redis, so it needs to reconnect to the current master node and retry.
Reconnecting to Redis is handled for `Sidekiq.redis` and has been extended in #4985, but the Client's `#raw_push` method directly accesses the Redis connection pool, i.e. scheduling circumvents `Sidekiq.redis`. Therefore, proper retry logic (similar to `Sidekiq.redis`) is added.
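For illustration, the retry described above can be sketched roughly as follows. This is not the code from the PR: `FakeCommandError` and `FakeConnection` are hypothetical stand-ins for `Redis::CommandError` and a pooled connection, so that the snippet runs without a Redis server, and `push_with_reconnect` is an assumed name. The `READONLY`, `NOREPLICAS` and `UNBLOCKED` prefixes are the Redis error replies for the three cases listed above.

```ruby
# Stand-in for Redis::CommandError so the sketch runs without the redis gem.
class FakeCommandError < StandardError; end

# Hypothetical connection that rejects writes with READONLY until it has
# been disconnected once (simulating a reconnect to the new primary).
class FakeConnection
  attr_reader :pushed, :disconnects

  def initialize
    @readonly = true
    @pushed = []
    @disconnects = 0
  end

  def disconnect!
    @disconnects += 1
    @readonly = false
  end

  def lpush(queue, payload)
    raise FakeCommandError, "READONLY You can't write against a read only replica." if @readonly
    @pushed << [queue, payload]
  end
end

RETRYABLE = /READONLY|NOREPLICAS|UNBLOCKED/

# Runs the block once; on a known failover error it disconnects and
# retries exactly one more time. Any other error is re-raised untouched.
def push_with_reconnect(conn)
  retryable = true
  begin
    yield conn
  rescue FakeCommandError => ex
    if retryable && ex.message =~ RETRYABLE
      conn.disconnect!
      retryable = false
      retry
    end
    raise
  end
end

conn = FakeConnection.new
push_with_reconnect(conn) { |c| c.lpush("queue:default", '{"class":"HardJob"}') }
```

The `retryable` flag lives outside the `begin` block on purpose: `retry` re-runs only the block, so the flag stays `false` on the second attempt and a repeated failure is raised to the caller instead of looping.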