
Reconnect on known errors after failover when pushing jobs to Redis #5159

Merged: 1 commit, Feb 2, 2022

Conversation

Dome-GER (Contributor)

This PR addresses the discussion in #4990.

In a Redis cluster setup, failovers will happen. In these cases,
a `Redis::CommandError` can be raised for different reasons:
for example, when the server becomes a replica, when there
is a "Not enough replicas" error from the primary, or when a
blocking command is force-unblocked.

Sample stack trace:
  Redis::CommandError: READONLY You can't write against a read only replica.
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:216:in `call_pipelined'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:170:in `block in call_pipeline'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:313:in `with_reconnect'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:168:in `call_pipeline'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:2490:in `block in multi'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:69:in `block in synchronize'
  from monitor.rb:202:in `synchronize'
  from monitor.rb:202:in `mon_synchronize'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:69:in `synchronize'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:2483:in `multi'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/client.rb:189:in `block in raw_push'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:63:in `block (2 levels) in with'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:62:in `handle_interrupt'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:62:in `block in with'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59:in `handle_interrupt'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59:in `with'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/client.rb:188:in `raw_push'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/client.rb:74:in `push'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/worker.rb:240:in `client_push'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/worker.rb:200:in `perform_async'
  from app/models/my_model.rb:49:in `schedule_sync!'

These errors can occur while pushing a job to Redis, so the client
needs to reconnect to the current master node and retry.

Reconnecting to Redis is handled for `Sidekiq.redis` and was
extended in #4985, but the Client's `#raw_push` method accesses
the Redis connection pool directly, i.e. scheduling circumvents
`Sidekiq.redis`.

Therefore, this PR adds proper retry logic, similar to `Sidekiq.redis`.
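To illustrate the reconnect-and-retry pattern being described, here is a minimal, self-contained sketch. `CommandError`, `FakeConn`, and `push_job` are stand-ins invented for this example (the real code rescues `Redis::BaseError` around the connection-pool block in `Sidekiq::Client#raw_push`); the error-message patterns are the kind of failover errors discussed above, not the PR's literal diff.

```ruby
# Stand-in for Redis::CommandError, so the sketch runs without Redis.
class CommandError < StandardError; end

# Simulates a connection whose first write hits a stale primary that has
# become a read-only replica after a failover; the retry then succeeds.
class FakeConn
  def initialize
    @calls = 0
  end

  def multi
    @calls += 1
    raise CommandError, "READONLY You can't write against a read only replica." if @calls == 1
    :pushed
  end
end

# Hypothetical push helper: retry known failover errors exactly once,
# re-raise everything else (and anything that fails twice).
def push_job(conn)
  retryable = true
  begin
    conn.multi
  rescue CommandError => ex
    if retryable && ex.message =~ /READONLY|NOREPLICAS|UNBLOCKED/
      retryable = false
      retry
    end
    raise
  end
end
```

Limiting the rescue to known error messages keeps the single retry from masking unrelated failures, which mirrors the trade-off discussed in this thread.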

@mperham (Collaborator) commented Jan 31, 2022

Put a `# this is copied from sidekiq.rb` comment in both places so we know to update both places in the future should that code block need to change.

@franzliedke (Contributor)

In the discussion thread where we talked about this, we also touched on extracting something re-usable here. This may become more important when potentially adding further configuration (e.g. for number of retries, and a delay) here. What do you think?

@mperham (Collaborator) commented Feb 1, 2022

I think it's worthwhile to extract the block but I don't want to extract configuration knobs because there is no right default value. I think it is ok as is unless someone can show a good reason otherwise.

@Dome-GER (Contributor, Author) commented Feb 2, 2022

Thanks for the hint. I added comments in both places, so we won't miss updating both in the future when the code needs changes.

In a Redis cluster setup, failovers will happen. In these
cases a `Redis::CommandError` can be raised for different
reasons, for example when the server becomes a replica, when
there is a "Not enough replicas" error from the primary, or
when a blocking command is force-unblocked.

These errors can occur when pushing a job to Redis, so it needs
to reconnect to the current master node and retry. Otherwise,
these jobs are lost.

The retry logic is similar to the implementation for `Sidekiq.redis`.
@franzliedke (Contributor) commented Feb 3, 2022

> I think it's worthwhile to extract the block but I don't want to extract configuration knobs because there is no right default value. I think it is ok as is unless someone can show a good reason otherwise.

In our app (@Dome-GER is a colleague 😁), a way to increase the retry count (and a delay in between) would already help. Our master is currently identified via DNS, which takes a bit longer to update, so the first retry often fails again.

As for defaults, one retry and no delay (just like now) should work? (This puts the harder work on those, like us, who have slightly atypical setups.)
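A reusable helper with the knobs franzliedke describes could look roughly like this. The name `with_redis_retries` and its parameters are hypothetical (no such API shipped in Sidekiq at this point); the defaults of one retry and no delay match the behavior suggested above, and `CommandError` stands in for `Redis::CommandError`.

```ruby
# Stand-in for Redis::CommandError, so the sketch runs without Redis.
class CommandError < StandardError; end

# Hypothetical reusable retry wrapper: callers with slow DNS-based
# failover (as described above) could raise max_retries and add a delay.
def with_redis_retries(max_retries: 1, delay: 0)
  attempts = 0
  begin
    yield
  rescue CommandError => ex
    # Only known failover errors are retried; anything else re-raises.
    raise unless ex.message =~ /READONLY|NOREPLICAS|UNBLOCKED/
    attempts += 1
    raise if attempts > max_retries
    sleep(delay) if delay > 0
    retry
  end
end
```

With the defaults this behaves like a single immediate retry; a DNS-based setup might call it as `with_redis_retries(max_retries: 3, delay: 1) { ... }` to wait out the record update.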

@mperham (Collaborator) commented Feb 3, 2022

Hmm, that's fair. If the "error" is a known edge case (e.g. a migration failover) requiring specific settings, I think it's ok to specialize the retry behavior. Want to submit a PR which we can discuss?

@franzliedke (Contributor)

Sure. I was hoping you would say so... that's a good place to introduce a re-usable util then, too.
