
Reconnect on known errors after failover when pushing jobs to Redis #5159

Merged: 1 commit, Feb 2, 2022

Conversation

Dome-GER (Contributor)

This PR addresses the discussion in #4990.

In a Redis cluster setup, failovers will happen. In these cases,
a `Redis::CommandError` can be raised for different reasons:
for example, when the server becomes a replica, when there
is a "Not enough replicas" error from the primary, or when a
blocking command is force-unblocked.

Sample stack trace:
  Redis::CommandError: READONLY You can't write against a read only replica.
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:216:in `call_pipelined'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:170:in `block in call_pipeline'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:313:in `with_reconnect'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:168:in `call_pipeline'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:2490:in `block in multi'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:69:in `block in synchronize'
  from monitor.rb:202:in `synchronize'
  from monitor.rb:202:in `mon_synchronize'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:69:in `synchronize'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:2483:in `multi'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/client.rb:189:in `block in raw_push'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:63:in `block (2 levels) in with'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:62:in `handle_interrupt'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:62:in `block in with'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59:in `handle_interrupt'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59:in `with'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/client.rb:188:in `raw_push'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/client.rb:74:in `push'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/worker.rb:240:in `client_push'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/worker.rb:200:in `perform_async'
  from app/models/my_model.rb:49:in `schedule_sync!'

These errors can occur while pushing a job to Redis, so the client
needs to reconnect to the current master node and retry.

Reconnecting to Redis is handled for `Sidekiq.redis` and was
extended in #4985, but the Client's `#raw_push` method accesses
the Redis connection pool directly, i.e. scheduling circumvents
`Sidekiq.redis`.

Therefore, this PR adds proper retry logic, similar to `Sidekiq.redis`.
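To illustrate the reconnect-and-retry pattern being described, here is a minimal, self-contained sketch. `CommandError`, `FakeConn`, and `push_job` are stand-ins invented for this example (the real code rescues `Redis::BaseError` around the connection-pool block in `Sidekiq::Client#raw_push`); the error-message patterns are the kind of failover errors discussed above, not the PR's literal diff.

```ruby
# Stand-in for Redis::CommandError, so the sketch runs without Redis.
class CommandError < StandardError; end

# Simulates a connection whose first write hits a stale primary that has
# become a read-only replica after a failover; the retry then succeeds.
class FakeConn
  def initialize
    @calls = 0
  end

  def multi
    @calls += 1
    raise CommandError, "READONLY You can't write against a read only replica." if @calls == 1
    :pushed
  end
end

# Hypothetical push helper: retry known failover errors exactly once,
# re-raise everything else (and anything that fails twice).
def push_job(conn)
  retryable = true
  begin
    conn.multi
  rescue CommandError => ex
    if retryable && ex.message =~ /READONLY|NOREPLICAS|UNBLOCKED/
      retryable = false
      retry
    end
    raise
  end
end
```

Limiting the rescue to known error messages keeps the single retry from masking unrelated failures, which mirrors the trade-off discussed in this thread.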

@mperham (Collaborator) commented Jan 31, 2022

Put a `# this is copied from sidekiq.rb` comment in both places so we know to update both places in the future should that code block need to change.

@franzliedke (Contributor)

In the discussion thread where we talked about this, we also touched on extracting something re-usable here. This may become more important when potentially adding further configuration (e.g. for number of retries, and a delay) here. What do you think?

@mperham (Collaborator) commented Feb 1, 2022

I think it's worthwhile to extract the block but I don't want to extract configuration knobs because there is no right default value. I think it is ok as is unless someone can show a good reason otherwise.

@Dome-GER (Contributor, Author) commented Feb 2, 2022

Thanks for the hint. I added comments in both places, so we won't miss updating both in the future when the code needs changes.

In a Redis cluster setup, failovers will happen. In these
cases a `Redis::CommandError` can be raised for different
reasons, for example when the server becomes a replica, when
there is a "Not enough replicas" error from the primary, or
when a blocking command is force-unblocked.

These errors can occur when pushing a job to Redis, so it needs
to reconnect to the current master node and retry. Otherwise,
these jobs are lost.

The retry logic is similar to the implementation for `Sidekiq.redis`.
@franzliedke (Contributor) commented Feb 3, 2022

> I think it's worthwhile to extract the block but I don't want to extract configuration knobs because there is no right default value. I think it is ok as is unless someone can show a good reason otherwise.

In our app (@Dome-GER is a colleague 😁), a way to increase the retry count (and a delay in between) would already help. Our master is currently identified via DNS, which takes a bit longer to update, so the first retry often fails again.

As for defaults, one retry and no delay (just like now) should work? (This puts the harder work on those, like us, who have slightly atypical setups.)
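A reusable helper with the knobs franzliedke describes could look roughly like this. The name `with_redis_retries` and its parameters are hypothetical (no such API shipped in Sidekiq at this point); the defaults of one retry and no delay match the behavior suggested above, and `CommandError` stands in for `Redis::CommandError`.

```ruby
# Stand-in for Redis::CommandError, so the sketch runs without Redis.
class CommandError < StandardError; end

# Hypothetical reusable retry wrapper: callers with slow DNS-based
# failover (as described above) could raise max_retries and add a delay.
def with_redis_retries(max_retries: 1, delay: 0)
  attempts = 0
  begin
    yield
  rescue CommandError => ex
    # Only known failover errors are retried; anything else re-raises.
    raise unless ex.message =~ /READONLY|NOREPLICAS|UNBLOCKED/
    attempts += 1
    raise if attempts > max_retries
    sleep(delay) if delay > 0
    retry
  end
end
```

With the defaults this behaves like a single immediate retry; a DNS-based setup might call it as `with_redis_retries(max_retries: 3, delay: 1) { ... }` to wait out the record update.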

@mperham (Collaborator) commented Feb 3, 2022

Hmm, that's fair. If the "error" is a known edge case (e.g. a migration failover) requiring specific settings, I think it's ok to specialize the retry behavior. Want to submit a PR which we can discuss?

@franzliedke (Contributor)

Sure. I was hoping you would say so... that's a good place to introduce a re-usable util then, too.
