
Reconnect on "UNBLOCKED force unblock" errors #4985

Merged
merged 1 commit into sidekiq:master on Sep 7, 2021

Conversation

franzliedke
Contributor

Thanks a lot for Sidekiq!

In our setup with a Redis cluster, we regularly experience failovers,
which is expected. We are currently improving how our application
handles these situations and stumbled across a possible improvement
in Sidekiq itself.

By far the most frequent error in our error monitoring system is the
following:

UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)

These errors can occur during Sidekiq's long-running job fetching
command, which uses Redis' blocking BRPOP primitive. During a failover
in a cluster setup, the server interrupts these blocking commands.

The error causes the affected worker threads to be restarted, but because
the exceptions bubble all the way up, they generate a lot of noise in our
error logging systems. Since related errors from other commands are already
handled this way (see #2550 and #4495), it seems sensible to handle this
one as well.
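Roughly, the existing handling (and the extension proposed here) looks like the sketch below. It assumes the READONLY/NOREPLICAS handling from #2550 and #4495 lives in `Sidekiq.redis` and matches on the error message; names and the exact regex are illustrative, not a verbatim copy of Sidekiq's source.

```ruby
# Sketch: reconnect and retry once when Redis reports a failover-related
# error during any command issued through the connection pool.
def redis
  raise ArgumentError, "requires a block" unless block_given?

  redis_pool.with do |conn|
    retryable = true
    begin
      yield conn
    rescue Redis::BaseError => ex
      # READONLY / NOREPLICAS are already handled (#2550, #4495); this PR
      # adds UNBLOCKED, which the server raises when a blocking command such
      # as BRPOP is force-unblocked during a master -> replica transition.
      if retryable && ex.message =~ /READONLY|NOREPLICAS|UNBLOCKED/
        conn.disconnect! # drop the stale socket so the next attempt reconnects
        retryable = false
        retry
      end
      raise
    end
  end
end
```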

Details

Stacktrace
Redis::CommandError: UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:132:in `call'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:226:in `block in call_with_timeout'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:300:in `with_socket_timeout'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:225:in `call_with_timeout'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:1224:in `block in _bpop'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:69:in `block in synchronize'
  from monitor.rb:202:in `synchronize'
  from monitor.rb:202:in `mon_synchronize'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:69:in `synchronize'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:1221:in `_bpop'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:1266:in `brpop'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/fetch.rb:47:in `block in retrieve_work'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq.rb:98:in `block in redis'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:63:in `block (2 levels) in with'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:62:in `handle_interrupt'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:62:in `block in with'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59:in `handle_interrupt'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59:in `with'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq.rb:95:in `redis'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/fetch.rb:47:in `retrieve_work'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/processor.rb:83:in `get_one'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/processor.rb:95:in `fetch'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/processor.rb:77:in `process_one'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/processor.rb:68:in `run'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/util.rb:43:in `watchdog'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/util.rb:52:in `block in safe_thread'
  from vendor/ruby/2.7.0/gems/logging-2.3.0/lib/logging/diagnostic_context.rb:474:in `block in create_with_logging_context'

@mperham
Collaborator

mperham commented Sep 6, 2021

Any reason not to trigger on the UNBLOCKED code, just like the others?

@franzliedke
Contributor Author

I wasn't sure about that part either. The part I'm matching on seems more specific to the failover scenario, which is what we're interested in here. I don't know what other scenarios could lead to such an error.

@mperham
Collaborator

mperham commented Sep 7, 2021

I think the uppercase word is the "error code". The message after it is informational and could, for instance, be translated into other languages. We should perform logic on the error code only.

@franzliedke
Contributor Author

That sounds sensible. Changed it.
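For illustration (not the exact diff), the difference boils down to which part of the message the check keys on; treat these snippets as assumptions rather than the literal lines in the PR:

```ruby
# Before: matching on the informative part of the message, which Redis
# could reword (or, as noted above, translate).
ex.message.include?("force unblock from blocking operation")

# After: matching only on the leading error code, which stays stable
# even if the rest of the message changes.
ex.message =~ /UNBLOCKED/
```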

@franzliedke
Contributor Author

Oh, one more thing: I also considered handling this directly in the Processor code that executes the blocking BRPOP. Would you prefer that? I don't know enough about Sidekiq internals to say whether this may be relevant in other places.
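For completeness, what I had in mind was roughly the following (a hypothetical sketch, not the merged change; `retrieve_work` appears in the stack trace above, while `queues_cmd` and `UnitOfWork` are names as I understand them from lib/sidekiq/fetch.rb):

```ruby
# Hypothetical alternative: rescue the error right where the blocking
# BRPOP is issued, inside the fetch strategy.
def retrieve_work
  work = Sidekiq.redis { |conn| conn.brpop(*queues_cmd) }
  UnitOfWork.new(*work) if work
rescue Redis::CommandError => ex
  raise unless ex.message.start_with?("UNBLOCKED")
  nil # treat the interrupted BRPOP as an empty fetch; the processor loop retries
end
```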

@mperham
Collaborator

mperham commented Sep 7, 2021

I think it's better to have a blanket policy to reset the connection upon any of these error codes.

@mperham mperham merged commit 237c70f into sidekiq:master Sep 7, 2021
@franzliedke
Contributor Author

Thanks a lot!

@mperham I read somewhere you only do releases every few months. That is totally fine, of course. Is the main branch considered stable, though?

@mperham
Collaborator

mperham commented Sep 8, 2021

Yes, main is stable.
