
Reconnect on "UNBLOCKED force unblock" errors #4985

Merged
merged 1 commit into sidekiq:master on Sep 7, 2021

Conversation

franzliedke
Contributor

Thanks a lot for Sidekiq!

In our setup with a Redis cluster, we regularly experience failovers,
which is expected. We are currently improving how our application
handles these situations and stumbled across a possible improvement
in Sidekiq itself.

By far the most frequent error in our error monitoring system is the
following:

UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)

These errors can occur during Sidekiq's long-running job fetching
command, which uses Redis' blocking BRPOP primitive. During a failover
in a cluster setup, the server interrupts these blocking commands.

The error causes the affected worker threads to be restarted, but because
the exceptions bubble all the way up, they generate a lot of noise in our
error logging systems. Since related errors from other commands are already
handled this way (see #2550 and #4495), it seems sensible to handle this
one as well.
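Roughly, the existing handling (and the extension proposed here) looks like the sketch below. It assumes the READONLY/NOREPLICAS handling from #2550 and #4495 lives in `Sidekiq.redis` and matches on the error message; names and the exact regex are illustrative, not a verbatim copy of Sidekiq's source.

```ruby
# Sketch: reconnect and retry once when Redis reports a failover-related
# error during any command issued through the connection pool.
def redis
  raise ArgumentError, "requires a block" unless block_given?

  redis_pool.with do |conn|
    retryable = true
    begin
      yield conn
    rescue Redis::BaseError => ex
      # READONLY / NOREPLICAS are already handled (#2550, #4495); this PR
      # adds UNBLOCKED, which the server raises when a blocking command such
      # as BRPOP is force-unblocked during a master -> replica transition.
      if retryable && ex.message =~ /READONLY|NOREPLICAS|UNBLOCKED/
        conn.disconnect! # drop the stale socket so the next attempt reconnects
        retryable = false
        retry
      end
      raise
    end
  end
end
```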

Details

Stacktrace
Redis::CommandError: UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:132:in `call'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:226:in `block in call_with_timeout'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:300:in `with_socket_timeout'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis/client.rb:225:in `call_with_timeout'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:1224:in `block in _bpop'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:69:in `block in synchronize'
  from monitor.rb:202:in `synchronize'
  from monitor.rb:202:in `mon_synchronize'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:69:in `synchronize'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:1221:in `_bpop'
  from vendor/ruby/2.7.0/gems/redis-4.2.5/lib/redis.rb:1266:in `brpop'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/fetch.rb:47:in `block in retrieve_work'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq.rb:98:in `block in redis'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:63:in `block (2 levels) in with'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:62:in `handle_interrupt'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:62:in `block in with'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59:in `handle_interrupt'
  from vendor/ruby/2.7.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59:in `with'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq.rb:95:in `redis'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/fetch.rb:47:in `retrieve_work'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/processor.rb:83:in `get_one'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/processor.rb:95:in `fetch'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/processor.rb:77:in `process_one'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/processor.rb:68:in `run'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/util.rb:43:in `watchdog'
  from vendor/ruby/2.7.0/gems/sidekiq-6.2.1/lib/sidekiq/util.rb:52:in `block in safe_thread'
  from vendor/ruby/2.7.0/gems/logging-2.3.0/lib/logging/diagnostic_context.rb:474:in `block in create_with_logging_context'

@mperham
Collaborator

mperham commented Sep 6, 2021

Any reason not to trigger on the UNBLOCKED code, just like the others?

@franzliedke
Contributor Author

I wasn't sure about that part either. The part I'm matching on seems more specific to the failover scenario, which is what we're interested in here. I don't know what other scenarios could lead to such an error.

@mperham
Collaborator

mperham commented Sep 7, 2021

I think the uppercase word is the "error code". The message after it is informational and could, for instance, be translated into other languages. We should perform logic on the error code only.

@franzliedke
Contributor Author

That sounds sensible. Changed it.
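For illustration (not the exact diff), the difference boils down to which part of the message the check keys on; treat these snippets as assumptions rather than the literal lines in the PR:

```ruby
# Before: matching on the informative part of the message, which Redis
# could reword (or, as noted above, translate).
ex.message.include?("force unblock from blocking operation")

# After: matching only on the leading error code, which stays stable
# even if the rest of the message changes.
ex.message =~ /UNBLOCKED/
```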

@franzliedke
Contributor Author

Oh, one more thing: I also considered handling this directly in the Processor code that executes the blocking BRPOP. Would you prefer that? I don't know enough about Sidekiq internals to say whether this may be relevant in other places.
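For completeness, what I had in mind was roughly the following (a hypothetical sketch, not the merged change; `retrieve_work` appears in the stack trace above, while `queues_cmd` and `UnitOfWork` are names as I understand them from lib/sidekiq/fetch.rb):

```ruby
# Hypothetical alternative: rescue the error right where the blocking
# BRPOP is issued, inside the fetch strategy.
def retrieve_work
  work = Sidekiq.redis { |conn| conn.brpop(*queues_cmd) }
  UnitOfWork.new(*work) if work
rescue Redis::CommandError => ex
  raise unless ex.message.start_with?("UNBLOCKED")
  nil # treat the interrupted BRPOP as an empty fetch; the processor loop retries
end
```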

@mperham
Collaborator

mperham commented Sep 7, 2021

I think it's better to have a blanket policy to reset the connection upon any of these error codes.

@mperham mperham merged commit 237c70f into sidekiq:master Sep 7, 2021
@franzliedke
Contributor Author

Thanks a lot!

@mperham I read somewhere you only do releases every few months. That is totally fine, of course. Is the main branch considered stable, though?

@mperham
Collaborator

mperham commented Sep 8, 2021

Yes, main is stable.
