
Impossible to restart Sidekiq pro on Heroku - Redis::CommandError: ERR only (P)SUBSCRIBE / (P)UNSUBSCRIBE / PING / QUIT allowed in this context #3886

Closed
hugobarthelemy opened this issue Jul 13, 2018 · 9 comments

Comments

@hugobarthelemy

ruby 2.3.7p456
Sidekiq Pro 3.5.3
Redis 3.3.5

Hi all,

Since yesterday, we have seen a lot of errors appear in Sidekiq on one of our applications on Heroku. Since this morning we cannot restart Sidekiq: after each restart of the application, Sidekiq crashes with the trace below.

Here is our configuration:

require "sidekiq/worker_killer"

SIDEKIQ_REDIS_CONFIG = begin
  uri = URI.parse(APP_CONFIG["sidekiq_redis_url"])
  if APP_CONFIG["redis_ssl_enabled"]
    uri.scheme = "rediss"
    uri.port = Integer(uri.port) + 1
  end
  {
    url: uri.to_s,
    driver: :ruby,
    ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE }
  }
end

# Use global logger & log level for Sidekiq
Sidekiq::Logging.logger = LogStrategy.logger

if Rails.env.test?
  ## Testing
  require "sidekiq/testing"
  Rails.logger.warn "SIDEKIQ configured in fake mode (test env)"
  Sidekiq::Testing.fake!

  # Sidekiq API does not support testing mode so
  # we need a running Redis even in test mode
  # see https://github.com/mperham/sidekiq/wiki/Testing#api
  Sidekiq.configure_client { |config| config.redis = SIDEKIQ_REDIS_CONFIG }
  Sidekiq.configure_server { |config| config.redis = SIDEKIQ_REDIS_CONFIG }
else
  # Duration a job status is kept in Redis (countdown restarts when status is updated)
  sidekiq_status_expiration = 15.minutes

  ## Server configuration
  Sidekiq.configure_server do |config|
    # Reliability
    config.super_fetch!
    config.reliable_scheduler!
    # Redis connection config
    config.redis = SIDEKIQ_REDIS_CONFIG
    # Additional middleware(s)
    config.server_middleware do |chain|
      chain.add Sidekiq::Status::ServerMiddleware, expiration: sidekiq_status_expiration
      # Enable query cache during jobs execution
      chain.add Sidekiq::QueryCacheMiddleware
      # Enable worker killer
      chain.add Sidekiq::WorkerKiller, max_rss: APP_CONFIG["sidekiq_killer_max_rss"],
                                       grace_time: APP_CONFIG["sidekiq_killer_grace_time"]
    end
    config.client_middleware do |chain|
      chain.add Sidekiq::Status::ClientMiddleware, expiration: sidekiq_status_expiration
    end
    # Poll interval by each sidekiq process
    config.average_scheduled_poll_interval = 5 # seconds
    # Specific DB pool size matching max concurrency
    # => Should at least match concurrency defined in sidekiq.yml
    db_config = ActiveRecord::Base.configurations[Rails.env]
    db_config['pool'] = Sidekiq.options[:concurrency] + 2
    ActiveRecord::Base.establish_connection(db_config)
  end

  # This should not go in a Sidekiq.configure_{client,server} block.
  Sidekiq::Client.reliable_push! unless Rails.env.test?

  ## Client configuration
  Sidekiq.configure_client do |config|
    # Redis connection config
    config.redis = SIDEKIQ_REDIS_CONFIG
    # Additional middleware(s)
    config.client_middleware do |chain|
      chain.add Sidekiq::Status::ClientMiddleware, expiration: sidekiq_status_expiration
    end
  end

  ## Override default options
  Sidekiq.default_worker_options = {
    queue: "api_default",
    retry: 3
  }

end

Here is our backtrace:

app[worker.1]: I, [2018-07-13T12:35:11.411473 #4]  INFO -- : Schedules Loaded
app[worker.1]: I, [2018-07-13T12:35:11.414173 #4]  INFO -- : Starting processing, hit Ctrl-C to stop
app[worker.1]: I, [2018-07-13T12:35:11.421439 #4]  INFO -- : SuperFetch activated
app[worker.1]: W, [2018-07-13T12:35:11.448078 #4]  WARN -- : SuperFetch: recovered 1 jobs
app[worker.1]: I, [2018-07-13T12:35:11.557614 #4]  INFO -- : start
app[worker.1]: I, [2018-07-13T12:35:11.566388 #4]  INFO -- : start
app[worker.1]: I, [2018-07-13T12:35:11.575462 #4]  INFO -- : start
app[worker.1]: APNSNotifierWorker - starting job for device 9d1c6655-6ef4-4dca-a0c6-722956f621ed
app[worker.1]: APNSNotifierWorker - starting job for device cd42a290-c645-49d0-aedc-022c35bcfe51
app[worker.1]: APNSNotifierWorker - starting job for device 5d81312a-5380-4fc9-a4b8-146cc1806281
app[worker.1]: Socket was remotely closed
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/net-http2-0.14.1/lib/net-http2/client.rb:104:in `rescue in block (2 levels) in ensure_open'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/net-http2-0.14.1/lib/net-http2/client.rb:98:in `block (2 levels) in ensure_open'
app[worker.1]: ** [NewRelic][2018-07-13 12:35:12 +0000 worker.1 (4)] INFO : Starting Agent shutdown
app[worker.1]: W, [2018-07-13T12:35:12.239228 #4]  WARN -- : {"context":"processor"}
app[worker.1]: W, [2018-07-13T12:35:12.239329 #4]  WARN -- : Redis::CommandError: ERR only (P)SUBSCRIBE / (P)UNSUBSCRIBE / PING / QUIT allowed in this context
app[worker.1]: W, [2018-07-13T12:35:12.239443 #4]  WARN -- : /app/vendor/bundle/ruby/2.3.0/bundler/gems/redis-rb-95229df721f7/lib/redis/client.rb:121:in `call'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/newrelic_rpm-3.17.0.325/lib/new_relic/agent/instrumentation/redis.rb:42:in `block in call'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/newrelic_rpm-3.17.0.325/lib/new_relic/agent/datastores.rb:111:in `wrap'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/newrelic_rpm-3.17.0.325/lib/new_relic/agent/instrumentation/redis.rb:41:in `call'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/bundler/gems/redis-rb-95229df721f7/lib/redis.rb:752:in `block in set'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/bundler/gems/redis-rb-95229df721f7/lib/redis.rb:58:in `block in synchronize'
app[worker.1]: /app/vendor/ruby-2.3.7/lib/ruby/2.3.0/monitor.rb:214:in `mon_synchronize'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/bundler/gems/redis-rb-95229df721f7/lib/redis.rb:58:in `synchronize'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/bundler/gems/redis-rb-95229df721f7/lib/redis.rb:750:in `set'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-pro-3.5.3/lib/sidekiq/pro/super_fetch.rb:60:in `block in orphan_check?'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-4.2.10/lib/sidekiq.rb:95:in `block in redis'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:65:in `block (2 levels) in with'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `handle_interrupt'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:64:in `block in with'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `handle_interrupt'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/connection_pool-2.2.2/lib/connection_pool.rb:61:in `with'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-4.2.10/lib/sidekiq.rb:92:in `redis'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-pro-3.5.3/lib/sidekiq/pro/super_fetch.rb:59:in `orphan_check?'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-pro-3.5.3/lib/sidekiq/pro/super_fetch.rb:20:in `initialize'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-4.2.10/lib/sidekiq/processor.rb:38:in `new'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-4.2.10/lib/sidekiq/processor.rb:38:in `initialize'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-4.2.10/lib/sidekiq/manager.rb:95:in `new'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-4.2.10/lib/sidekiq/manager.rb:95:in `block in processor_died'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-4.2.10/lib/sidekiq/manager.rb:92:in `synchronize'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-4.2.10/lib/sidekiq/manager.rb:92:in `processor_died'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-4.2.10/lib/sidekiq/processor.rb:76:in `rescue in run'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-4.2.10/lib/sidekiq/processor.rb:67:in `run'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-4.2.10/lib/sidekiq/util.rb:17:in `watchdog'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/sidekiq-4.2.10/lib/sidekiq/util.rb:26:in `block in safe_thread'
heroku[worker.1]: State changed from up to crashed
@mperham
Collaborator

mperham commented Jul 13, 2018 via email

@hugobarthelemy
Author

Unfortunately, we have not changed anything

@mperham
Collaborator

mperham commented Jul 13, 2018 via email

@hugobarthelemy
Author

We are simply paying closer attention because this is our staging API and the mobile team is testing its new application version against it. Production works fine. That said, we noticed that the error Redis::CommandError: ERR only (P)SUBSCRIBE / (P)UNSUBSCRIBE / PING / QUIT allowed in this context is also present in production, except that there it does not prevent Sidekiq from restarting.

@ccyrille

ccyrille commented Jul 13, 2018

Hello, thanks a lot @mperham for your quick reply !

I found out that net-http2 was the origin of the problem, based on Hugo's logs:

app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/net-http2-0.14.1/lib/net-http2/client.rb:104:in `rescue in block (2 levels) in ensure_open'
app[worker.1]: /app/vendor/bundle/ruby/2.3.0/gems/net-http2-0.14.1/lib/net-http2/client.rb:98:in `block (2 levels) in ensure_open'

Not sure why, but all the logs we have, in both staging and production, point to this cause. After upgrading net-http2 from version 0.14.1 to 0.18.0, the problem was solved 😄

UPDATE

I didn't manage to really reproduce the issue, even by going back to net-http2 0.14.1, so I can't be 100% sure that the net-http2 upgrade did the trick.

One other interesting thing I found is that I don't have an :error callback configured for my net-http2 connections. Looking at the documentation, it says :

It is RECOMMENDED to set the :error callback: if none is defined, the underlying socket thread may raise an error in the main thread at unexpected execution times.

From my point of view, this could be the most probable cause of the problem, so I also added a handler like this to my project. I will try to keep you posted on the results:

conn.on(:error) do |exception|
  puts "Unexpected http-2 exception : #{exception}"
end
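
For reference, here is how such a handler can be wired up when the connection is created. This is only a sketch: the certificate path, the sandbox URL, and the logger call are assumptions, not something taken from this thread.

connection = Apnotic::Connection.new(
  cert_path: "config/certs/apns.pem",            # assumed certificate location
  url: "https://api.development.push.apple.com"  # APNS sandbox endpoint
)

# Without this callback, errors raised in the underlying socket thread
# can surface in the main thread at unpredictable times.
connection.on(:error) do |exception|
  Rails.logger.error("Unexpected http-2 exception: #{exception.inspect}")
end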

@krasnoukhov
Contributor

krasnoukhov commented Jul 14, 2018

I'm pretty sure that original issue that @hugobarthelemy is experiencing is this: ostinelli/apnotic#68

Basically net-http2 versions before 0.18.0 don't work at all with Apple APNS sandbox anymore (production wasn't affected), so attempting to create a connection fails. I'm pretty sure it's still easily reproducible with older versions of net-http2 and APNS sandbox.
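
In practice the immediate fix is just requiring a newer gem version; a minimal Gemfile line (the >= 0.18.0 floor comes from the upgrade mentioned above):

# Gemfile
# versions before 0.18.0 fail against the APNS sandbox
gem "net-http2", ">= 0.18.0"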

For us it surfaced a bigger problem that I described in this issue: ostinelli/apnotic#69
TL;DR: using Apnotic::ConnectionPool as suggested in the current documentation is unsafe and leads to crashes and lost jobs.

The safe way would be to create a pool manually and make sure to assign an error handler (as @ccyrille pointed out):

class Worker
  POOL = ConnectionPool.new(size: 5) do
    connection = Apnotic::Connection.new(...)

    connection.on(:error) do |err|
      Bugsnag.notify(ConnectionError.new(err.inspect))
    end

    connection
  end
end

We actually see a few dozen SocketError crashes in production every day, which means that before adding that error handler the process was crashing every time one of them happened. Thanks to super_fetch, most of the jobs weren't actually lost.
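
For completeness, a job would then check a connection out of that pool inside perform. A rough sketch that reopens the Worker class from the snippet above; the device_token/message arguments and the alert field are assumptions:

class Worker
  include Sidekiq::Worker

  # POOL is the manually built ConnectionPool from the snippet above.
  def perform(device_token, message)
    POOL.with do |connection|
      notification = Apnotic::Notification.new(device_token)
      notification.alert = message

      response = connection.push(notification)
      # a synchronous push returns nil on timeout, so treat that as a failure too
      raise "APNS push failed or timed out" if response.nil? || !response.ok?
    end
  end
end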

@krasnoukhov
Contributor

@mperham Is it possible that super_fetch would permanently lose a job after a certain number of retries if the process keeps picking up the job and then crashing, in a scenario similar to the above? I feel like we were actually experiencing that, specifically with long-running jobs, but I can't confirm it 100%.

@mperham
Collaborator

mperham commented Jul 14, 2018

No, super_fetch will never purposefully delete or drop a job. Poison pill jobs will continue to accumulate until the bug is fixed and they can be executed successfully.

@bensie
Contributor

bensie commented Jul 17, 2018

I can confirm that bumping net-http2 to 0.18.0 fixed this issue for us as well (APNS-related jobs). Thanks, everyone, for taking the time to post your findings and the fix for something that's ultimately not a Sidekiq issue!

@mperham mperham closed this as completed Jul 17, 2018