
ConnectionPool::TimeoutError when using limiter inside capsule (with fix suggestion) #5752

Closed
sos4nt opened this issue Jan 27, 2023 · 1 comment

sos4nt commented Jan 27, 2023

Ruby version: 3.0.4
Rails version: n/a
Sidekiq / Pro / Enterprise version(s): 7.0.3

Possibly related to #5702, #5685, and #5684.

We recently upgraded to Sidekiq 7 and encountered ConnectionPool::TimeoutError exceptions in our workers after setting up a single-threaded capsule along with Sidekiq's rate limiting. For now, we've reverted to not using capsules, but I'd really like to get this working.

Tracking down the cause wasn't easy because the errors occurred sporadically and didn't seem to follow a pattern. However, after a lot of digging in the depths of Sidekiq's codebase, I finally managed to create a small setup that reproduces the exception (for simplicity, without Rails):

# sidekiq.rb

require 'sidekiq-ent'

Sidekiq.configure_server do |config|
  # Additional capsule with a single processor thread, fed from the "single" queue.
  config.capsule('single-threaded') do |cap|
    cap.concurrency = 1
    cap.queues = %w[single]
  end
end

# Concurrent limiter shared by both workers below.
LIMITER = Sidekiq::Limiter.concurrent('limiter', 30)

# Runs in the default capsule (queue "default").
class WorkerA
  include Sidekiq::Job
  sidekiq_options queue: :default, retry: false, backtrace: true

  def perform(value)
    LIMITER.within_limit { puts(value) }
  end
end

# Runs in the single-threaded capsule (queue "single").
class WorkerB
  include Sidekiq::Job
  sidekiq_options queue: :single, retry: false, backtrace: true

  def perform(value)
    LIMITER.within_limit { puts(value) }
  end
end

I start Sidekiq with sidekiq -r ./sidekiq.rb. Inside IRB, I can now schedule WorkerB, which runs in the "single" capsule, and then WorkerA, which runs in the "default" capsule (the order of execution is important):

$ irb -r ./sidekiq.rb
irb(main):001:0> WorkerB.perform_async('hello from worker b')
=> "c5360245000d1e1df50e9f79"
irb(main):002:0> WorkerA.perform_async('hello from worker a')
=> "2bf003b6f8214c6dffb90ff5"

The Sidekiq process log shows that WorkerB runs fine but WorkerA fails with the aforementioned exception:

2023-01-27T14:14:47.981Z pid=14758 tid=9rm class=WorkerB jid=c5360245000d1e1df50e9f79 INFO: start
hello from worker b
2023-01-27T14:14:47.983Z pid=14758 tid=9rm class=WorkerB jid=c5360245000d1e1df50e9f79 elapsed=0.002 INFO: done

2023-01-27T14:15:21.275Z pid=14758 tid=9qy class=WorkerA jid=2bf003b6f8214c6dffb90ff5 elapsed=1.005 INFO: fail
2023-01-27T14:15:21.275Z pid=14758 tid=9qy WARN: {"context":"Job raised exception","job":{"retry":false,"queue":"default","backtrace":true,"args":["hello from worker a"],"class":"WorkerA","jid":"2bf003b6f8214c6dffb90ff5","created_at":1674828920.269701,"enqueued_at":1674828920.269772},"_config":"#<Sidekiq::Config:0x00007fa7a49ded08>"}
2023-01-27T14:15:21.275Z pid=14758 tid=9qy WARN: ConnectionPool::TimeoutError: Waited 1 sec, 0/1 available

This is accompanied by a long stack trace pointing to sidekiq-ent/limiter/concurrent.rb.

I could also get this exception when running WorkerA before WorkerB, though not as consistently (maybe a timing issue with the connection pool). In that case, WorkerB fails and the error message is "Waited 1 sec, 0/5 available".
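
For context, the exception itself is just the connection_pool gem timing out on a contended pool. Here is a minimal, Sidekiq-free sketch of that mechanism (the pool size and timeout are chosen to mirror the message above, and the pooled objects are dummies):

require 'connection_pool'

# A pool of size 1 with a 1 second checkout timeout.
pool = ConnectionPool.new(size: 1, timeout: 1) { Object.new }

holder = Thread.new { pool.with { sleep 2 } } # occupy the only connection
sleep 0.1                                     # make sure the holder thread wins the race

begin
  pool.with { :never_reached }
rescue ConnectionPool::TimeoutError => e
  puts e.message # e.g. "Waited 1 sec, 0/1 available"
end

holder.join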


Digging further, I found that the redis_pool inside the Sidekiq::Limiter::Concurrent class had a size of 1 for both workers. Eventually, I traced this back to the redis_pool method in Sidekiq::Limiter, which looks like this:

# sidekiq-ent-7.0.3/lib/sidekiq-ent/limiter.rb

module Sidekiq
  module Limiter
    # ...
    def self.redis_pool
      @redis ||= Sidekiq.redis_pool
    end
    # ...
  end
end

If I understand the code correctly, this memoizes the redis pool taken from Thread.current[:sidekiq_capsule], i.e. from whichever capsule happens to use the limiter first. When I remove the @redis ||= memoization, the above example works fine:

    def self.redis_pool
      Sidekiq.redis_pool
    end

With this change applied, redis_pool.size inside Sidekiq::Limiter::Concurrent switched between 1 and 5 depending on the worker / queue (as it should?), and no exception occurred anymore.
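
To double-check my understanding of the memoization pitfall, here is a toy example outside of Sidekiq (TinyLimiter and :capsule_pool are made-up names; only the ||= pattern mirrors the real code). Whichever context calls the memoized reader first pins its pool for everyone else:

require 'connection_pool'

SINGLE_POOL  = ConnectionPool.new(size: 1) { Object.new }
DEFAULT_POOL = ConnectionPool.new(size: 5) { Object.new }

module TinyLimiter
  # Mirrors `@redis ||= Sidekiq.redis_pool`: the first caller's pool wins
  # because the module-level @redis is shared across all threads.
  def self.redis_pool
    @redis ||= Thread.current[:capsule_pool]
  end
end

Thread.new { Thread.current[:capsule_pool] = SINGLE_POOL;  puts TinyLimiter.redis_pool.size }.join  # => 1
Thread.new { Thread.current[:capsule_pool] = DEFAULT_POOL; puts TinyLimiter.redis_pool.size }.join  # => 1, not 5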

Although this fix "works", I'm not sure it's the correct way to solve the issue. Besides, I don't really understand why running a worker with the wrong capsule's pool would immediately cause a ConnectionPool::TimeoutError in the first place. Is the connection pool size synchronized with the number of Sidekiq threads? (Maybe you can shed some light on the way connection pools are used in Sidekiq, just out of interest.)

I hope the information above helps you fix this problem soon.

mperham (Collaborator) commented Jan 27, 2023

Excellent debugging, well done.

I think the proper fix is removing the = sign from ||=. With Sidekiq 6.x you can either provide a custom connection pool for limiter usage or you can reuse Sidekiq's global pool. Because that global pool is now per-capsule in Sidekiq 7.x, we can't memoize the pool anymore.
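
In other words, something along these lines (just a sketch based on the snippet quoted earlier in this issue, not the actual sidekiq-ent source):

module Sidekiq
  module Limiter
    # ...
    def self.redis_pool
      # Keep a custom pool if one was configured, otherwise resolve the
      # current capsule's pool on every call instead of memoizing it.
      @redis || Sidekiq.redis_pool
    end
    # ...
  end
end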

mperham closed this as completed Jan 27, 2023