
ConnectionPool::TimeoutError when using limiter inside capsule (with fix suggestion) #5752

Closed
sos4nt opened this issue Jan 27, 2023 · 1 comment

sos4nt commented Jan 27, 2023

Ruby version: 3.0.4
Rails version: n/a
Sidekiq / Pro / Enterprise version(s): 7.0.3

Possibly related to #5702, #5685, and #5684.

We recently upgraded to Sidekiq 7 and encountered ConnectionPool::TimeoutError exceptions in our workers after setting up a single-threaded capsule along with Sidekiq's rate limiting. For now, we've reverted to not using capsules, but I'd really like to get this working.

Tracking down the cause wasn't easy because the errors occurred sporadically and didn't seem to follow a pattern. However, after a lot of digging in the depths of Sidekiq's codebase, I finally managed to create a small setup that reproduces the exception (for simplicity, without Rails):

# sidekiq.rb

require 'sidekiq-ent'

Sidekiq.configure_server do |config|
  # Additional capsule with a single processor thread, fed from the "single" queue.
  config.capsule('single-threaded') do |cap|
    cap.concurrency = 1
    cap.queues = %w[single]
  end
end

# Concurrent limiter shared by both workers below.
LIMITER = Sidekiq::Limiter.concurrent('limiter', 30)

# Runs in the default capsule (queue "default").
class WorkerA
  include Sidekiq::Job
  sidekiq_options queue: :default, retry: false, backtrace: true

  def perform(value)
    LIMITER.within_limit { puts(value) }
  end
end

# Runs in the single-threaded capsule (queue "single").
class WorkerB
  include Sidekiq::Job
  sidekiq_options queue: :single, retry: false, backtrace: true

  def perform(value)
    LIMITER.within_limit { puts(value) }
  end
end

I start Sidekiq with sidekiq -r ./sidekiq.rb. Inside IRB, I can now schedule WorkerB, which runs in the "single" capsule, and then WorkerA, which runs in the "default" capsule (the order of execution is important):

$ irb -r ./sidekiq.rb
irb(main):001:0> WorkerB.perform_async('hello from worker b')
=> "c5360245000d1e1df50e9f79"
irb(main):002:0> WorkerA.perform_async('hello from worker a')
=> "2bf003b6f8214c6dffb90ff5"

The Sidekiq process log shows that WorkerB runs fine but WorkerA fails with the aforementioned exception:

2023-01-27T14:14:47.981Z pid=14758 tid=9rm class=WorkerB jid=c5360245000d1e1df50e9f79 INFO: start
hello from worker b
2023-01-27T14:14:47.983Z pid=14758 tid=9rm class=WorkerB jid=c5360245000d1e1df50e9f79 elapsed=0.002 INFO: done

2023-01-27T14:15:21.275Z pid=14758 tid=9qy class=WorkerA jid=2bf003b6f8214c6dffb90ff5 elapsed=1.005 INFO: fail
2023-01-27T14:15:21.275Z pid=14758 tid=9qy WARN: {"context":"Job raised exception","job":{"retry":false,"queue":"default","backtrace":true,"args":["hello from worker a"],"class":"WorkerA","jid":"2bf003b6f8214c6dffb90ff5","created_at":1674828920.269701,"enqueued_at":1674828920.269772},"_config":"#<Sidekiq::Config:0x00007fa7a49ded08>"}
2023-01-27T14:15:21.275Z pid=14758 tid=9qy WARN: ConnectionPool::TimeoutError: Waited 1 sec, 0/1 available

This is accompanied by a long stack trace pointing to sidekiq-ent/limiter/concurrent.rb.

I could also get this exception when running WorkerA before WorkerB, though not as consistently (maybe a timing issue with the connection pool). In that case, WorkerB fails and the error message is "Waited 1 sec, 0/5 available".
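
For context, the exception itself is just the connection_pool gem timing out on a contended pool. Here is a minimal, Sidekiq-free sketch of that mechanism (the pool size and timeout are chosen to mirror the message above, and the pooled objects are dummies):

require 'connection_pool'

# A pool of size 1 with a 1 second checkout timeout.
pool = ConnectionPool.new(size: 1, timeout: 1) { Object.new }

holder = Thread.new { pool.with { sleep 2 } } # occupy the only connection
sleep 0.1                                     # make sure the holder thread wins the race

begin
  pool.with { :never_reached }
rescue ConnectionPool::TimeoutError => e
  puts e.message # e.g. "Waited 1 sec, 0/1 available"
end

holder.join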


Digging further, I found that the redis_pool inside the Sidekiq::Limiter::Concurrent class had a size of 1 for both workers. Eventually, I traced this back to the redis_pool method in Sidekiq::Limiter, which looks like this:

# sidekiq-ent-7.0.3/lib/sidekiq-ent/limiter.rb

module Sidekiq
  module Limiter
    # ...
    def self.redis_pool
      @redis ||= Sidekiq.redis_pool
    end
    # ...
  end
end

If I understand the code correctly, this memoizes the redis pool taken from Thread.current[:sidekiq_capsule], i.e. from whichever capsule happens to use the limiter first. When I remove the @redis ||= memoization, the above example works fine:

    def self.redis_pool
      Sidekiq.redis_pool
    end

With this change applied, redis_pool.size inside Sidekiq::Limiter::Concurrent switched between 1 and 5 depending on the worker / queue (as it should?), and no exception occurred anymore.
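
To double-check my understanding of the memoization pitfall, here is a toy example outside of Sidekiq (TinyLimiter and :capsule_pool are made-up names; only the ||= pattern mirrors the real code). Whichever context calls the memoized reader first pins its pool for everyone else:

require 'connection_pool'

SINGLE_POOL  = ConnectionPool.new(size: 1) { Object.new }
DEFAULT_POOL = ConnectionPool.new(size: 5) { Object.new }

module TinyLimiter
  # Mirrors `@redis ||= Sidekiq.redis_pool`: the first caller's pool wins
  # because the module-level @redis is shared across all threads.
  def self.redis_pool
    @redis ||= Thread.current[:capsule_pool]
  end
end

Thread.new { Thread.current[:capsule_pool] = SINGLE_POOL;  puts TinyLimiter.redis_pool.size }.join  # => 1
Thread.new { Thread.current[:capsule_pool] = DEFAULT_POOL; puts TinyLimiter.redis_pool.size }.join  # => 1, not 5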

Although this fix "works", I'm not sure it's the correct way to solve the issue. Besides, I don't really understand why running a worker with the wrong capsule's pool would immediately cause a ConnectionPool::TimeoutError in the first place. Is the connection pool size synchronized with the number of Sidekiq threads? (Maybe you can shed some light on the way connection pools are used in Sidekiq, just out of interest.)

I hope the information above helps you fix this problem soon.

mperham (Collaborator) commented Jan 27, 2023

Excellent debugging, well done.

I think the proper fix is removing the = sign from ||=. With Sidekiq 6.x you can either provide a custom connection pool for limiter usage or you can reuse Sidekiq's global pool. Because that global pool is now per-capsule in Sidekiq 7.x, we can't memoize the pool anymore.
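
In other words, something along these lines (just a sketch based on the snippet quoted earlier in this issue, not the actual sidekiq-ent source):

module Sidekiq
  module Limiter
    # ...
    def self.redis_pool
      # Keep a custom pool if one was configured, otherwise resolve the
      # current capsule's pool on every call instead of memoizing it.
      @redis || Sidekiq.redis_pool
    end
    # ...
  end
end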

mperham closed this as completed Jan 27, 2023