Connection leak with BLPOP #924
Comments
So as you noted, I would like to deprecate the synchrony driver. However, I'm interested in this connection leak problem. #524 has been open for years, but I have little context on it. Are you able to reliably reproduce these leaks? If so, please submit a repro script; I'd be interested in finding a solution for this.
I'm afraid I don't have any reliable minimal/local repro, as I mainly experience this in my full-blown production environment. I've gone ahead and made a minimal copy of my problematic logic in https://gist.github.com/ajvondrak/958962a0600cf847465a059210f266e7, but I can't be sure it'll reproduce the exact circumstances I see in production. Goliath is run like `$ bundle exec ruby repro.rb`; running it with `-h` shows:
Usage: <server> [options]
Server options:
-e, --environment NAME Set the execution environment (default: development)
-a, --address HOST Bind to HOST address (default: 0.0.0.0)
-p, --port PORT Use PORT (default: 9000)
-S, --socket FILE Bind to unix domain socket
-E, --einhorn Use Einhorn socket manager
Daemon options:
-u, --user USER Run as specified user
-c, --config FILE Config file (default: ./config/<server>.rb)
-d, --daemonize Run daemonized in the background (default: false)
-l, --log FILE Log to file (default: off)
-s, --stdout Log to stdout (default: false)
-P, --pid FILE Pid file (default: off)
SSL options:
--ssl Enables SSL (default: off)
--ssl-key FILE Path to private key
--ssl-cert FILE Path to certificate
--ssl-verify Enables SSL certificate verification
Common options:
-C, --console Start a console
-v, --verbose Enable verbose logging (default: false)
-h, --help Display help message
I'm not so surprised, to be honest :/

Same 😂

(Thanks for the detailed investigation, though.)
A while back I had an issue where using `Redis#blpop` in my code would reliably leak connections. I forgot to open an issue about it, but I seemed to fix it, so here's my elaborate war story.

Setup
My app served parallel requests that coordinated through Redis: the first request in would acquire a "head" lock with `redis.set(id, lock, nx: true, ex: 5.minutes)`, then fill a list with data to dole out across the subsequent requests. When requests failed to acquire the "head" lock, they could assume they were "tail" requests and thus fall back to `redis.blpop(key, timeout: 1)` to get the data populated by the first request. Using `BLPOP` rather than a plain `LPOP` was important because of the window between acquiring the "head" lock and the list being populated, lest we go: the tail's `LPOP` returns nil before the head's `LPUSH` lands, and the tail misses its data.
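A minimal sketch of that coordination, assuming hypothetical key names and a made-up `fetch_work_for` helper (the real app's details aren't in this issue):

```ruby
require "redis"

# Sketch of the head/tail coordination described above. The key names, lock
# payload, and fetch_work_for helper are illustrative placeholders.
def handle_request(redis, id)
  lock_key = "lock:#{id}"
  list_key = "work:#{id}"

  if redis.set(lock_key, "locked", nx: true, ex: 300) # "head" won the lock
    items = fetch_work_for(id)   # hypothetical: build the data to dole out
    redis.lpush(list_key, items) # populate the list for the tail requests
    items.first
  else
    # "Tail" request: block up to 1 second waiting for the head's push.
    _key, item = redis.blpop(list_key, timeout: 1)
    item
  end
end
```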
ECONNRESET
I was worried about deploying this, because testing on localhost in a single process had already given me some indications that these parallel requests would be a problem:
$ cat urls | xargs -n 1 -P 2 curl -w "\n\n"
Within the code, we were swallowing a bunch of errors that these parallel requests were triggering. When I debugged, I found a steady stream of `ECONNRESET` errors during the lock request: the connection was just being dropped. This meant that no one could even figure out whether they had acquired the lock, let alone pop off data. Similar issues are discussed sparsely online, like in #598, but even fiddling with reconnect algorithms, increasing timeouts, and so on didn't really change the behavior.

At first, the only thing I found that helped at all was not using the synchrony driver. Switching to `:ruby` or `:hiredis` meant that my localhost could suddenly handle multiple connections without immediately erroring out. Cranking the `-P` flag to `xargs` up to more and more processes still eventually broke things, though.
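For context, the driver is chosen when the client is constructed; switching away from synchrony looks roughly like this (the URL is a placeholder):

```ruby
require "redis"

# :ruby is the default pure-Ruby driver; :hiredis uses the hiredis C extension.
redis = Redis.new(url: "redis://localhost:6379", driver: :hiredis)
```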
However, deploying this in production was a disaster. I think what happened was that the blocked requests would just pile up and prevent EventMachine from serving other requests, thus crashing all of the machines in the cluster. So I reverted to the synchrony driver, which kept the machines alive but still likely left us with a stream of silent `ECONNRESET` errors. We still saw the lock/pop algorithm succeeding, though, so we let it be. I think the success was probably just because we could distribute the requests between separate processes and machines, so the `ECONNRESET`s weren't having the same impact as they would on a single localhost process.

The connection leak
While my app's cluster remained online, RedisLabs contacted us about our connections spiking up to ~70k. The client wasn't closing its connections for some reason. The two ideas they came up with were either:

- Disconnect with `Redis#quit` after each request. We'd incur the overhead of reconnecting on each request, but maybe that's survivable.
- Use `CLIENT LIST` and manually `CLIENT KILL` off idle connections.

The first was easy to try, but it didn't have any impact: our servers were still holding onto idle connections somehow. I vetoed the second solution, because it seemed like papering over an actual bug.
Instrumentation to the rescue
So to dig into the problem, I hand-instrumented the individual redis-rb methods on a canary instance in production with Honeycomb. I discovered a pattern with the leaked connections: each `Redis#blpop` kept getting repeated `ECONNRESET` errors (like we saw before) until it finally timed out. Each repeat seemed to trigger a disconnect/reconnect, which led to a connection that failed to close.
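The instrumentation itself was just a thin wrapper around each call; conceptually it looked like the sketch below (the real version sent events to Honeycomb rather than logging, and covered more methods than `blpop`):

```ruby
# Minimal sketch of per-call instrumentation: prepend a module so every
# blpop records its duration and any error class raised.
module RedisTiming
  def blpop(*args)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    super
  rescue => e
    warn "blpop raised #{e.class}"
    raise
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    warn format("blpop took %.3fs", elapsed)
  end
end

Redis.prepend(RedisTiming)
```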
The logic in redis-rb looked fishy:

- `Redis#blpop` calls `Redis#_bpop` (redis-rb/lib/redis.rb, line 1218 in 41395e9).
- `Redis#_bpop` calls `Redis::Client#call_with_timeout` using the client timeout plus the argument's timeout, so 5 seconds + 1 second (redis-rb/lib/redis.rb, line 1192 in 41395e9).
- `Redis::Client#call_with_timeout` retries endlessly on `ConnectionError` (redis-rb/lib/redis/client.rb, line 222 in 41395e9).
- `Redis::Client#io` casts the `ECONNRESET` I kept seeing into a `ConnectionError` (redis-rb/lib/redis/client.rb, line 267 in 41395e9).
- `Redis::Client#io` also may eventually raise a `TimeoutError` (redis-rb/lib/redis/client.rb, line 265 in 41395e9).
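Putting those pieces together, the retry path has roughly this shape (paraphrased from the description above, not copied verbatim from the library):

```ruby
# Approximate shape of Redis::Client#call_with_timeout in redis-rb 4.x:
# the socket read timeout is widened to cover the blocking command, and
# any ConnectionError (including ECONNRESET) restarts the whole call.
def call_with_timeout(command, timeout, &blk)
  with_socket_timeout(timeout) do
    call(command, &blk)
  end
rescue ConnectionError
  retry
end
```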
I couldn't be sure, but it looked like maybe the `Redis::Client#ensure_connected` method might do some hanky things with respect to connecting and disconnecting each time we retry, and retries were endless for the `ECONNRESET` errors until we finally timed out. So I wondered if maybe EventMachine got overwhelmed (or something?) so that idle connections kept piling up without getting closed on the client side.

This wasn't a fully-baked idea, but I circumvented all the timeout handling by changing the relevant code like so:
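Roughly, the change looked like this (a sketch reconstructed from the description and the "inlining `Redis#call`" remark below, not the exact patch):

```ruby
# Sketch of the workaround: route _bpop through the plain call path instead of
# Redis::Client#call_with_timeout, so the blocking read isn't wrapped in the
# widened timeout + endless ConnectionError retry described above.
# (Handles only the options-hash form of the timeout, e.g. timeout: 1.)
class Redis
  def _bpop(cmd, args, &blk)
    options = args.last.is_a?(Hash) ? args.pop : {}
    timeout = options[:timeout] || 0
    keys = args.flatten

    synchronize do |client|
      client.call([cmd, *keys, timeout], &blk)
    end
  end
end
```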
When I tried this code out on a canary instance, it stayed stable at only ~12 connections instead of thousands. I'd expected 8 connections per box, since that's how many processes I had with separate redis-rb clients, but maybe the disconnect/reconnect of `#ensure_connected` crops up naturally sometimes? I don't know.

The remaining boxes in my cluster still swept in to fill up our max connections on RedisLabs, though. So I deployed the fix to the whole cluster to see if it made a dent. It completely fixed the issue, so I left it at that.
Now what?
I think one good reason to consider this issue is that it raises a question about the timeout handling. I don't see why we need to block the redis-rb read for those 6 seconds instead of 1, since `BLPOP` already takes care of its timeout server-side. For my purposes, I didn't even want the `BLPOP` to take a full second, since we were subject to more stringent timeouts; one second is just the lowest resolution the timeout has in Redis.
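To spell out the arithmetic from the numbers above (the 5-second client timeout is the one mentioned earlier):

```ruby
client_timeout = 5 # seconds: the redis-rb client's configured timeout
blpop_timeout  = 1 # seconds: the timeout argument passed to BLPOP
# call_with_timeout widens the socket read timeout to cover both, even though
# the server itself replies with nil once blpop_timeout expires.
read_timeout = client_timeout + blpop_timeout # => 6
```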
So anyway, I thought this was an interesting problem. I don't know enough to say for certain what the issue is, but I think it says something that I could fix the leak by inlining the `Redis#call`. It seems to be the complex convergence of several different problems, and I'm not exactly sure how they interact. Something to chew on!