
Connection leak with BLPOP #924

Closed
ajvondrak opened this issue Jun 16, 2020 · 7 comments

@ajvondrak

A while back I had an issue where using Redis#blpop in my code would reliably leak connections. I forgot to open an issue about it, but I seemed to have fixed it, so here's my elaborate war story.

Setup

  • Unfortunately, this code base is so old that we still use postrank-labs/goliath@6c41d0a4, which means we use the synchrony driver. This issue could well be isolated to EM shenanigans alone, but I still think it's interesting to consider. At the very least, it's another nail in EM's coffin and a vote of confidence for Deprecate the synchrony driver #915.
  • Relevant versions at the time of the issue:
    eventmachine (1.2.7)
    em-synchrony (1.0.3)
    hiredis (0.6.1)
    redis (4.0.2)
    
  • The feature we were trying to write was (and still is) a contrived solution to a very specific problem. We were basically relying on race conditions & Redis-side blocking to force an arbitrary serialization of parallel requests. The first request to arrive would acquire a lock using redis.set(id, lock, nx: true, ex: 5.minutes), then fill a list with data to dole out across the subsequent requests. When a request failed to acquire the "head" lock, it could assume it was a "tail" request and fall back to redis.blpop(key, timeout: 1) to get the data populated by the first request. Using BLPOP rather than a plain LPOP was important because of the window between acquiring the "head" lock and the list actually being populated; otherwise we could end up with a sequence like this (a rough sketch of the head/tail flow follows this list):
    Time | Request 1    | Request 2
    -----|--------------|----------------------
       0 | acquire lock |
       1 |              | fail to acquire lock
       2 |              | LPOP returns nil
       3 | LPUSH        |
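Here's that sketch, purely for illustration - the serve handler, the expensive_work_for helper, and the 300-second lock TTL are hypothetical stand-ins rather than the actual production code:

    require "redis"
    require "securerandom"

    redis = Redis.new(driver: :synchrony) # the driver we were stuck on at the time

    # The first request to win the SET NX race becomes the "head" and fills
    # the list; everyone else is a "tail" and blocks server-side via BLPOP,
    # waiting for the head to push data.
    def serve(redis, id, key)
      lock = SecureRandom.uuid

      if redis.set(id, lock, nx: true, ex: 300) # i.e. ex: 5.minutes
        items = expensive_work_for(id) # hypothetical helper
        redis.rpush(key, items)
        items.first
      else
        # BLPOP returns [list, item], or nil if nothing is pushed within 1 second.
        _list, item = redis.blpop(key, timeout: 1)
        item
      end
    end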

ECONNRESET

I was worried about deploying this, because I had some indications that these parallel requests would be a problem when I tested on localhost in a single process:

$ cat urls | xargs -n 1 -P 2 curl -w "\n\n"

Within the code, we were swallowing a bunch of errors that these parallel requests were triggering. When I debugged, I found a steady stream of ECONNRESET errors during the lock request - the connection was just being dropped. This meant that no one could even figure out whether they had acquired the lock, let alone pop off data. Similar issues are discussed sparsely online, like in #598. But even fiddling with reconnect algorithms, increasing timeouts, etc. didn't really change the behavior.

At first, the only thing I found that worked at all was not using the synchrony driver. Switching to :ruby or :hiredis meant that my localhost could suddenly handle multiple connections without immediately erroring out. Cranking the -P flag to xargs up to more & more processes still eventually broke things, though.

However, deploying this in production was a disaster. I think what happened was that the blocked requests would just pile up and prevent EM from serving other requests, thus crashing all of the machines in the cluster. So I reverted to using the synchrony driver, which kept the machines alive, but still likely left us with a stream of silent ECONNRESET errors. However, we still saw the lock/pop algorithm being successful, so we let it be. I think the success was probably just because we could distribute the requests between separate processes/machines, so the ECONNRESETs weren't having the same impact as they would on a single localhost process.

The connection leak

While my app's cluster remained online, RedisLabs contacted us about our connections spiking up to ~70k. The client wasn't closing its connections for some reason. The two ideas they came up with were either:

  1. Close the connection explicitly with Redis#quit each time. We'd incur the overhead of reconnecting on each request, but maybe that's survivable.
  2. Create a script that periodically does a CLIENT LIST and manually CLIENT KILL off idle connections.

The first was easy to try, but didn't have any impact. Our servers were still holding onto idle connections somehow. I vetoed trying the second solution, because it seemed like papering over an actual bug.
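For the record, the sweeper in idea (2) would have looked something like this - a hypothetical sketch we never actually ran, with a made-up idle threshold:

    require "redis"

    # Kill any connection that has been idle for longer than max_idle seconds.
    # CLIENT LIST returns one line per connection with space-separated
    # key=value fields, e.g. "id=3 addr=127.0.0.1:51234 ... idle=42 ...".
    def kill_idle_clients(redis, max_idle: 3600)
      redis.call([:client, :list]).each_line do |line|
        fields = line.split.map { |pair| pair.split("=", 2) }.to_h
        next unless fields["idle"].to_i > max_idle

        redis.call([:client, :kill, "ID", fields["id"]])
      end
    end

    # Meant to be run periodically (cron, a recurring EM timer, etc.).
    kill_idle_clients(Redis.new)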

Instrumentation to the rescue

So to dig into the problem, I hand-instrumented the individual redis-rb methods on a canary instance in production with Honeycomb. I discovered a pattern with the leaked connections: each Redis#blpop kept getting repeated ECONNRESET errors (like we saw before) until they finally timed out. Each repeat seemed to trigger a disconnect/reconnect, which led to a connection that failed to close.

The logic in redis-rb looked fishy:
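The snippet in question was Redis::Client#ensure_connected. Paraphrased from memory of the 4.x source (so the details below may well be off), it has roughly this shape:

    # Rough paraphrase of Redis::Client#ensure_connected in redis-rb 4.x,
    # not the verbatim source:
    def ensure_connected
      attempts = 0

      begin
        attempts += 1
        connect unless connected?
        yield
      rescue BaseConnectionError
        disconnect
        # ECONNRESET lands here: disconnect, then reconnect and retry the
        # whole command as long as we're within the reconnect budget.
        retry if attempts <= @options[:reconnect_attempts]
        raise
      rescue Exception
        disconnect
        raise
      end
    end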

I couldn't be sure, but it looked like maybe the Redis::Client#ensure_connected method might do some hinky things with respect to connecting & disconnecting on each retry - and the retries were endless for the ECONNRESET errors until we finally timed out. So I wondered if maybe EventMachine got overwhelmed (or something??) such that idle connections kept piling up without getting closed on the client side.

This wasn't a fully-baked idea, but I circumvented all the timeout handling by changing the relevant code like so:

-    redis.blpop(key, timeout: 1)
+    redis.call([:blpop, key, 1])
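As far as I can tell (paraphrasing the 4.x source from memory, so the details may be off), the difference is that Redis#blpop routes through a wrapper that stretches the client-side read timeout to cover the server-side blocking, while the bare Redis#call skips that extra handling:

    # Rough paraphrase of the BLPOP path inside redis-rb 4.x, not the
    # verbatim source:
    def _bpop(cmd, keys, timeout)
      synchronize do |client|
        command = [cmd, keys, timeout]
        # The client-side socket timeout gets extended by the BLPOP timeout,
        # so blpop(key, timeout: 1) can block the read for read_timeout + 1
        # seconds before redis-rb gives up.
        timeout += client.timeout if timeout > 0
        client.call_with_timeout(command, timeout)
      end
    end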

When I tried this code out on a canary instance, it stayed stable at only ~12 connections instead of thousands. I'd expected 8 connections per box, since that's how many processes I had with separate redis-rb clients, but maybe the disconnect/reconnect of #ensure_connected crops up naturally sometimes? I don't know.

The remaining boxes in my cluster still swept in to fill up our max connections on RedisLabs, though. So I deployed this fix to the whole cluster to see if it made a dent. It completely fixed the issue, so I left it at that.

Now what?

I think one good reason to consider this issue is that it raises a question around the timeout handling. I don't see why we need to block the redis-rb read for those 6 seconds - presumably the 1-second BLPOP timeout plus the client's default 5-second read timeout - instead of 1, since BLPOP already takes care of its timeout server-side. For my purposes, I didn't even want the BLPOP to take a full second, since we were subject to more stringent timeouts; 1 second is just the lowest resolution BLPOP's timeout supports in Redis.

So anyway, I thought this was an interesting problem. I don't know enough to say for certain what the issue is, but I think it says something that I could fix the leak by inlining the Redis#call. It seems to be a complex convergence of several different problems, and I'm not exactly sure how they interact. Something to chew on!

@byroot
Collaborator

byroot commented Jun 17, 2020

So as you noted, I would like to deprecate the synchrony driver.

However, I'm interested in this connection leak problem. #524 has been open for years, but I have little context on it.

Are you able to reliably reproduce these leaks? If so, please submit a repro script; I'd be interested in finding a solution for this.

@ajvondrak
Author

I'm afraid I don't have any reliable minimal/local repro, as I mainly experience this in my full-blown production environment. I've gone ahead and made a minimal copy of my problematic logic in https://gist.github.com/ajvondrak/958962a0600cf847465a059210f266e7, but I can't be sure it'll reproduce the exact circumstances I see in production.

Goliath is run like this:

$ bundle exec ruby repro.rb -h
Usage: <server> [options]

Server options:
    -e, --environment NAME           Set the execution environment (default: development)
    -a, --address HOST               Bind to HOST address (default: 0.0.0.0)
    -p, --port PORT                  Use PORT (default: 9000)
    -S, --socket FILE                Bind to unix domain socket
    -E, --einhorn                    Use Einhorn socket manager

Daemon options:
    -u, --user USER                  Run as specified user
    -c, --config FILE                Config file (default: ./config/<server>.rb)
    -d, --daemonize                  Run daemonized in the background (default: false)
    -l, --log FILE                   Log to file (default: off)
    -s, --stdout                     Log to stdout (default: false)
    -P, --pid FILE                   Pid file (default: off)

SSL options:
        --ssl                        Enables SSL (default: off)
        --ssl-key FILE               Path to private key
        --ssl-cert FILE              Path to certificate
        --ssl-verify                 Enables SSL certificate verification

Common options:
    -C, --console                    Start a console
    -v, --verbose                    Enable verbose logging (default: false)
    -h, --help                       Display help message

@ajvondrak
Author

ajvondrak commented Sep 16, 2020

The plot thickens (or maybe thinnens?). 🧐 While the BLPOP code from above hammered the reconnects (and thus the leaked connections), I just had another experience that leads me to believe this only has to do with the synchrony driver.

I'm in the process of rewriting a service that's based on eventmachine. I have traffic partially split between the old & new code, both still running on top of eventmachine. Recently I've been ramping traffic up more & more on the new code. And go figure: another connection leak issue, albeit slower than the BLPOP issue from before. We weren't even doing the BLPOP stuff on the new code base, so it probably had something to do with the Redis connections themselves.

I dug around, checking gem versions between the old & new code, etc. In searching, I discovered that the old code base had been using the Redis::Connection::Ruby driver all along. This wasn't on purpose: we're on eventmachine, so you'd figure we'd want to use Redis::Connection::Synchrony. That's what the new code was doing. But as soon as it got more traffic, there went our connection count.

So I reverted the new code to using Redis::Connection::Ruby, and not only did our connection count plummet, I stopped seeing any Redis connection errors in production - which had otherwise been a long-running feature on the new code base.
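For anyone following along, the driver is just a client option in redis-rb 4.x; the URL here is a placeholder:

    require "redis"

    # What the new code had effectively been doing (needs eventmachine +
    # em-synchrony and a running reactor):
    redis = Redis.new(url: "redis://localhost:6379", driver: :synchrony)

    # What we reverted to, which made both the connection count and the
    # connection errors go away:
    redis = Redis.new(url: "redis://localhost:6379", driver: :ruby)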

A graph from https://redislabs.com showing the connection count dropping from ~11k to ~350.

A graph from https://honeycomb.io showing Redis errors, which drop to zero after I switched the driver from synchrony to ruby.

And just as a sanity check, the Redis activity has been humming along, just without the errors.

A graph from https://honeycomb.io showing that Redis errors stopped, but Redis activity still actually continued after I switched the driver.

@byroot
Collaborator

byroot commented Sep 16, 2020

this only has to do with the synchrony driver

I'm not so surprised to be honest :/

@ajvondrak
Author

I'm not so surprised to be honest :/

Same 😂

@byroot
Collaborator

byroot commented Aug 17, 2022

The synchrony driver has been removed in 5.0. Closing.

(Thanks for the detailed investigation though).

byroot closed this as completed Aug 17, 2022
@ajvondrak
Author

The synchrony driver has been removed in 5.0.

Ding dong, the witch is dead

^ I'm allowed to post that, now that I'm no longer using synchrony in prod. 😂 Thanks for your attention to the ticket, and for keeping it updated.
