Problems with reliability of Pub/Sub subscriptions in different Redis clients #7855

Open
adrianpasternak opened this issue Sep 28, 2020 · 4 comments

Comments

@adrianpasternak

Describe the bug

I'm not sure if this is the right place to report this issue, because it seems like a problem with Redis clients. But the same issue is present in all the clients that I've checked (Lettuce, Redisson, Jedis, go-redis).

In the case of a sudden connection loss, Redis clients are not able to detect the network problem and keep listening for Pub/Sub messages on a broken TCP connection for hours, making Pub/Sub unusable.

To reproduce

  1. Start Redis on Host A.
  2. Subscribe to a Pub/Sub channel from Host B using one of the Redis clients.
  3. Block all traffic to the Redis server on Host A using iptables or another tool.
  4. The Redis client does not discover that the connection is lost.
  5. Now restart Redis on Host A and restore network traffic.
  6. The Redis client keeps listening on a connection that no longer exists on the server side.

I've managed to reproduce this behavior using three different Java clients, and go-redis. Ticket for Lettuce with more details: redis/lettuce#1428

Expected behavior

Redis clients subscribed to a Pub/Sub channel should be able to detect a broken network connection and reconnect when necessary.

Additional information

The undocumented workaround for this issue is to tweak OS parameters on the client's host: SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT.
It's similar to what the redis-cli client does at the application layer:

anetKeepAlive(NULL, context->fd, REDIS_CLI_KEEPALIVE_INTERVAL);

int anetKeepAlive(char *err, int fd, int interval)

Is there any other way of making Pub/Sub subscriptions reliable without changing OS parameters?
Shouldn't all Redis clients change socket parameters at the application layer, like redis-cli does?
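
For readers who want to see what setting these options per socket looks like, here is a minimal Python sketch. The function names `enable_keepalive` and `worst_case_detection_seconds` and the default values (15 s idle, 5 s interval, 3 probes) are made up for illustration and are not taken from redis-cli or any client library. The tuning options are not exposed on every platform, so the sketch applies each one only if the `socket` module provides it.

```python
import socket


def enable_keepalive(sock, idle=15, interval=5, count=3):
    """Enable TCP keepalive on one socket (hypothetical helper)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options are platform-dependent, so apply each one
    # only if this Python build exposes the constant.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


def worst_case_detection_seconds(idle, interval, count):
    """Approximate time until a silent, dead peer is reported to the application."""
    return idle + interval * count
```

With the values commonly shipped as Linux defaults (7200 s idle, 75 s interval, 9 probes), `worst_case_detection_seconds(7200, 75, 9)` gives 7875 seconds, a little over two hours. That is in line with the "for hours" behavior described above, and it only applies when SO_KEEPALIVE is enabled on the socket in the first place.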

@oranagra
Member

Since redis (the server side) is no longer present, I don't think anything can be done on the server side to mitigate it. It must be handled on the client side, by either the OS or the client library.
TCP keepalive seems like the right solution (that's exactly what it was designed for AFAIK).

@yossigo do you see anything that can be done on our side other than document it? (which I'm not sure will help much)

@yossigo
Member

yossigo commented Sep 29, 2020

@oranagra Theoretically we could come up with an application level keepalive mechanism where Redis periodically sends a heartbeat message. This would involve a lot of backwards compatibility issues and I am not sure there's a significant benefit that justifies it.

I think the best we can do is raise awareness of this issue among client maintainers, who should consider enabling TCP keepalive by default on Pub/Sub connections.

@oranagra
Member

even if redis is sending keepalive messages, it's still the client's responsibility to notice when they stop arriving and conclude that the connection is dead.
maybe instead the client can try to send some PING and detect a write failure when the socket is dead.
but i don't see any advantage for all of that over TCP KEEPALIVE.
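
A rough sketch of the client-side PING idea, in Python over a raw socket. The helper name `connection_is_alive` and the 5-second default are invented for this example. Note one assumption about behavior: a write to a black-holed connection usually succeeds into the kernel's send buffer, so the sketch waits for any reply with a timeout instead of relying on the write itself failing. It also throws away whatever bytes it reads; a real client on a subscribed connection would have to hand them to its protocol parser, since they may be a published message and not the PING reply.

```python
import socket


def connection_is_alive(sock, timeout=5.0):
    """Send a PING and report whether any bytes come back within `timeout` seconds."""
    previous_timeout = sock.gettimeout()
    sock.settimeout(timeout)
    try:
        sock.sendall(b"*1\r\n$4\r\nPING\r\n")  # PING encoded as a RESP array
        # recv() returning b"" means the peer closed the connection.
        return len(sock.recv(4096)) > 0
    except OSError:
        # Covers a receive timeout as well as reset / broken-pipe errors.
        return False
    finally:
        sock.settimeout(previous_timeout)
```

When this returns False, the caller would close the socket, reconnect, and resubscribe.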

@itamarhaber do you know where something like that can be documented? and how to bring this to the attention of existing client maintainers?

@tzickel

tzickel commented Sep 30, 2020

  • This is a general issue with long-lived silent TCP connections, not specific to Redis or Pub/Sub. (Consider a blocking operation with an infinite timeout, such as BLPOP: there you can't even send a PING, whereas on Pub/Sub you can.)

    It can happen in many ways. Think about a connection pool where one of the connections has stalled as described above: the client sends a command on that connection and never receives a response (and what is a good timeout for that?).

  • Clients should provide sensible ways to try to mitigate the variety of issues that can arise from this:

  1. When taking a connection from a pool that has been idle for a while, try a PING before using it (redis-py has this, but it is disabled by default):

    https://github.com/andymccurdy/redis-py/blob/master/redis/connection.py#L676

  2. When possible (as in Pub/Sub), send software keepalive PINGs (the catch is that it depends on how easy and portable it is to send a PING once in a while without involving the end user of the library).

  3. Make the OS-level keepalive settings easy to configure (most clients expose them in a raw form that is neither easy nor portable). Compare redis-py, where you have to know the options for your OS:
    https://github.com/andymccurdy/redis-py/blob/master/redis/connection.py#L590
    with justredis, where you just give it the keepalive interval and it tries to be smart about it:
    https://github.com/tzickel/justredis/blob/master/justredis/sync/environments/threaded.py#L27
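
To make the first mitigation in the list above concrete, here is a small Python sketch of "PING before reuse if the connection has been idle". It is not how redis-py or justredis implement it: the class name `IdleCheckedConnection`, the `ping` callable, and the 30-second threshold are all assumptions chosen for illustration.

```python
import time


class IdleCheckedConnection:
    """Tracks idle time for a pooled connection and probes it before reuse."""

    def __init__(self, ping, max_idle=30.0, clock=time.monotonic):
        self._ping = ping          # callable returning True if the server answered
        self._max_idle = max_idle  # seconds of idleness that trigger a probe
        self._clock = clock        # injectable so the logic can be tested
        self._last_used = clock()

    def checkout(self):
        """Return True if the connection can be used, False if it should be replaced."""
        idle_for = self._clock() - self._last_used
        if idle_for >= self._max_idle and not self._ping():
            return False
        self._last_used = self._clock()
        return True
```

A pool would call `checkout()` before handing the connection out and would open a fresh connection whenever it returns False. The probe costs one extra round trip, which is why it only runs after the idle threshold has passed.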

  • I had lots of strange issues in my code where some of the Redis connections would sometimes just hang for no apparent reason. It happened frequently enough that I ended up enabling client-side OS keepalive, which fixed the issue.
