
Takes 20 minutes to reconnect to redis #1466

Closed
RossRosem opened this issue Oct 20, 2020 · 14 comments
Labels
status: invalid An issue that we don't feel is valid

Comments

@RossRosem

Bug Report

Current Behavior

Upon losing the connection to Redis, it takes around 15 minutes for the connection to be re-established. We consistently see the following repeated over and over in the logs until the connection is re-established.
19-10-2020 22:19:34.025 DEBUG r.n.http.server.HttpServerOperations  [id: 0x6ef0bde5, L:/10.0.0.123:8080 - R:/10.0.0.6:45848] New http connection, requesting read 
19-10-2020 22:19:34.025 DEBUG r.netty.channel.BootstrapHandlers  [id: 0x6ef0bde5, L:/10.0.0.123:8080 - R:/10.0.0.6:45848] Initialized pipeline DefaultChannelPipeline{(BootstrapHandlers$BootstrapInitializerHandler#0 = reactor.netty.channel.BootstrapHandlers$BootstrapInitializerHandler), (reactor.left.httpCodec = io.netty.handler.codec.http.HttpServerCodec), (reactor.left.httpTrafficHandler = reactor.netty.http.server.HttpTrafficHandler), (reactor.right.reactiveBridge = reactor.netty.channel.ChannelOperationsHandler)} 
19-10-2020 22:19:34.025 DEBUG r.n.http.server.HttpServerOperations  [id: 0x6ef0bde5, L:/10.0.0.123:8080 - R:/10.0.0.6:45848] Increasing pending responses, now 1 
19-10-2020 22:19:34.025 DEBUG reactor.netty.http.server.HttpServer  [id: 0x6ef0bde5, L:/10.0.0.123:8080 - R:/10.0.0.6:45848] Handler is being applied: org.springframework.http.server.reactive.ReactorHttpHandlerAdapter@7ad3dcc9 
19-10-2020 22:19:34.028 DEBUG r.n.http.server.HttpServerOperations  [id: 0x6ef0bde5, L:/10.0.0.123:8080 - R:/10.0.0.6:45848] Detected non persistent http connection, preparing to close 
19-10-2020 22:19:34.029 DEBUG r.n.http.server.HttpServerOperations  [id: 0x6ef0bde5, L:/10.0.0.123:8080 - R:/10.0.0.6:45848] Last HTTP response frame 
19-10-2020 22:19:34.029 DEBUG r.n.http.server.HttpServerOperations  [id: 0x6ef0bde5, L:/10.0.0.123:8080 - R:/10.0.0.6:45848] Last HTTP packet was sent, terminating the channel 
19-10-2020 22:19:45.907 DEBUG r.n.http.server.HttpServerOperations  [id: 0xe376eb28, L:/10.0.0.123:8080 - R:/10.0.0.101:44736] New http connection, requesting read 

After 15 or so minutes, we then see:

stus2-test.redis.cache.windows.net/52.242.79.64:6380, chid=0x1] Unexpected exception during request: io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
19-10-2020 22:20:48.664 DEBUG i.l.core.protocol.ConnectionWatchdog correlationId=81836eb2-f092-44df-b379-adf0d2390bd1 [channel=0xf2c1750a, /10.0.0.123:42450 -> obapi-redis-cache-eastus2-test.redis.cache.windows.net/52.242.79.64:6380, last known addr=obapi-redis-cache-eastus2-test.redis.cache.windows.net/52.242.79.64:6380] userEventTriggered(ctx, SslCloseCompletionEvent(java.nio.channels.ClosedChannelException)) 
19-10-2020 22:20:48.664 DEBUG i.l.core.protocol.CommandHandler correlationId=81836eb2-f092-44df-b379-adf0d2390bd1 [channel=0xf2c1750a, /10.0.0.123:42450 -> obapi-redis-cache-eastus2-test.redis.cache.windows.net/52.242.79.64:6380, chid=0x1] channelInactive() 
19-10-2020 22:20:48.664 DEBUG i.l.core.protocol.DefaultEndpoint correlationId=81836eb2-f092-44df-b379-adf0d2390bd1 [channel=0xf2c1750a, /10.0.0.123:42450 -> obapi-redis-cache-eastus2-test.redis.cache.windows.net/52.242.79.64:6380, epid=0x1] deactivating endpoint handler 
19-10-2020 22:20:48.665 DEBUG i.l.core.protocol.DefaultEndpoint correlationId=81836eb2-f092-44df-b379-adf0d2390bd1 [channel=0xf2c1750a, /10.0.0.123:42450 -> obapi-redis-cache-eastus2-test.redis.cache.windows.net/52.242.79.64:6380, epid=0x1] notifyQueuedCommands adding 3 command(s) to buffer 
19-10-2020 22:20:48.665 DEBUG i.l.core.protocol.CommandHandler correlationId=81836eb2-f092-44df-b379-adf0d2390bd1 [channel=0xf2c1750a, /10.0.0.123:42450 -> obapi-redis-cache-eastus2-test.redis.cache.windows.net/52.242.79.64:6380, chid=0x1] channelInactive() done 
19-10-2020 22:20:48.665 DEBUG i.l.core.protocol.ConnectionWatchdog correlationId=81836eb2-f092-44df-b379-adf0d2390bd1 [channel=0xf2c1750a, /10.0.0.123:42450 -> obapi-redis-cache-eastus2-test.redis.cache.windows.net/52.242.79.64:6380, last known addr=obapi-redis-cache-eastus2-test.redis.cache.windows.net/52.242.79.64:6380] channelInactive() 
19-10-2020 22:20:48.665 DEBUG i.l.core.protocol.ConnectionWatchdog correlationId=81836eb2-f092-44df-b379-adf0d2390bd1 [channel=0xf2c1750a, /10.0.0.123:42450 -> obapi-redis-cache-eastus2-test.redis.cache.windows.net/52.242.79.64:6380, last known addr=obapi-redis-cache-eastus2-test.redis.cache.windows.net/52.242.79.64:6380] scheduleReconnect() 

Expected behavior/code

We expect the reconnect to happen in a minute or less.

Note, we only see this behavior in Azure Kubernetes. Running the same Docker image (i.e. the one deployed in the cloud) locally on our Macs, we see the reconnect happening in less than 1 minute.

Environment

  • Lettuce version(s): Lettuce 5.3.1.RELEASE
@RossRosem RossRosem added the type: bug A general bug label Oct 20, 2020
@mp911de
Collaborator

mp911de commented Oct 20, 2020

Any reason this issue is closed? The logs from the HTTP handlers don't seem related to Lettuce. Connection timed out typically means that a connection attempt has timed out because the remote peer didn't accept the connection request in time.

@RossRosem
Author

The Connection timed out is actually the one part that seems OK, as it kicks off the ConnectionWatchdog's attempt to re-establish the connection. The issue is more that it takes 10-15 minutes for this to happen. From what I can tell, the CommandHandler is not detecting that the channel is inactive.

@mp911de
Collaborator

mp911de commented Oct 20, 2020

Thanks for clarifying. If a remote peer dies and comes back up, there's no way to detect that state without traffic. You can generally enable TCP keep-alive for more timely detection of dead peers. #1428 describes a workaround to configure keep-alive options.

Lettuce makes proper use of the TCP stack. Depending on what active components Azure introduces, you might experience a different behavior than on your local machine. I'd suggest reaching out to the Azure team to discuss that issue.
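
For reference, the keep-alive workaround from #1428 boils down to tuning TCP keep-alive at the Netty level via a NettyCustomizer. The following is only a minimal sketch under the assumption of a Linux host with the native epoll transport (netty-transport-native-epoll) on the classpath; the host placeholder and the timing values are illustrative, not taken from this thread:

import io.lettuce.core.RedisClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;
import io.lettuce.core.resource.NettyCustomizer;
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.epoll.EpollChannelOption;

// Tune TCP keep-alive so a silently dead peer is detected within seconds
// instead of waiting for the OS-level retransmission timeout.
ClientResources resources = DefaultClientResources.builder()
        .nettyCustomizer(new NettyCustomizer() {
            @Override
            public void afterBootstrapInitialized(Bootstrap bootstrap) {
                bootstrap.option(ChannelOption.SO_KEEPALIVE, true);
                bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 15);  // seconds of idle before the first probe
                bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 5);  // seconds between probes
                bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);    // failed probes before the connection is dropped
            }
        })
        .build();

RedisClient client = RedisClient.create(resources, "rediss://<your-cache-host>:6380");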

@RossRosem
Author

Thanks for the feedback.

@mp911de
Collaborator

mp911de commented Oct 20, 2020

May I close this ticket or is there anything else I can assist you with?

@RossRosem
Author

yes you may... thanks

@mp911de mp911de added status: invalid An issue that we don't feel is valid and removed type: bug A general bug labels Oct 20, 2020
@rimvydas-pranciulis-ruptela

Do the extended keep-alive settings from #1428 really help to solve this problem? What I found is that keep-alive only works when the connection is idle and there is no traffic. If there is constant traffic between the application and Redis, and the Redis connection stops responding (but is not closed), we see TCP retransmission kicking in, after which keep-alive messages aren't sent even if the time between retransmission messages is higher than the TCP_KEEPIDLE setting.

The count of TCP retransmissions is controlled by the Linux TCP_RETRIES2 setting, whose default value is 15; only after that is the connection closed on the client OS side. Retransmissions are also repeated with an exponential back-off policy, so in total it takes around 15-20 minutes for Lettuce to reconnect in such a scenario. During this period all commands to Redis get timeout errors. Couldn't Lettuce just re-establish the connection if it gets x consecutive timeout errors on an existing connection?

We are using Azure Redis, and during its maintenance we sometimes see behavior where the Redis connection stays open and no FIN packet arrives, but it stops responding; after restarting the application a new connection is established successfully. Azure Support helplessly keeps saying that the client application has to be resilient to all types of failures, including such dead connections. However, Lettuce isn't resilient to this problem, and after Azure maintenance we get a downtime of ~15 minutes. We currently worked around this by setting up a health probe and restarting the application if the Redis connection is broken for ~1 minute. Another option would be to reduce the TCP_RETRIES2 setting in Linux. However, the most reliable solution would be for Lettuce to re-establish the connection if it gets x consecutive timeout errors on an existing connection.
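
For illustration, the health-probe workaround described above could look roughly like the following in a Spring Boot application. This is a hypothetical sketch (class name, wiring, and thresholds are assumptions, not from this thread): the indicator is exposed through Actuator and wired into a Kubernetes liveness probe so the pod is restarted once the probe has been failing for about a minute.

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

// Hypothetical liveness check: PING Redis and report DOWN when it fails, so the
// orchestrator can restart the application instead of waiting out TCP_RETRIES2.
@Component
public class RedisPingHealthIndicator implements HealthIndicator {

    private final StringRedisTemplate redisTemplate;

    public RedisPingHealthIndicator(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    @Override
    public Health health() {
        try {
            // A command timeout surfaces as an exception and marks the probe DOWN.
            redisTemplate.getConnectionFactory().getConnection().ping();
            return Health.up().build();
        } catch (Exception ex) {
            return Health.down(ex).build();
        }
    }
}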

@mp911de
Collaborator

mp911de commented Mar 2, 2021

How is the client side supposed to know that Redis is not available when there's no FIN packet and TCP retransmit doesn't catch that the server is gone? Happy to improve on that end, but I'm left without any ideas as to what else we could do.

@rimvydas-pranciulis-ruptela

One option I see is that Lettuce could try to create a new connection after getting x consecutive timeout errors. Maybe it would be worth adding some minimal time period so the old connection isn't closed too quickly if there was only a short connection issue.

Another option could be a background thread sending periodic "ping" messages to Redis and trying to reconnect to Redis once they start failing.

What are your thoughts? Azure told us that they aren't going to fix the missing FIN packet issue.

@mp911de
Collaborator

mp911de commented Mar 2, 2021

> What are your thoughts? Azure told us that they aren't going to fix the missing FIN packet issue.

Keepalive with a short timing is likely to be the appropriate fix here.

> Lettuce could try to create a new connection after getting x consecutive timeout errors

Imagine your connection has issued a BLPOP 0 foo and awaits a response while another thread tries to issue other commands. Because BLPOP blocks the read, you would suddenly see a reconnect.

That doesn't seem right.
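
To illustrate the blocking scenario (a sketch only; the host and keys are placeholders): a blocking command parks the connection's reads, so other commands sharing that connection time out even though the connection itself is healthy, and a "reconnect after N timeouts" rule would tear it down for no reason.

import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.async.RedisAsyncCommands;

RedisClient client = RedisClient.create("redis://localhost");
StatefulRedisConnection<String, String> connection = client.connect();
RedisAsyncCommands<String, String> async = connection.async();

async.blpop(0, "foo"); // blocks this connection's reads until an element arrives
async.get("bar");      // times out after the command timeout, although the connection is fine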

@nitin-patil1

Do we have any solution?

@fbeaufume

fbeaufume commented Oct 12, 2021

Currently facing the same kind of issue from a Spring Boot 2.5.3 application (using Lettuce) to a Redis managed service in Azure: it takes about 15 minutes for the application to reconnect to Redis when Redis reboots. The actual reboot duration is much shorter, and during these 15 minutes connections from a third-party client such as RedisInsight work fine.

During the 15 minutes, the exceptions received by the application are: org.springframework.dao.QueryTimeoutException: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out after 16 second(s)

For the record, I get the same answers from Azure tech support as @rimvydas-pranciulis-ruptela, i.e.:

"Azure Support helplessly keeps saying that client application has to be resilient to all type of failures including such dead connection."

I also expected an improvement in Lettuce such as an optional flag meaning "when some queries time out, drop the connection and create a new one" (even if it does not work in some cases, such as the BLPOP example you gave).

@mp911de any chance such an improvement could happen?

@before30

before30 commented Oct 14, 2021

I'm facing the same issue especially when connecting to other data centers.

And I found that gRPC has the same issue: gRPC performs a kind of self health check on the connection and creates a new connection when the ping fails.
https://cs.mcgill.ca/~mxia3/2019/02/23/Using-gRPC-in-Production/

I tried this code for a health check and it works well:

// Imports assumed: org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory,
// io.lettuce.core.RedisCommandTimeoutException
LettuceConnectionFactory connectionFactory = (LettuceConnectionFactory) redisTemplate.getConnectionFactory();
try {
    // PING Redis; a timeout here means the connection is most likely dead.
    connectionFactory.getConnection().ping();
} catch (RedisCommandTimeoutException ex) {
    // Drop the shared native connection so a fresh one is created on next use.
    connectionFactory.resetConnection();
}
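
A natural way to run such a check periodically is a scheduled task; the following is only a sketch (the class name and interval are assumptions, and @EnableScheduling must be active in the application):

import io.lettuce.core.RedisCommandTimeoutException;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical scheduled wrapper around the check above.
@Component
public class RedisConnectionHealthCheck {

    private final LettuceConnectionFactory connectionFactory;

    public RedisConnectionHealthCheck(LettuceConnectionFactory connectionFactory) {
        this.connectionFactory = connectionFactory;
    }

    @Scheduled(fixedDelay = 30_000) // every 30 seconds; the interval is illustrative
    public void pingAndResetIfBroken() {
        try {
            connectionFactory.getConnection().ping();
        } catch (RedisCommandTimeoutException ex) {
            // Drop the shared native connection so the next command reconnects.
            connectionFactory.resetConnection();
        }
    }
}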

@mp911de
Collaborator

mp911de commented Nov 25, 2021

This issue has already become a mess, with comments pointing in all sorts of directions.

> Given that the library maintainers seem to have no interest in addressing this

The maintainers do not have an infinite amount of time. The maintainers do not have the possibility to buy accounts for every cloud. The maintainers would accept contributions that help to identify the problem. The maintainers would even accept a pull request.

Open source is not a one-way road where maintainers can be treated as an infinite resource. Open source lives from its community, and a single person expected to fix other people's problems is not a community.
