Redisson is not able to recover from elasticache cluster mode node type change (emulating cluster failure/maintenance) #4555

Closed
jeremy-ihs opened this issue Sep 22, 2022 · 4 comments

Expected behavior
Redisson should automatically discover new nodes as the topology changes and resume normal operation once all nodes are alive.
Actual behavior
Redisson throws errors and cannot recover without a JVM restart:
... caused by: org.redisson.client.RedisTimeoutException: Unable to acquire subscription lock after 39000ms. Try to increase 'timeout', 'subscriptionsPerConnection', 'subscriptionConnectionPoolSize' parameters.
    at org.redisson.pubsub.PublishSubscribeService.lambda$timeout$7(PublishSubscribeService.java:241) ~[redisson-3.17.6.jar:3.17.6]
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
    ... 1 more
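For reference, the parameters named in that message correspond to Redisson cluster settings. A minimal sketch of where they would be raised, with purely illustrative values (this is not presented as a fix, only as a pointer to which knobs the message refers to):

```java
import org.redisson.config.ClusterServersConfig;
import org.redisson.config.Config;

class SubscriptionTuningSketch {
    static Config build() {
        Config config = new Config();
        ClusterServersConfig cluster = config.useClusterServers();
        cluster.setTimeout(30000);                      // 'timeout' (ms), already 30000 in our config
        cluster.setSubscriptionsPerConnection(10);      // 'subscriptionsPerConnection', default 5
        cluster.setSubscriptionConnectionPoolSize(100); // 'subscriptionConnectionPoolSize', default 50
        return config;
    }
}
```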

Steps to reproduce or test case

  1. Connect to ElastiCache Redis with cluster mode enabled (my use case uses RLocks as well as RLocalCachedMaps; see the sketch after this list)
  2. Change the cluster node type, for example cache.t3.small -> cache.t3.medium, to simulate a full topology change (cluster failure/maintenance)
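A rough sketch of the kind of client usage behind step 1; the endpoint, key names, and timings are placeholders, not our actual code:

```java
import java.util.concurrent.TimeUnit;
import org.redisson.Redisson;
import org.redisson.api.LocalCachedMapOptions;
import org.redisson.api.RLocalCachedMap;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;
import org.redisson.config.ReadMode;

public class ElasticacheClusterRepro {
    public static void main(String[] args) throws InterruptedException {
        Config config = new Config();
        config.useClusterServers()
              .setReadMode(ReadMode.MASTER_SLAVE)
              .addNodeAddress("rediss://clustercfg.example.use1.cache.amazonaws.com:6379"); // placeholder endpoint
        RedissonClient redisson = Redisson.create(config);

        RLocalCachedMap<String, String> map =
                redisson.getLocalCachedMap("demo-map", LocalCachedMapOptions.<String, String>defaults());
        RLock lock = redisson.getLock("demo-lock");

        // Keep traffic flowing so the node type change in step 2 happens mid-operation.
        while (true) {
            if (lock.tryLock(5, 30, TimeUnit.SECONDS)) {
                try {
                    map.put("key", "value");
                } finally {
                    lock.unlock();
                }
            }
            Thread.sleep(1000);
        }
    }
}
```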

Redis version
6.2.6
Redisson version
3.17.6

Redisson configuration
ClusterServersConfig clusterServers = config.useClusterServers()
        .setRetryInterval(3000)
        .setTimeout(30000)
        .setReadMode(ReadMode.MASTER_SLAVE);
Utilizing a "rediss://" connection string which points to the elasticache configuration endpoint provided by AWS, with TLS enabled
Remaining config values are defaults

jeremy-ihs changed the title from "Redisson is not able to recover from elasticache cluster mode node type change (emulating cluster failure)" to "Redisson is not able to recover from elasticache cluster mode node type change (emulating cluster failure/maintenance)" on Sep 22, 2022
jeremy-ihs (Author) commented Sep 28, 2022

Here is some additional info: stack traces from logs coming out of Redisson during RLock usage.

2022-09-28 18:39:53,180 WARN (redisson-timer-4-1) [io.netty.util.HashedWheelTimer] [containerId=74bec3759bd2486e84f79ed7d84c19a6] An exception was thrown by TimerTask.:
java.lang.NullPointerException
    at org.redisson.RedissonBaseLock.evalWriteAsync(RedissonBaseLock.java:210) [redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonBaseLock.renewExpirationAsync(RedissonBaseLock.java:179) [redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonBaseLock$1.run(RedissonBaseLock.java:140) [redisson-3.17.6.jar:3.17.6]
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at java.lang.Thread.run(Thread.java:750) [rt.jar:1.8.0_342]
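For context, the `RedissonBaseLock$1.run` frame in that trace is the lock watchdog: it is scheduled when a lock is acquired without an explicit leaseTime, and it calls renewExpirationAsync on a redisson-timer thread. A trivial illustration of that usage pattern (the lock name is a placeholder, not our code):

```java
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;

class LockWatchdogSketch {
    static void doWork(RedissonClient redisson) {
        RLock lock = redisson.getLock("demo-lock"); // placeholder name
        lock.lock(); // no leaseTime -> the watchdog periodically runs renewExpirationAsync
        try {
            // ... critical section; the watchdog keeps extending the lock TTL ...
        } finally {
            lock.unlock();
        }
    }
}
```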

Also,

java.lang.NullPointerException: null
    at org.redisson.RedissonBaseLock.evalWriteAsync(RedissonBaseLock.java:210) ~[redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonLock.tryLockInnerAsync(RedissonLock.java:198) ~[redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonLock.tryAcquireOnceAsync(RedissonLock.java:152) ~[redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonLock.tryLockAsync(RedissonLock.java:463) ~[redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonLock.tryLockAsync(RedissonLock.java:458) ~[redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonLock.tryLock(RedissonLock.java:194) ~[redisson-3.17.6.jar:3.17.6]

And finally,

2022-09-28 18:44:18,940 ERROR (redisson-netty-2-11) [org.redisson.cluster.ClusterConnectionManager] [containerId=74bec3759bd2486e84f79ed7d84c19a6] Unable to execute (CLUSTER NODES):
org.redisson.client.RedisLoadingException: LOADING Redis is loading the dataset in memory. channel: [id: 0x598a0711, L:/10.227.164.188:57636 - R:10.227.164.172/10.227.164.172:6379] data: CommandData [promise=java.util.concurrent.CompletableFuture@67ad4343[Not completed, 2 dependents], command=(CLUSTER NODES), params=[], codec=null]
    at org.redisson.client.handler.CommandDecoder.decode(CommandDecoder.java:351) [redisson-3.17.6.jar:3.17.6]
    at org.redisson.client.handler.CommandDecoder.decodeCommand(CommandDecoder.java:198) [redisson-3.17.6.jar:3.17.6]
    at org.redisson.client.handler.CommandDecoder.decode(CommandDecoder.java:137) [redisson-3.17.6.jar:3.17.6]
    at org.redisson.client.handler.CommandDecoder.decode(CommandDecoder.java:113) [redisson-3.17.6.jar:3.17.6]
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510) [netty-codec-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.codec.ReplayingDecoder.callDecode(ReplayingDecoder.java:366) [netty-codec-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279) [netty-codec-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1373) [netty-handler-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1236) [netty-handler-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1285) [netty-handler-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510) [netty-codec-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449) [netty-codec-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279) [netty-codec-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at java.lang.Thread.run(Thread.java:750) [rt.jar:1.8.0_342]
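For completeness: that CLUSTER NODES call comes from Redisson's periodic cluster topology scan, which is what has to succeed again before the client can pick up the new nodes. A sketch of the related settings, with purely illustrative values (not presented as a fix):

```java
import org.redisson.config.Config;

class ClusterScanSketch {
    static Config build() {
        Config config = new Config();
        config.useClusterServers()
              .setScanInterval(2000)   // ms between CLUSTER NODES topology scans (illustrative)
              .setRetryAttempts(5)     // command retry attempts (illustrative)
              .setRetryInterval(3000)  // ms between retries, same as our config above
              .addNodeAddress("rediss://clustercfg.example.use1.cache.amazonaws.com:6379"); // placeholder
        return config;
    }
}
```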

mrniko (Member) commented Nov 17, 2022

Did you try version 3.17.7?

jeremy-ihs (Author) commented
Yes, 3.17.7 did remove the NullPointerException, so that's progress, thank you for that. However, I am still seeing problems with cluster maintenance and failover; I think it's the same issue as here: #4653

mrniko (Member) commented Nov 30, 2022

> Yes, 3.17.7 did remove the NullPointerException, so that's progress, thank you for that. However, I am still seeing problems with cluster maintenance and failover; I think it's the same issue as here: #4653

This issue was fixed in version 3.18.1. Please try it.

mrniko closed this as completed Nov 30, 2022