Redisson is not able to recover from elasticache cluster mode node type change (emulating cluster failure/maintenance) #4555

Closed
jeremy-ihs opened this issue Sep 22, 2022 · 4 comments

Expected behavior
Redisson should automatically discover new nodes as the topology changes and resume normal operation once all nodes are alive.
Actual behavior
Redisson throws errors and cannot recover without a JVM restart:
... caused by: org.redisson.client.RedisTimeoutException: Unable to acquire subscription lock after 39000ms. Try to increase 'timeout', 'subscriptionsPerConnection', 'subscriptionConnectionPoolSize' parameters.
    at org.redisson.pubsub.PublishSubscribeService.lambda$timeout$7(PublishSubscribeService.java:241) ~[redisson-3.17.6.jar:3.17.6]
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
    ... 1 more
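For reference, the parameters named in that message correspond to Redisson cluster settings. A minimal sketch of where they would be raised, with purely illustrative values (this is not presented as a fix, only as a pointer to which knobs the message refers to):

```java
import org.redisson.config.ClusterServersConfig;
import org.redisson.config.Config;

class SubscriptionTuningSketch {
    static Config build() {
        Config config = new Config();
        ClusterServersConfig cluster = config.useClusterServers();
        cluster.setTimeout(30000);                      // 'timeout' (ms), already 30000 in our config
        cluster.setSubscriptionsPerConnection(10);      // 'subscriptionsPerConnection', default 5
        cluster.setSubscriptionConnectionPoolSize(100); // 'subscriptionConnectionPoolSize', default 50
        return config;
    }
}
```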

Steps to reproduce or test case

  1. Connect to ElastiCache Redis with cluster mode enabled (my use case uses RLocks as well as RLocalCachedMaps; see the sketch after this list)
  2. Change the cluster node type, for example cache.t3.small -> cache.t3.medium, to simulate a full topology change (cluster failure/maintenance)
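A rough sketch of the kind of client usage behind step 1; the endpoint, key names, and timings are placeholders, not our actual code:

```java
import java.util.concurrent.TimeUnit;
import org.redisson.Redisson;
import org.redisson.api.LocalCachedMapOptions;
import org.redisson.api.RLocalCachedMap;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;
import org.redisson.config.ReadMode;

public class ElasticacheClusterRepro {
    public static void main(String[] args) throws InterruptedException {
        Config config = new Config();
        config.useClusterServers()
              .setReadMode(ReadMode.MASTER_SLAVE)
              .addNodeAddress("rediss://clustercfg.example.use1.cache.amazonaws.com:6379"); // placeholder endpoint
        RedissonClient redisson = Redisson.create(config);

        RLocalCachedMap<String, String> map =
                redisson.getLocalCachedMap("demo-map", LocalCachedMapOptions.<String, String>defaults());
        RLock lock = redisson.getLock("demo-lock");

        // Keep traffic flowing so the node type change in step 2 happens mid-operation.
        while (true) {
            if (lock.tryLock(5, 30, TimeUnit.SECONDS)) {
                try {
                    map.put("key", "value");
                } finally {
                    lock.unlock();
                }
            }
            Thread.sleep(1000);
        }
    }
}
```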

Redis version
6.2.6
Redisson version
3.17.6

Redisson configuration
ClusterServersConfig clusterServers = config.useClusterServers()
        .setRetryInterval(3000)
        .setTimeout(30000)
        .setReadMode(ReadMode.MASTER_SLAVE);
Utilizing a "rediss://" connection string which points to the elasticache configuration endpoint provided by AWS, with TLS enabled
Remaining config values are defaults

jeremy-ihs changed the title from "Redisson is not able to recover from elasticache cluster mode node type change (emulating cluster failure)" to "Redisson is not able to recover from elasticache cluster mode node type change (emulating cluster failure/maintenance)" on Sep 22, 2022
jeremy-ihs (Author) commented Sep 28, 2022

Here is some additional info: stack traces from logs coming out of Redisson during RLock usage.

2022-09-28 18:39:53,180 WARN (redisson-timer-4-1) [io.netty.util.HashedWheelTimer] [containerId=74bec3759bd2486e84f79ed7d84c19a6] An exception was thrown by TimerTask.:
java.lang.NullPointerException
    at org.redisson.RedissonBaseLock.evalWriteAsync(RedissonBaseLock.java:210) [redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonBaseLock.renewExpirationAsync(RedissonBaseLock.java:179) [redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonBaseLock$1.run(RedissonBaseLock.java:140) [redisson-3.17.6.jar:3.17.6]
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at java.lang.Thread.run(Thread.java:750) [rt.jar:1.8.0_342]
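For context, the `RedissonBaseLock$1.run` frame in that trace is the lock watchdog: it is scheduled when a lock is acquired without an explicit leaseTime, and it calls renewExpirationAsync on a redisson-timer thread. A trivial illustration of that usage pattern (the lock name is a placeholder, not our code):

```java
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;

class LockWatchdogSketch {
    static void doWork(RedissonClient redisson) {
        RLock lock = redisson.getLock("demo-lock"); // placeholder name
        lock.lock(); // no leaseTime -> the watchdog periodically runs renewExpirationAsync
        try {
            // ... critical section; the watchdog keeps extending the lock TTL ...
        } finally {
            lock.unlock();
        }
    }
}
```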

Also,

java.lang.NullPointerException: null
    at org.redisson.RedissonBaseLock.evalWriteAsync(RedissonBaseLock.java:210) ~[redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonLock.tryLockInnerAsync(RedissonLock.java:198) ~[redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonLock.tryAcquireOnceAsync(RedissonLock.java:152) ~[redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonLock.tryLockAsync(RedissonLock.java:463) ~[redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonLock.tryLockAsync(RedissonLock.java:458) ~[redisson-3.17.6.jar:3.17.6]
    at org.redisson.RedissonLock.tryLock(RedissonLock.java:194) ~[redisson-3.17.6.jar:3.17.6]

And finally,

2022-09-28 18:44:18,940 ERROR (redisson-netty-2-11) [org.redisson.cluster.ClusterConnectionManager] [containerId=74bec3759bd2486e84f79ed7d84c19a6] Unable to execute (CLUSTER NODES):
org.redisson.client.RedisLoadingException: LOADING Redis is loading the dataset in memory. channel: [id: 0x598a0711, L:/10.227.164.188:57636 - R:10.227.164.172/10.227.164.172:6379] data: CommandData [promise=java.util.concurrent.CompletableFuture@67ad4343[Not completed, 2 dependents], command=(CLUSTER NODES), params=[], codec=null]
    at org.redisson.client.handler.CommandDecoder.decode(CommandDecoder.java:351) [redisson-3.17.6.jar:3.17.6]
    at org.redisson.client.handler.CommandDecoder.decodeCommand(CommandDecoder.java:198) [redisson-3.17.6.jar:3.17.6]
    at org.redisson.client.handler.CommandDecoder.decode(CommandDecoder.java:137) [redisson-3.17.6.jar:3.17.6]
    at org.redisson.client.handler.CommandDecoder.decode(CommandDecoder.java:113) [redisson-3.17.6.jar:3.17.6]
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510) [netty-codec-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.codec.ReplayingDecoder.callDecode(ReplayingDecoder.java:366) [netty-codec-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279) [netty-codec-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1373) [netty-handler-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1236) [netty-handler-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1285) [netty-handler-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510) [netty-codec-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449) [netty-codec-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279) [netty-codec-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.79.Final.jar:4.1.79.Final]
    at java.lang.Thread.run(Thread.java:750) [rt.jar:1.8.0_342]
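For completeness: that CLUSTER NODES call comes from Redisson's periodic cluster topology scan, which is what has to succeed again before the client can pick up the new nodes. A sketch of the related settings, with purely illustrative values (not presented as a fix):

```java
import org.redisson.config.Config;

class ClusterScanSketch {
    static Config build() {
        Config config = new Config();
        config.useClusterServers()
              .setScanInterval(2000)   // ms between CLUSTER NODES topology scans (illustrative)
              .setRetryAttempts(5)     // command retry attempts (illustrative)
              .setRetryInterval(3000)  // ms between retries, same as our config above
              .addNodeAddress("rediss://clustercfg.example.use1.cache.amazonaws.com:6379"); // placeholder
        return config;
    }
}
```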

mrniko (Member) commented Nov 17, 2022

Did you try version 3.17.7?

jeremy-ihs (Author) commented
Yes, 3.17.7 did remove the NullPointerException, so that's progress, thank you for that. However, I am still seeing problems with cluster maintenance and failover; I think it's the same issue as here: #4653

mrniko (Member) commented Nov 30, 2022

> Yes, 3.17.7 did remove the NullPointerException, so that's progress, thank you for that. However, I am still seeing problems with cluster maintenance and failover; I think it's the same issue as here: #4653

This issue was fixed in version 3.18.1. Please try it.

mrniko closed this as completed Nov 30, 2022