
Blocking calls not working as expected in the case of disconnections #610

Open
manast opened this issue Mar 25, 2018 · 13 comments

@manast

manast commented Mar 25, 2018

We are having a serious issue in bull (OptimalBits/bull#890), where the queue stops processing commands in the event of disconnections. I have tracked it down to be an issue in ioredis. It seems that blocking commands are not handled properly in the case of disconnections. It is very easy to reproduce, but there are many cases to consider. Here I report the most obvious ones.

Code to reproduce:

const Redis = require('ioredis');

const redis = new Redis();

redis.brpoplpush('source', 'destination', 10).then(function(result){
  console.log(result)
}, function(err){
  console.error(err);
});

redis.on('error', function(err){
  // Commented out to avoid noise.
  // console.log('ERROR EVENT', err);
});

Case 1. Disconnected before calling the blocking command.

Behaviour
Dangling call; nothing ever happens.
Expected
An error, or at least a timeout after the given timeout.

Case 2. Connected before calling the blocking command, disconnected afterwards.

Behaviour
Dangling call; nothing ever happens.
Expected
An error, or at least a timeout after the given timeout.

Case 3. Connected before calling the blocking command, then disconnected and reconnected.

Behaviour
Dangling call; nothing ever happens.
Expected
An error, or at least a timeout after the given timeout.

Case 4. Disconnected before calling the blocking command, connected afterwards.

Behaviour
Times out 10 seconds after reconnection.

Expected
Works as expected?
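
One way to simulate Cases 2 and 3 without restarting the server is to kill the blocked connection from a second connection with CLIENT KILL. A minimal sketch, assuming Redis >= 5 for CLIENT ID; the variable names are illustrative only:

const Redis = require('ioredis');

const blocked = new Redis();
const admin = new Redis();

async function reproduce() {
  // Grab the client id before the connection becomes blocked by BRPOPLPUSH.
  const id = await blocked.client('id');

  blocked.brpoplpush('source', 'destination', 10).then(function(result){
    console.log('result', result);
  }, function(err){
    console.error('rejected', err);
  });

  // Give the command a moment to block, then drop that connection server-side.
  setTimeout(function(){
    admin.client('kill', 'id', id);
  }, 1000);
}

reproduce();

Stopping the Redis server (or its Docker container) at the corresponding moment should reproduce the same cases.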

Since the blocking command is not cancelable (#516), there is currently no workaround that I know of, and you may end up with a dangling client, so I think this issue is quite serious. But please, let's discuss it.

@luin

luin commented Mar 30, 2018

Hi @manast.
For case 4, I tested locally and the behavior works as expected (it logs the result when reconnected). It would be strange for ioredis to just time out when the source list has elements. Could you run LLEN source in redis-cli when reconnected to check whether there are elements in the source list?

ioredis keeps reconnecting to the server forever, so all commands will block while disconnected. This behavior makes sense for applications where the connection will recover shortly (<10s~1min).

Setting retryStrategy to null and handling reconnection manually in the close event may solve the problem (see the sketch after this list):

case 1: prints errors immediately.
case 2: prints errors immediately.
case 3: prints errors immediately when disconnected.
case 4: prints errors immediately.
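
A minimal sketch of that suggestion, assuming a fixed one-second manual reconnect delay (the createClient wrapper and the delay are illustrative, not part of the ioredis API):

const Redis = require('ioredis');

function createClient() {
  // A retryStrategy that returns a non-number tells ioredis to stop reconnecting,
  // so pending commands are rejected instead of waiting forever.
  const redis = new Redis({ retryStrategy: () => null });

  redis.on('error', function(err){
    console.error('ERROR EVENT', err);
  });

  // Handle reconnection manually once the connection closes.
  // Note: callers must re-issue any pending blocking calls on the new instance.
  redis.on('close', function(){
    setTimeout(createClient, 1000);
  });

  return redis;
}

const redis = createClient();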

@stale

stale bot commented Apr 29, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label Apr 29, 2018
@manast

manast commented Apr 29, 2018

bump

@stale stale bot removed the wontfix label Apr 29, 2018
@ks-s-a

ks-s-a commented May 4, 2018

@luin It's quite strange. I use a Docker image with Redis instead of a pure local Redis server. I tested launching a local redis-server and the bug doesn't exist in that case (external IP, protected-mode off).

My colleagues and I tested reconnection in a production-like environment and couldn't reproduce it. We use Kubernetes; maybe it somehow affects the bug.

I'm not sure that I can investigate the issue further. I tried different Redis options both in Docker and locally, with the same result: it happens in the Docker container, but not with a local Redis server.

@manast

manast commented May 4, 2018

Ok, I will try to test again with a reproducible environment.

@lavarsicious

We're definitely seeing this issue occur when using an Azure Redis instance. If I scale the service, Azure will disconnect any clients when it cuts over.

@stale

stale bot commented Jun 14, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label Jun 14, 2018
@manast

manast commented Jun 14, 2018

bump to avoid auto close

@stale stale bot removed the wontfix label Jun 14, 2018
@luin luin added the pinned label Jun 22, 2018
@carly

carly commented Jul 16, 2018

@manast have you been able to come up with a good workaround for this issue? I'm using bull for an internal app I'm building at work. Everything works as expected on my dev box, but I'm running into this issue when I configure my app to use Redis instances on different hosts.

@manast

manast commented Jul 17, 2018

@carly not yet. I need to provide better test code for @luin, but I haven't had enough time for it; I will try to prioritize it.

@elucidsoft

I use Kubernetes and have seen this issue. I think to re-create it, you need to establish a healthy connection to Redis, then kill your Redis server, send it a command (which causes an exception), and then start your Redis server back up. Non-blocking calls will connect successfully; blocking calls will throw an exception. I can also confirm that this behavior occurs even if you're using Sentinels: if you shut down all of your Sentinels, the behavior is exactly the same.

@d0x2f

d0x2f commented Jun 9, 2020

Is this still an active issue? It may explain problems we've been seeing (also in kubernetes).

@elucidsoft

elucidsoft commented Jun 10, 2020

> Is this still an active issue? It may explain problems we've been seeing (also in kubernetes).

I was able to resolve this, but it required a TON of tweaking of my Redis instances in Kubernetes, plus code hacks, so it's solvable with a lot of work. FWIW, I ended up dumping my custom Redis configuration and went with the Bitnami Helm chart. I made sure to set Sentinel.staticID: true, and I also made sure to use sysctlImage to set net.core.somaxconn=10000 and transparent_hugepage/enabled.

Doing those things appears to have fixed this issue entirely; I have not seen it happen in over 6 months. I also changed the Redis config options:

connectTimeout: 10000, sentinelRetryStrategy: () => Math.min(10 * 10, 1000)

In addition, based on my testing, even with those changes it still appears to happen if the Redis instance doesn't have enough memory or CPU resources, so I doubled those as well.
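
For context, a rough sketch of how those options would be passed to ioredis in a Sentinel setup; the sentinel address and master name below are placeholders, not values from this thread:

const Redis = require('ioredis');

const redis = new Redis({
  sentinels: [{ host: 'redis-sentinel.example', port: 26379 }], // placeholder address
  name: 'mymaster', // placeholder master name
  connectTimeout: 10000,
  // As quoted above; note this expression evaluates to a constant 100 ms delay.
  sentinelRetryStrategy: () => Math.min(10 * 10, 1000),
});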
