fix: Improve cluster connection pool logic when disconnecting #1864
Motivation and Background
This is an attempt to fix errors occurring when a `connect()` call is made shortly after a `disconnect()`, which is something that the Bull library does when pausing a queue.

Here's a relatively minimal way to reproduce an error:

Running that script in a loop against the `main` branch of `ioredis` quickly results in this output:

My debugging led me to believe that the existing node cleanup logic in the `ConnectionPool` class leads to race conditions: upon `disconnect()`, the `this.connectionPool.reset()` call will remove nodes from the pool without cleaning up their event listeners, which may then subsequently issue more than one `drain` event. Depending on timing, one of the extra `drain` events may fire after `connect()` and change the status to `close`
, interfering with the connection attempt and leading to the error above.

Changes
- Keep track of the event listeners registered by the `ConnectionPool` class and remove them from the nodes whenever they are removed from the pool.
- Emit `-node`/`drain` regardless of whether nodes disconnected or were removed through a `reset()` call.
- In `reset()`, add nodes before removing old ones to avoid unwanted `drain` events.
- Make `this` point to the connection pool instance in the pool's event listeners.
- The error observed on `main` is seemingly different from the error shown above, but it still seems related to the disconnection logic and still gets fixed by the changes in this PR.