Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure command details are populated for Cluster clients #1026

Merged
merged 1 commit into from Sep 14, 2021

Conversation

praboud-stripe
Copy link
Contributor

@praboud-stripe praboud-stripe commented Sep 9, 2021

I recently stumbled over a corner case in the clustered client which is sort of interesting:

Background

If the Redis nodes that the client is connecting to are degraded when CommandLoader is called, the command details hash returned will be {} - the client essentially doesn't understand what any of the commands' syntax is.

details = {}
nodes.each do |node|
details = fetch_command_details(node)
details.empty? ? next : break
end
details

(This is pretty rare in practice. The queries made by SlotLoader and NodeLoader immediately prior to this need to have worked - otherwise those will raise - but then all nodes must be unreachable when we're running CommandLoader.)

These command details are later used by Command to process a command into a key, so that the command can be routed to the right node. If the command details are empty and we don't understand the command syntax, Command.determine_first_key_position doesn't know which command argument to get the hash from, so Command.extract_first_key returns '':

def determine_first_key_position(command)
case command.first.to_s.downcase
when 'eval', 'evalsha', 'migrate', 'zinterstore', 'zunionstore' then 3
when 'object' then 2
when 'memory'
command[1].to_s.casecmp('usage').zero? ? 2 : 0
when 'scan', 'sscan', 'hscan', 'zscan'
determine_optional_key_position(command, 'match')
when 'xread', 'xreadgroup'
determine_optional_key_position(command, 'streams')
else
dig_details(command, :first_key_position).to_i

def extract_first_key(command)
i = determine_first_key_position(command)
return '' if i == 0

...and that means that Cluster.find_node_key returns nil, because we don't know where the command needs to be routed. This is handled relatively gracefully - find_node routes to a random node if we don't have a specific node to route to.

def find_node_key(command, primary_only: false)
key = @command.extract_first_key(command)
return if key.empty?

def find_node(node_key)
return @node.sample if node_key.nil?

However, routing commands randomly means most commands are incurring an extra hop, because the randomly selected node is most likely wrong, and will need to redirect the client to the correct node. This works, but is a bit slower, and unnecessarily takes a dependency on an extra node (ie: if either of the random node or the node our data is on is degraded, the command will fail). The client gets wedged in this wonky state for its entire lifetime, because there's no mechanism to later update the command details. Because clients are intended to be long-lived, for most use cases, this means until the process is restarted.

Change Summary

This patch changes the client to require that the initial fetch of command details succeeds, and raises a CannotConnectError if that's not possible. Users should already handle this error, and retry creating the client later, or use some other suitable failure handling mechanism. Since SlotLoader and NodeLoader can already raise CannotConnectError, this shouldn't be a breaking change to the gem's external interface.

@praboud-stripe
Copy link
Contributor Author

It's my first time contributing, so apologies if there's a set of instructions I missed somewhere, but what's next for getting this merged? AFAICT, a maintainer needs to kick off the CI build (I can't do this myself because this is my first PR to the repo), and then somebody with write access to merge the PR (assuming the build passes).

@byroot
Copy link
Collaborator

byroot commented Sep 14, 2021

Yeah, my fault, I should have reviewed this a while ago. I'll merge in a few minutes assuming the CI is green.

Thanks for your patch.

@praboud-stripe
Copy link
Contributor Author

No worries, and thanks!

@praboud-stripe
Copy link
Contributor Author

Agh, there's a lint error - lemme send a fix.

Previously, if a clustered client is unable to query any of the nodes
for command details during setup, it'd initialize successfully, but
without any command details. (This is a bit of a corner case - in order
to hit this case, at least one node needs to be reachable when querying
nodes + slots, but all nodes would need to be unreachable immediately
afterwards when querying command details.)
In this case, the client would be unable to determine which node a
given command should be routed to. Because the client falls back to
routing commands to a random node, and following redirects as
necessary, this wouldn't cause any overt problems, but leaves the client
following unnecessary redirects for the rest of the lifespan of the
client.

This patch changes the client to require that the initial fetch of
command details succeeds (as is already done with slot and node detail
fetching), and raises a CannotConnectError if that's not possible.
Users should already handle this error, and retry creating the client
later, or use some other suitable failure handling mechanism.
@byroot byroot merged commit 5c4f2e8 into redis:master Sep 14, 2021
@praboud-stripe praboud-stripe deleted the cluster-command branch September 14, 2021 12:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants