Direct retries to another mongos if one is available #1367

Open · wants to merge 20 commits into master
Conversation

@stIncMale stIncMale commented Apr 17, 2024

Given that the current specification wording is internally inconsistent, I confirmed the actual intent with the spec author and created DRIVERS-2901 ("Clarify the intent behind the list of deprioritized mongos'es and fix the pseudocode").

Performance considerations

Our ServerSelector and ClusterDescription APIs do not allow us to implement an efficient pipeline (CompositeServerSelector) of ServerSelectors: we can neither mutate the List<ServerDescription> in place, nor mutate ClusterDescription, nor even reuse the same ClusterDescription when a selector did not filter anything out. This PR added one more selector required by the specification logic, and introduced two more selectors by refactoring server selection code that wasn't previously expressed in terms of ServerSelectors. As a result, it is conceivable that the PR has a negative performance impact.

Additionally, due to the refactoring in d25010d, each server selection iteration now involves copying the map of Servers maintained by a Cluster. While that copying does not entail locking (all hail the CHM!), it may still have an additional negative performance impact.

If we indeed notice performance degradation, we may try mitigating the issue by introducing InternalServerSelector extends ServerSelector (or a subclass of ClusterDescription that allows mutating getServerDescriptions, or both), which would allow for more efficient chaining, and using it for everything but the application-specific selector. This assumes, of course, that the would-be degradation is caused to a large extent by the inefficient selector chaining, and not by the CHM copying.
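As a rough illustration of the chaining cost described above (a sketch only: the real ServerSelector operates on a ClusterDescription, not a List<String>, and the names SelectorChainSketch and applyChain are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical, simplified stand-in for chaining ServerSelectors.
// It illustrates why the pipeline is costly when nothing can be
// mutated or reused in place: every stage produces a fresh list,
// mirroring how a composite selector must rebuild a ClusterDescription
// per stage, even when a stage filters nothing out.
public final class SelectorChainSketch {

    static List<String> applyChain(List<UnaryOperator<List<String>>> selectors, List<String> servers) {
        List<String> current = servers;
        for (UnaryOperator<List<String>> selector : selectors) {
            current = selector.apply(current); // a new list per stage
        }
        return current;
    }

    public static void main(String[] args) {
        UnaryOperator<List<String>> dropDeprioritized =
                list -> list.stream().filter(s -> !s.startsWith("deprioritized")).toList();
        UnaryOperator<List<String>> passThrough =
                list -> new ArrayList<>(list); // even a no-op stage copies
        List<String> result = applyChain(
                List.of(dropDeprioritized, passThrough),
                List.of("a:27017", "deprioritized:27017", "b:27017"));
        System.out.println(result); // [a:27017, b:27017]
    }
}
```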

JAVA-4254, JAVA-5320

Implemented the change and unit tests

@stIncMale stIncMale self-assigned this Apr 17, 2024
@stIncMale stIncMale requested review from katcharov, a team and vbabanin and removed request for a team April 17, 2024 01:19
@stIncMale stIncMale requested a review from vbabanin April 30, 2024 05:59
(@Nullable Throwable previouslyChosenException, Throwable mostRecentAttemptException) ->
CommandOperationHelper.onRetryableReadAttemptFailure(
operationContext, previouslyChosenException, mostRecentAttemptException);
return new RetryingAsyncCallbackSupplier<>(retryState, onAttemptFailure,
Contributor:

I find it difficult to understand what is happening in these methods, though I am unsure why. I think part of it is the amount of boilerplate, such as onAttemptFailure = (priorException, currentException) ->; inlining might help.

The other, more objective issue is that onAttemptFailure is being used as a failedResultTransformer, but I would expect such a thing to have no side effects (such as deprioritizing a server in OperationContext).

Member Author:

I updated the docs to reflect that side effects are allowed. The key to them being allowed is that the operator is guaranteed to be called once per failed attempt.
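To make the "once per failed attempt" guarantee concrete, here is a minimal, hypothetical retry loop (not the driver's RetryingAsyncCallbackSupplier; RetrySketch and its names are assumptions for illustration): because the operator runs exactly once per failed attempt, side effects such as recording a server for deprioritization are safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

// Hypothetical sketch of a retrying supplier. The essential property is that
// onAttemptFailure is invoked exactly once per failed attempt, which makes
// side effects (e.g., deprioritizing the server the attempt ran on) safe.
public final class RetrySketch {

    static <R> R retry(int maxAttempts, Supplier<R> attempt,
                       BiConsumer<Throwable, Throwable> onAttemptFailure) {
        Throwable previous = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                onAttemptFailure.accept(previous, e); // exactly once per failed attempt
                previous = e;
            }
        }
        throw new RuntimeException("attempts exhausted", previous);
    }

    public static void main(String[] args) {
        List<String> deprioritized = new ArrayList<>();
        int[] calls = {0};
        String result = retry(3,
                () -> {
                    if (calls[0]++ < 2) {
                        throw new RuntimeException("attempt " + calls[0] + " failed");
                    }
                    return "ok";
                },
                // side effect: remember a server to deprioritize on the next attempt
                (prev, current) -> deprioritized.add("mongos-" + deprioritized.size()));
        System.out.println(result + " after deprioritizing " + deprioritized);
        // prints: ok after deprioritizing [mongos-0, mongos-1]
    }
}
```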

I find it difficult to understand what is happening in these methods...
...inlining, might help...

Could you please clarify what you propose to inline and where?

Contributor:

I am imagining at least something like:

    static <R> AsyncCallbackSupplier<R> decorateReadWithRetriesAsync(final RetryState retryState, 
            final OperationContext operationContext, final AsyncCallbackSupplier<R> asyncReadFunction) {
        return new RetryingAsyncCallbackSupplier<>(retryState,
                onRetryableReadAttemptFailure(operationContext),
                CommandOperationHelper::shouldAttemptToRetryRead,
                logRetryAndGet(retryState, operationContext, asyncReadFunction));
    }

This is optional for this PR, and perhaps should be undertaken separately.

Member Author:

It looks like a code extraction into a method, rather than inlining. Done in 69b78ff.

…Context.java


Let's put various checks (validation, preconditions...) at the top of methods, with the operation at the bottom.

Co-authored-by: Maxim Katcharov <maxim.katcharov@mongodb.com>
@vbabanin vbabanin left a comment

LGTM!

.filter(serverDescription -> serversSnapshot.containsServer(serverDescription.getAddress()))
.collect(toList());
List<ServerSelector> selectors = Stream.of(
raceConditionPreFiltering,
Contributor:

The comment and selector can be moved to their own method:

Suggested change
raceConditionPreFiltering,
inSnapshotSelector(serversSnapshot),

Member Author:

If getCompleteServerSelector gets too big in the future, that may be helpful. For now, however, it does not seem to be needed.

// are of those `Server`s that are known to both `clusterDescription` and `serversSnapshot`.
// This way we are guaranteed to successfully get `Server`s from `serversSnapshot` based on the selected `ServerDescription`s.
//
// The pre-filtering we do to deal with the race condition described above is achieved by this `ServerSelector`.
Contributor:

It seems that this comment should be in a PR comment, rather than in the code, and that there should be a test that ensures items missing from the snapshot get filtered out by this chain. The problem may have been complicated to identify and solve, but now that there is a solution, all the facts mentioned in this comment seem like they should be evident to unfamiliar readers.

Member Author:

I think this comment should be in the code, not somewhere else, because it explains code that is not otherwise obvious.

but now that there is a solution

Note that it has been solved previously, i.e., it is already solved in master, just differently.

all the facts mentioned in this comment seem like they should be evident to unfamiliar readers

This is a huge overstatement in my opinion.

Member Author:

there should be a test that ensures items missing from the snapshot get filtered out by this chain

Done in 246353f.
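The pre-filtering idea discussed above can be sketched as follows (a simplified stand-in: real code filters ServerDescriptions against a snapshot of the Cluster's Server map; here addresses are plain strings, and SnapshotPreFilterSketch is a hypothetical name, while inSnapshotSelector echoes the name suggested in the review):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the race-condition pre-filtering: keep only the
// addresses that a concurrently taken snapshot of the Server map still knows
// about, so later lookups in the snapshot are guaranteed to succeed.
public final class SnapshotPreFilterSketch {

    static UnaryOperator<List<String>> inSnapshotSelector(Map<String, ?> serversSnapshot) {
        return descriptions -> descriptions.stream()
                .filter(serversSnapshot::containsKey) // drop addresses the snapshot no longer has
                .toList();
    }

    public static void main(String[] args) {
        Map<String, Object> snapshot = new ConcurrentHashMap<>();
        snapshot.put("a:27017", new Object());
        // "b:27017" was removed from the Server map after the
        // ClusterDescription was read, i.e., the two views disagree.
        List<String> selected = inSnapshotSelector(snapshot)
                .apply(List.of("a:27017", "b:27017"));
        System.out.println(selected); // [a:27017]
    }
}
```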

Comment on lines +355 to +356
ServerSelector raceConditionPreFiltering = clusterDescriptionPotentiallyInconsistentWithServerSnapshot ->
clusterDescriptionPotentiallyInconsistentWithServerSnapshot.getServerDescriptions()
Contributor:

Suggested change
ServerSelector raceConditionPreFiltering = clusterDescriptionPotentiallyInconsistentWithServerSnapshot ->
clusterDescriptionPotentiallyInconsistentWithServerSnapshot.getServerDescriptions()
ServerSelector raceConditionPreFiltering = clusterDescription ->
clusterDescription.getServerDescriptions()

We would not bother to put a comment saying "this is potentially inconsistent with the server snapshot".

Member Author:

We have a local variable here that has a long name explaining what it is, and the variable is used only once. Why do you think that the generic clusterDescription name is better here?

serverSelector,
serverDeprioritization.getServerSelector(),
settings.getServerSelector(), // may be null
new LatencyMinimizingServerSelector(settings.getLocalThreshold(MILLISECONDS), MILLISECONDS),
Contributor:

We should just pass the settings in to a constructor, and the constructor can figure out which settings are important to latency minimization.

Member Author:

This PR hasn't introduced this constructor, i.e., the instance is created exactly as it was created before the PR. I don't see a reason to make the change in this PR.

* <p>This class is not part of the public API and may be removed or changed at any time</p>
*/
@ThreadSafe
public final class OperationCountMinimizingServerSelector implements ServerSelector {
Contributor:

Something like MinimumOperationCountServerSelector would convey that this returns at most 1 result.

Member Author:

Done in 246353f.
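The "at most one result" behavior the new name conveys can be sketched like this (a simplified stand-in: the driver selects among ServerDescriptions using each Server's in-progress operation count, whereas here counts live in a plain map and all names are hypothetical):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a minimum-operation-count selector: it returns at
// most one server, the one with the fewest in-progress operations.
// Assumes every candidate has an entry in operationCounts.
public final class MinimumOperationCountSketch {

    static List<String> select(Map<String, Integer> operationCounts, List<String> candidates) {
        return candidates.stream()
                .min(Comparator.comparingInt(operationCounts::get))
                .map(List::of)        // at most one element, as the name suggests
                .orElseGet(List::of); // an empty input yields an empty selection
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("a:27017", 5, "b:27017", 2, "c:27017", 7);
        System.out.println(select(counts, List.of("a:27017", "b:27017", "c:27017")));
        // prints: [b:27017]
    }
}
```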

}

void updateCandidate(final ServerAddress serverAddress, final ClusterType clusterType) {
candidate = isEnabled(clusterType) ? serverAddress : null;
Contributor:

Suggested change
candidate = isEnabled(clusterType) ? serverAddress : null;
candidate = serverAddress;

I do not think the check is needed here. This potential "server on which the operation failed" will already be correctly excluded, because the ServerDeprioritization server selector is already a no-op when not sharded.

Member Author:

After thinking more about this, I agree that it's fine to omit the check in the updateCandidate method.

Done in 246353f.

Comment on lines 104 to 113
List<ServerDescription> serverDescriptions = clusterDescription.getServerDescriptions();
if (!isEnabled(clusterDescription.getType())) {
return serverDescriptions;
} else {
List<ServerDescription> nonDeprioritizedServerDescriptions = serverDescriptions
.stream()
.filter(serverDescription -> !deprioritized.contains(serverDescription.getAddress()))
.collect(toList());
return nonDeprioritizedServerDescriptions.isEmpty() ? serverDescriptions : nonDeprioritizedServerDescriptions;
}
Contributor:

Suggested change
List<ServerDescription> serverDescriptions = clusterDescription.getServerDescriptions();
if (!isEnabled(clusterDescription.getType())) {
return serverDescriptions;
} else {
List<ServerDescription> nonDeprioritizedServerDescriptions = serverDescriptions
.stream()
.filter(serverDescription -> !deprioritized.contains(serverDescription.getAddress()))
.collect(toList());
return nonDeprioritizedServerDescriptions.isEmpty() ? serverDescriptions : nonDeprioritizedServerDescriptions;
}
List<ServerDescription> serverDescriptions = clusterDescription.getServerDescriptions();
if (!isEnabled(clusterDescription.getType())) {
return serverDescriptions;
}
List<ServerDescription> nonDeprioritizedServerDescriptions = serverDescriptions
.stream()
.filter(serverDescription -> !deprioritized.contains(serverDescription.getAddress()))
.collect(toList());
return nonDeprioritizedServerDescriptions.isEmpty()
? serverDescriptions : nonDeprioritizedServerDescriptions;

Member Author:

Done in 246353f.
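For reference, a runnable restatement of the deprioritization logic above, with simplified stand-in types (addresses as strings, the enabled check collapsed to a boolean, and DeprioritizationSketch a hypothetical name): deprioritized servers are filtered out unless doing so would leave nothing, in which case the full list is returned as a fallback.

```java
import java.util.List;
import java.util.Set;

// Hypothetical, simplified restatement of the deprioritizing server selector:
// when enabled, drop deprioritized servers, but fall back to the full list
// if every server would otherwise be excluded.
public final class DeprioritizationSketch {

    static List<String> select(boolean enabled, Set<String> deprioritized, List<String> descriptions) {
        if (!enabled) {
            return descriptions;
        }
        List<String> nonDeprioritized = descriptions.stream()
                .filter(address -> !deprioritized.contains(address))
                .toList();
        return nonDeprioritized.isEmpty() ? descriptions : nonDeprioritized;
    }

    public static void main(String[] args) {
        Set<String> deprioritized = Set.of("a:27017");
        System.out.println(select(true, deprioritized, List.of("a:27017", "b:27017")));
        // prints: [b:27017]
        // fallback: everything is deprioritized, so the full list is returned
        System.out.println(select(true, deprioritized, List.of("a:27017")));
        // prints: [a:27017]
    }
}
```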

@stIncMale stIncMale requested a review from katcharov May 8, 2024 23:43