New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Direct retries to another mongos if one is available #1367
base: master
Are you sure you want to change the base?
Conversation
Implemented the change and unit tests JAVA-4254, JAVA-5320
JAVA-4254, JAVA-5320
JAVA-4254, JAVA-5320
JAVA-4254, JAVA-5320
driver-core/src/main/com/mongodb/internal/connection/OperationContext.java
Outdated
Show resolved
Hide resolved
driver-core/src/main/com/mongodb/internal/operation/AsyncOperationHelper.java
Outdated
Show resolved
Hide resolved
driver-core/src/main/com/mongodb/internal/connection/OperationContext.java
Show resolved
Hide resolved
driver-core/src/main/com/mongodb/internal/connection/OperationContext.java
Outdated
Show resolved
Hide resolved
driver-core/src/main/com/mongodb/internal/connection/BaseCluster.java
Outdated
Show resolved
Hide resolved
driver-core/src/main/com/mongodb/internal/connection/BaseCluster.java
Outdated
Show resolved
Hide resolved
driver-core/src/main/com/mongodb/internal/connection/OperationContext.java
Outdated
Show resolved
Hide resolved
driver-core/src/main/com/mongodb/internal/connection/OperationContext.java
Outdated
Show resolved
Hide resolved
driver-core/src/main/com/mongodb/internal/connection/OperationContext.java
Outdated
Show resolved
Hide resolved
driver-core/src/main/com/mongodb/internal/connection/OperationContext.java
Outdated
Show resolved
Hide resolved
driver-core/src/main/com/mongodb/internal/connection/OperationContext.java
Outdated
Show resolved
Hide resolved
(@Nullable Throwable previouslyChosenException, Throwable mostRecentAttemptException) -> | ||
CommandOperationHelper.onRetryableReadAttemptFailure( | ||
operationContext, previouslyChosenException, mostRecentAttemptException); | ||
return new RetryingAsyncCallbackSupplier<>(retryState, onAttemptFailure, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it difficult to understand what is happening in these methods, though I am unsure why. I think part of it is the amount of boilerplate: onAttemptFailure = (priorException, currentException) ->
, and then inlining, might help.
The other, more objective issue is that onAttemptFailure
is being used as a failedResultTransformer
, but I would expect such a thing to have no side effects (such as deprioritizing a server in OperationContext).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the docs to reflect that side effects are allowed. The key to them being allowed is that the operator is guaranteed to be called once per failed attempt.
I find it difficult to understand what is happening in these methods...
...inlining, might help...
Could you please clarify what you propose to inline and where?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am imagining at least something like:
static <R> AsyncCallbackSupplier<R> decorateReadWithRetriesAsync(final RetryState retryState,
final OperationContext operationContext, final AsyncCallbackSupplier<R> asyncReadFunction) {
return new RetryingAsyncCallbackSupplier<>(retryState,
onRetryableReadAttemptFailure(operationContext),
CommandOperationHelper::shouldAttemptToRetryRead,
logRetryAndGet(retryState, operationContext, asyncReadFunction));
}
This is optional for this PR, and perhaps should be undertaken separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like a code extraction into a method, rather than inlining. Done in 69b78ff.
…Context.java Let's put various checks (validation, preconditions...) at the top of methods, with the operation at the bottom. Co-authored-by: Maxim Katcharov <maxim.katcharov@mongodb.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
JAVA-4254
… side effects JAVA-4254
…tion The new approach allows us to later refactor all other logic inside one or more `ServerSelector`s. See the comment left in the code for more details on the new approach. JAVA-4254
…verSelector` JAVA-4254
…uirements JAVA-4254
JAVA-4254
.filter(serverDescription -> serversSnapshot.containsServer(serverDescription.getAddress())) | ||
.collect(toList()); | ||
List<ServerSelector> selectors = Stream.of( | ||
raceConditionPreFiltering, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment and selector can be moved to their own method:
raceConditionPreFiltering, | |
inSnapshotSelector(serversSnapshot), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If getCompleteServerSelector
gets too big in the future, that may be helpful. For now, however, it does not seem to be needed.
// are of those `Server`s that are known to both `clusterDescription` and `serversSnapshot`. | ||
// This way we are guaranteed to successfully get `Server`s from `serversSnapshot` based on the selected `ServerDescription`s. | ||
// | ||
// The pre-filtering we do to deal with the race condition described above is achieved by this `ServerSelector`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this comment should be in a PR comment, rather than in the code, and that there should be a test that ensures items missing from the snapshot get filtered out by this chain. The problem may have been complicated to identify and solve, but now that there is a solution, all the facts mentioned in this comment seem like they should be evident to unfamiliar readers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment should be in the code, not somewhere else, because it explains the code, that is not that obvious otherwise.
but now that there is a solution
Note that it has been solved previously, i.e., it is solved in the master
, just differently.
all the facts mentioned in this comment seem like they should be evident to unfamiliar readers
This is a huge overstatement in my opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there should be a test that ensures items missing from the snapshot get filtered out by this chain
Done in 246353f.
ServerSelector raceConditionPreFiltering = clusterDescriptionPotentiallyInconsistentWithServerSnapshot -> | ||
clusterDescriptionPotentiallyInconsistentWithServerSnapshot.getServerDescriptions() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ServerSelector raceConditionPreFiltering = clusterDescriptionPotentiallyInconsistentWithServerSnapshot -> | |
clusterDescriptionPotentiallyInconsistentWithServerSnapshot.getServerDescriptions() | |
ServerSelector raceConditionPreFiltering = clusterDescription -> | |
clusterDescription.getServerDescriptions() |
We would not bother to put a comment saying "this is potentially inconsistent with the server snapshot".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a local variable here that has a long name explaining what it is, and the variable is used only once. Why do you think that the generic clusterDescription
name is better here?
serverSelector, | ||
serverDeprioritization.getServerSelector(), | ||
settings.getServerSelector(), // may be null | ||
new LatencyMinimizingServerSelector(settings.getLocalThreshold(MILLISECONDS), MILLISECONDS), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should just pass the settings in to a constructor, and the constructor can figure out which settings are important to latency minimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR hasn't introduced this constructor, i.e., the instance is created exactly as it was created before the PR. I don't see a reason to make the change in this PR.
* <p>This class is not part of the public API and may be removed or changed at any time</p> | ||
*/ | ||
@ThreadSafe | ||
public final class OperationCountMinimizingServerSelector implements ServerSelector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like MinimumOperationCountServerSelector
would convey that this returns at most 1 result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 246353f.
driver-core/src/main/com/mongodb/internal/connection/OperationContext.java
Outdated
Show resolved
Hide resolved
driver-core/src/main/com/mongodb/internal/connection/OperationContext.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
void updateCandidate(final ServerAddress serverAddress, final ClusterType clusterType) { | ||
candidate = isEnabled(clusterType) ? serverAddress : null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
candidate = isEnabled(clusterType) ? serverAddress : null; | |
candidate = serverAddress; |
I do not think the check is needed here. This potential "server on which the operation failed" will already be correctly excluded, because the ServerDeprioritization
server selector is already a no-op when not sharded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking more about this, I agree that it's fine to omit the check in the updateCandidate
method.
Done in 246353f.
List<ServerDescription> serverDescriptions = clusterDescription.getServerDescriptions(); | ||
if (!isEnabled(clusterDescription.getType())) { | ||
return serverDescriptions; | ||
} else { | ||
List<ServerDescription> nonDeprioritizedServerDescriptions = serverDescriptions | ||
.stream() | ||
.filter(serverDescription -> !deprioritized.contains(serverDescription.getAddress())) | ||
.collect(toList()); | ||
return nonDeprioritizedServerDescriptions.isEmpty() ? serverDescriptions : nonDeprioritizedServerDescriptions; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List<ServerDescription> serverDescriptions = clusterDescription.getServerDescriptions(); | |
if (!isEnabled(clusterDescription.getType())) { | |
return serverDescriptions; | |
} else { | |
List<ServerDescription> nonDeprioritizedServerDescriptions = serverDescriptions | |
.stream() | |
.filter(serverDescription -> !deprioritized.contains(serverDescription.getAddress())) | |
.collect(toList()); | |
return nonDeprioritizedServerDescriptions.isEmpty() ? serverDescriptions : nonDeprioritizedServerDescriptions; | |
} | |
List<ServerDescription> serverDescriptions = clusterDescription.getServerDescriptions(); | |
if (!isEnabled(clusterDescription.getType())) { | |
return serverDescriptions; | |
} | |
List<ServerDescription> nonDeprioritizedServerDescriptions = serverDescriptions | |
.stream() | |
.filter(serverDescription -> !deprioritized.contains(serverDescription.getAddress())) | |
.collect(toList()); | |
return nonDeprioritizedServerDescriptions.isEmpty() | |
? serverDescriptions : nonDeprioritizedServerDescriptions; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 246353f.
(@Nullable Throwable previouslyChosenException, Throwable mostRecentAttemptException) -> | ||
CommandOperationHelper.onRetryableReadAttemptFailure( | ||
operationContext, previouslyChosenException, mostRecentAttemptException); | ||
return new RetryingAsyncCallbackSupplier<>(retryState, onAttemptFailure, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am imagining at least something like:
static <R> AsyncCallbackSupplier<R> decorateReadWithRetriesAsync(final RetryState retryState,
final OperationContext operationContext, final AsyncCallbackSupplier<R> asyncReadFunction) {
return new RetryingAsyncCallbackSupplier<>(retryState,
onRetryableReadAttemptFailure(operationContext),
CommandOperationHelper::shouldAttemptToRetryRead,
logRetryAndGet(retryState, operationContext, asyncReadFunction));
}
This is optional for this PR, and perhaps should be undertaken separately.
JAVA-4254
Given that the current specification wording is internally inconsistent, I confirmed the actual intent with the spec author and created DRIVERS-2901 Clarify the intent behind the list of deprioritized mongos'es and fix the pseudocode.
Performance considerations
Our
ServerSelector
andClusterDescription
API do not allow us to implement an efficient pipeline (CompositeServerSelector
) ofServerSelector
s: we can neither mutate theList<ServerDescription>
in place, nor mutateClusterDescription
, nor even reuse the sameClusterDescription
if a selector did not filter anything out. This PR added one more selector required by the specification logic, and introduced two more selectors by refactoring server selection code that wasn't expressed in terms ofServerSelector
s. As a result, it is conceivable that the PR has negative performance impact.Additionally, due to this refactoring d25010d, each server selection iteration now involves copying the map of
Server
s maintained by aCluster
. While that copying does not entail locking (all hail the CHM!), it may still have additional negative performance impact.If we indeed notice performance degradation, we may try mitigating the issue by introducing
InternalServerSelector extends ServerSelector
(or a subclass ofClusterDescription
that allows mutatinggetServerDescriptions
, or both), which allows for a more optimal chaining, and use it for everything but the application-specific selector. This is assuming, of course, that the would be degradation is caused to a large extent by the inefficient selector chaining, and not by the CHM copying.JAVA-4254, JAVA-5320