Fail fast at startup for invalid WAN merge policy #18875
Conversation
@@ -77,6 +78,9 @@ protected boolean disableWanReplicationEvent() {

    @Override
    protected void runInternal() {
        checkMapMergePolicy(mapContainer.getMapConfig(), mergePolicy.getClass().getName(),
                getNodeEngine().getSplitBrainMergePolicyProvider());
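The `checkMapMergePolicy` call in the diff guards against merge policies that need per-entry statistics on a map that does not record them. A minimal, self-contained sketch of that kind of validation (the policy names match Hazelcast's stats-dependent policies, but the method shape and exception type here are illustrative, not the actual internals):

```java
import java.util.Set;

public class MergePolicyCheck {
    // Illustrative set: policies that read per-entry stats such as hits or lastAccessTime.
    static final Set<String> STATS_DEPENDENT_POLICIES =
            Set.of("HitsMergePolicy", "LatestAccessMergePolicy");

    // Throws if the configured policy needs stats the map does not record.
    static void checkMapMergePolicy(boolean perEntryStatsEnabled, String mergePolicyClass) {
        String simpleName = mergePolicyClass.substring(mergePolicyClass.lastIndexOf('.') + 1);
        if (!perEntryStatsEnabled && STATS_DEPENDENT_POLICIES.contains(simpleName)) {
            throw new IllegalArgumentException("Merge policy " + simpleName
                    + " requires per-entry stats to be enabled on the map");
        }
    }

    public static void main(String[] args) {
        // Stats enabled: passes silently.
        checkMapMergePolicy(true, "com.hazelcast.spi.merge.HitsMergePolicy");
        // Stats disabled + stats-dependent policy: rejected.
        try {
            checkMapMergePolicy(false, "com.hazelcast.spi.merge.HitsMergePolicy");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The review below turns on where exactly this check should run: at node startup, or inside the merge operation at runtime.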
Are we testing the case where this call fails? If I'm right, this fails in the WAN target cluster when it has stats disabled while the merge policy in use relies on the stats. Can split-brain healing get here with an invalid config? As I understand it, with the recent changes we fail fast on startup. Or can a RU (rolling upgrade) config change drive us here and fail during split-brain healing?
Added a PR description to explain this.
You are right that this check can fail after a RU config change, but I'm not sure what other behavior would be expected in such a case.
We can probably get here. I don't think we do any checks for dynamically added configs, so you can add an invalid config and get here after split-brain. I don't know about the other questions.
I'm also curious about a related topic: does this operation fail or get retried after this throws an exception? I'm wondering whether we ever give up, both in split-brain healing and WAN, or whether we keep filling the logs and retrying.
@ahmetmircik which RU config is this? And what does this check cover? I forgot.
I'm wondering if it's something we can simply avoid checking for now, and just check WAN merge policies at startup, not runtime.
It covers the scenario I mentioned in the PR description: you have WAN replication between two clusters, and the source cluster has per-entry-stats enabled while the target cluster doesn't. In this case, using the same merge policy on both sides doesn't make sense (since it can require per-entry-stats enabled on both sides), hence we catch this anomaly. I see this as an edge case, but it is still possible.
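The mismatch described here could look roughly like this in declarative config (a sketch; `per-entry-stats-enabled` and `LatestAccessMergePolicy` follow Hazelcast 4.2 naming, but treat the exact element names as assumptions):

```xml
<!-- Source cluster: records per-entry stats, so a stats-based merge policy is valid locally -->
<map name="replicated-map">
    <per-entry-stats-enabled>true</per-entry-stats-enabled>
    <wan-replication-ref name="to-target">
        <merge-policy-class-name>com.hazelcast.spi.merge.LatestAccessMergePolicy</merge-policy-class-name>
    </wan-replication-ref>
</map>

<!-- Target cluster: per-entry stats disabled, so the same policy's input is meaningless here -->
<map name="replicated-map">
    <per-entry-stats-enabled>false</per-entry-stats-enabled>
</map>
```

Neither cluster's config is invalid on its own, which is why startup validation on either side cannot catch the combination.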
So that's the RU case Zoltan was mentioning? Someone with this kind of setup doing a RU from 4.2.0 to 4.2.1 would start getting failures? In that case, do we assume the 4.2.0 policy would simply always choose one of the two entries, because one of the values is always -1?
Not sure about SB (split-brain), but WAN expects the replicated event to succeed: we keep sending it until we get an ACK. So the clusters will start playing a never-ending ping-pong if this check fails on the target. The only solution I can think of is extending the WAN protocol with a feature to respond with "hey, the merge policy you are trying to use for DS XXX doesn't work for me, stop replicating that DS".
Complex :) I'm leaning towards simply removing this check. For now?
Yes, I didn't mean to do this change in a patch release :) We'll fail here (or soon after the check) and the situation is the same, so I'd opt for the most accurate error message. Maybe we can optionally delay the NAK to prevent the source cluster from overwhelming the target and to leave time for the ops to intervene. I need to check how that can be done.
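The "delay the NAK" idea amounts to backing off the failure response so a misconfigured source cannot hammer the target in a tight retry loop. A purely illustrative sketch (none of these names exist in Hazelcast's WAN protocol):

```java
public class DelayedNak {
    /**
     * Grows the delay before sending a failure response to the source cluster.
     * Exponential backoff, capped at maxMillis so ops still see timely failures.
     */
    static long nakDelayMillis(int consecutiveFailures, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(consecutiveFailures, 20);
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        for (int failures = 0; failures <= 6; failures++) {
            System.out.println(failures + " failures -> wait "
                    + nakDelayMillis(failures, 100, 30_000) + " ms");
        }
    }
}
```

This only spaces out the ping-pong; actually stopping replication for the affected data structure would still need the protocol extension mentioned above.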
I'm for removing the check but adding a log message.
Does this fix #18804? Also, can you add a description to the PR describing the new behaviour?
Overall looks good, mostly curious if we can avoid the check in the merge operation for now.
Force-pushed from a60d504 to f70fa31
Modifications:
- Validate WAN and split-brain merge policies at node-engine creation time to fail fast at node start time.
- Print a warning log in the execution phase of MergeOperation to catch unreasonable merge-policy configs that cannot be caught at start-up time. An example: you have WAN replication between two clusters, and the source cluster has per-entry-stats enabled while the target cluster doesn't. In this case, MergeOperation catches the anomaly and prints it as a warning.
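The resulting two-phase behaviour can be sketched as: throw during node-engine creation, but only log during merge execution, since at merge time the offending policy may have arrived from a remote WAN cluster. A self-contained sketch with made-up names, not the actual Hazelcast code:

```java
import java.util.ArrayList;
import java.util.List;

public class MergePolicyValidation {
    static final List<String> WARNINGS = new ArrayList<>();

    static boolean policyNeedsStats(String policy) {
        return policy.endsWith("HitsMergePolicy") || policy.endsWith("LatestAccessMergePolicy");
    }

    // Start-up phase: fail fast so a node never comes up with a broken local config.
    static void validateAtStartup(boolean statsEnabled, String policy) {
        if (!statsEnabled && policyNeedsStats(policy)) {
            throw new IllegalStateException("Invalid merge policy at startup: " + policy);
        }
    }

    // Merge-operation phase: throwing here would trigger endless WAN retries,
    // so warn and continue instead of failing.
    static void checkAtMergeTime(boolean statsEnabled, String policy) {
        if (!statsEnabled && policyNeedsStats(policy)) {
            WARNINGS.add("Merge policy " + policy
                    + " relies on per-entry stats, which are disabled here");
        }
    }

    public static void main(String[] args) {
        checkAtMergeTime(false, "com.hazelcast.spi.merge.HitsMergePolicy");
        System.out.println("merge continued with " + WARNINGS.size() + " warning(s)");
    }
}
```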
closes #18804