
Fail fast at startup for invalid WAN merge policy #18875

Merged
merged 4 commits into hazelcast:4.2.z from fix/4.2/wanFailFast on Jun 17, 2021

Conversation

@ahmetmircik (Member) commented Jun 9, 2021

closes #18804

Modifications:

  • Validate WAN and split-brain merge policies at node-engine creation time so that an invalid configuration fails fast at node start time.
  • Also adds logging to MergeOperation to catch unreasonable merge-policy configs that cannot be caught at start-up. An example: you have WAN replication between two clusters; even though the source cluster has per-entry stats enabled, the target cluster doesn't. In this case MergeOperation catches this anomaly and prints it as a warning.
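The startup-time validation described above can be sketched as follows. This is a hypothetical illustration, not Hazelcast's actual internals: the class, method, and the membership of the policy set are all assumptions made for the example (the policy names mirror real Hazelcast merge-policy class names).

```java
import java.util.Set;

// Illustrative sketch of the fail-fast idea, NOT Hazelcast's actual code:
// reject a merge policy at node-engine creation time when it depends on
// per-entry statistics that the map config has disabled.
public class MergePolicyStartupValidator {

    // Hypothetical set of policies that need per-entry stats to work.
    private static final Set<String> NEEDS_PER_ENTRY_STATS =
            Set.of("LatestAccessMergePolicy", "LatestUpdateMergePolicy");

    public static void validate(String mergePolicyClassName, boolean perEntryStatsEnabled) {
        if (NEEDS_PER_ENTRY_STATS.contains(mergePolicyClassName) && !perEntryStatsEnabled) {
            // Throwing here aborts node startup, which is the fail-fast
            // behaviour this PR introduces for invalid WAN merge policies.
            throw new IllegalArgumentException("Merge policy " + mergePolicyClassName
                    + " requires per-entry stats, which are disabled for this map");
        }
    }
}
```

Failing at startup like this surfaces the misconfiguration to the operator immediately, rather than at the first merge after a split-brain or WAN event.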

@ahmetmircik ahmetmircik added the Type: Enhancement, Source: Community, and Module: IMap labels Jun 9, 2021
@ahmetmircik ahmetmircik added this to the 4.2.1 milestone Jun 10, 2021
@ahmetmircik ahmetmircik changed the title [WIP] Fail fast at startup for invalid WAN merge policy Fail fast at startup for invalid WAN merge policy Jun 14, 2021
@ahmetmircik ahmetmircik marked this pull request as ready for review June 14, 2021 08:39
@@ -77,6 +78,9 @@ protected boolean disableWanReplicationEvent() {

@Override
protected void runInternal() {
checkMapMergePolicy(mapContainer.getMapConfig(), mergePolicy.getClass().getName(),
getNodeEngine().getSplitBrainMergePolicyProvider());
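The checkMapMergePolicy call in the diff above can be illustrated with a warning-only sketch, the variant this thread later leans toward. Everything here is a hypothetical stand-in, not the real MergeOperation code; the class name, parameters, and message text are assumptions for illustration.

```java
import java.util.logging.Logger;

// Illustrative warning-only variant of the runtime check, NOT the real
// MergeOperation code: instead of failing the merge when the incoming
// policy needs per-entry stats that this cluster has disabled, log a
// warning so operators can spot the asymmetric configuration.
public class MergePolicyRuntimeCheck {

    private static final Logger LOGGER =
            Logger.getLogger(MergePolicyRuntimeCheck.class.getName());

    // Returns true when the policy/config combination is usable;
    // logs a warning and returns false otherwise.
    public static boolean checkMapMergePolicy(String mapName,
                                              boolean policyNeedsPerEntryStats,
                                              boolean perEntryStatsEnabled) {
        if (policyNeedsPerEntryStats && !perEntryStatsEnabled) {
            LOGGER.warning("Merge policy for map " + mapName
                    + " relies on per-entry stats, but they are disabled on this cluster;"
                    + " merged values may be chosen arbitrarily.");
            return false;
        }
        return true;
    }
}
```

Logging instead of throwing avoids the retry ping-pong discussed below, at the cost of letting a questionable merge proceed.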

Contributor:

Are we testing the case where this call fails? If I'm right, this fails in the WAN target cluster if it has stats disabled while the merge policy in use relies on the stats. Can split-brain healing get here with an invalid config? As I understand it, with the recent changes we fail fast on startup. Or can an RU config change bring us here and fail it during split-brain healing?

Member (author):

Added a PR description to explain this.
You are right that this check can fail after an RU config change, but I'm not sure what the other expected behavior in such a case could be?

Contributor:

Probably we can get here. I don't think we do any checks for dynamically added configs, so you could add an invalid config and get here after a split-brain. Don't know about the other questions.

I'm also curious about a related topic. Does this operation fail or get retried after this throws an exception? I'm wondering if we ever give up, both in split-brain healing and WAN, or if we keep filling the logs and retrying.

Contributor:

@ahmetmircik which RU config is this, and what does this check cover? I forgot.
I'm wondering if it's something we can simply avoid checking for now, and just check WAN merge policies at startup rather than at runtime.

Member (author):

It covers the scenario I mentioned in the PR description: you have WAN replication between two clusters; even though the source cluster has per-entry stats enabled, the target cluster doesn't. In this case having the same merge policy on both sides doesn't make sense (since it can require per-entry stats to be enabled on both sides), hence we catch this anomaly. I see this as an edge case, but it's still possible.

Contributor:

So that's the RU case that Zoltan was mentioning? Someone having this kind of setup and doing RU from 4.2.0 to 4.2.1 would start getting failures? In this case we assume that the 4.2.0 policy would simply always choose one of the two entries, because one of the values is always -1?

Contributor:

Not sure about SB, but WAN expects the replicated event to succeed. We keep sending it until we get an ACK, so the clusters will start playing a never-ending ping-pong if this check fails. Well, if anything fails on the target. The only solution I can think of is extending the WAN protocol with a feature to respond with "hey, the merge policy you try to use for DS XXX doesn't work for me, stop replicating that DS".

Contributor:

Complex :) I'm leaning towards simply removing this check. For now?

Contributor:

Yes, I didn't mean to do this change in a patch release :) We'll fail here (or soon after the check) and the situation is the same, so I'd opt for the most accurate error message. Maybe we can optionally delay the NAK to prevent the source cluster from overwhelming the target and leave time for the ops to intervene. Need to check how it can be done.

Member (author):

I'm for removing the check but adding a log message.

@hazelcast hazelcast deleted a comment from hz-devops-test Jun 16, 2021
@hazelcast hazelcast deleted a comment from hz-devops-test Jun 16, 2021
@mmedenjak mmedenjak added the Module: WAN, Team: Core, and Source: Internal labels and removed the Source: Community label Jun 16, 2021
@mmedenjak (Contributor) commented Jun 16, 2021

Does this fix #18804?

Also, can you add a description to the PR describing the new behaviour?

@mmedenjak mmedenjak (Contributor) left a review:

Overall looks good, mostly curious if we can avoid the check in the merge operation for now.

@ahmetmircik ahmetmircik merged commit 1e973be into hazelcast:4.2.z Jun 17, 2021
@ahmetmircik ahmetmircik deleted the fix/4.2/wanFailFast branch June 17, 2021 12:15
ahmetmircik added a commit to ahmetmircik/hazelcast that referenced this pull request Jun 17, 2021
Modifications:
- Validate WAN and split-brain merge policies at node-engine creation time to fail fast at node start time.
- Print a warning log in the execution phase of MergeOperation to catch unreasonable merge-policy configs that cannot be caught at start-up time. An example: you have WAN replication between two clusters; even though the source cluster has per-entry stats enabled, the target cluster doesn't. In this case, MergeOperation catches this anomaly and prints it as a warning.
mmedenjak pushed a commit that referenced this pull request Jun 17, 2021
(same commit message as above)