Fail fast at startup for invalid WAN merge policy #18875
Conversation
@@ -77,6 +78,9 @@ protected boolean disableWanReplicationEvent() {

    @Override
    protected void runInternal() {
        checkMapMergePolicy(mapContainer.getMapConfig(), mergePolicy.getClass().getName(),
                getNodeEngine().getSplitBrainMergePolicyProvider());
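The `checkMapMergePolicy` call in the diff guards against merge policies that need per-entry statistics on a map that does not record them. A minimal, self-contained sketch of that kind of validation (the policy names match Hazelcast's stats-dependent policies, but the method shape and exception type here are illustrative, not the actual internals):

```java
import java.util.Set;

public class MergePolicyCheck {
    // Illustrative set: policies that read per-entry stats such as hits or lastAccessTime.
    static final Set<String> STATS_DEPENDENT_POLICIES =
            Set.of("HitsMergePolicy", "LatestAccessMergePolicy");

    // Throws if the configured policy needs stats the map does not record.
    static void checkMapMergePolicy(boolean perEntryStatsEnabled, String mergePolicyClass) {
        String simpleName = mergePolicyClass.substring(mergePolicyClass.lastIndexOf('.') + 1);
        if (!perEntryStatsEnabled && STATS_DEPENDENT_POLICIES.contains(simpleName)) {
            throw new IllegalArgumentException("Merge policy " + simpleName
                    + " requires per-entry stats to be enabled on the map");
        }
    }

    public static void main(String[] args) {
        // Stats enabled: passes silently.
        checkMapMergePolicy(true, "com.hazelcast.spi.merge.HitsMergePolicy");
        // Stats disabled + stats-dependent policy: rejected.
        try {
            checkMapMergePolicy(false, "com.hazelcast.spi.merge.HitsMergePolicy");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The review below turns on where exactly this check should run: at node startup, or inside the merge operation at runtime.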
Are we testing the case where this call fails? If I'm right, this fails in the WAN target cluster when it has stats disabled while the merge policy in use relies on the stats. Can split-brain healing get here with an invalid config? As I understand it, with the recent changes we fail fast on startup. Or can a RU (rolling upgrade) config change drive us here and fail during split-brain healing?
Added a PR description to explain this.
You are right that this check can fail after a RU config change, but I'm not sure what other behavior would be expected in such a case.
We can probably get here. I don't think we do any checks for dynamically added configs, so you can add an invalid config and get here after split-brain. I don't know about the other questions.
I'm also curious about a related topic: does this operation fail or get retried after this throws an exception? I'm wondering whether we ever give up, both in split-brain healing and WAN, or whether we keep filling the logs and retrying.
@ahmetmircik which RU config is this? And what does this check cover? I forgot.
I'm wondering if it's something we can simply avoid checking for now, and just check WAN merge policies at startup, not runtime.
It covers the scenario I mentioned in the PR description: you have WAN replication between two clusters, and the source cluster has per-entry-stats enabled while the target cluster doesn't. In this case, using the same merge policy on both sides doesn't make sense (since it can require per-entry-stats enabled on both sides), hence we catch this anomaly. I see this as an edge case, but it is still possible.
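The mismatch described here could look roughly like this in declarative config (a sketch; `per-entry-stats-enabled` and `LatestAccessMergePolicy` follow Hazelcast 4.2 naming, but treat the exact element names as assumptions):

```xml
<!-- Source cluster: records per-entry stats, so a stats-based merge policy is valid locally -->
<map name="replicated-map">
    <per-entry-stats-enabled>true</per-entry-stats-enabled>
    <wan-replication-ref name="to-target">
        <merge-policy-class-name>com.hazelcast.spi.merge.LatestAccessMergePolicy</merge-policy-class-name>
    </wan-replication-ref>
</map>

<!-- Target cluster: per-entry stats disabled, so the same policy's input is meaningless here -->
<map name="replicated-map">
    <per-entry-stats-enabled>false</per-entry-stats-enabled>
</map>
```

Neither cluster's config is invalid on its own, which is why startup validation on either side cannot catch the combination.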
So that's the RU case Zoltan was mentioning? Someone with this kind of setup doing a RU from 4.2.0 to 4.2.1 would start getting failures? In that case, do we assume the 4.2.0 policy would simply always choose one of the two entries, because one of the values is always -1?
Not sure about SB (split-brain), but WAN expects the replicated event to succeed: we keep sending it until we get an ACK. So the clusters will start playing a never-ending ping-pong if this check fails on the target. The only solution I can think of is extending the WAN protocol with a feature to respond with "hey, the merge policy you are trying to use for DS XXX doesn't work for me, stop replicating that DS".
Complex :) I'm leaning towards simply removing this check. For now?
Yes, I didn't mean to do this change in a patch release :) We'll fail here (or soon after the check) and the situation is the same, so I'd opt for the most accurate error message. Maybe we can optionally delay the NAK to prevent the source cluster from overwhelming the target and to leave time for the ops to intervene. I need to check how that can be done.
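The "delay the NAK" idea amounts to backing off the failure response so a misconfigured source cannot hammer the target in a tight retry loop. A purely illustrative sketch (none of these names exist in Hazelcast's WAN protocol):

```java
public class DelayedNak {
    /**
     * Grows the delay before sending a failure response to the source cluster.
     * Exponential backoff, capped at maxMillis so ops still see timely failures.
     */
    static long nakDelayMillis(int consecutiveFailures, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(consecutiveFailures, 20);
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        for (int failures = 0; failures <= 6; failures++) {
            System.out.println(failures + " failures -> wait "
                    + nakDelayMillis(failures, 100, 30_000) + " ms");
        }
    }
}
```

This only spaces out the ping-pong; actually stopping replication for the affected data structure would still need the protocol extension mentioned above.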
I'm for removing the check but adding a log message.
Does this fix #18804? Also, can you add a description to the PR describing the new behaviour?
Overall looks good, mostly curious if we can avoid the check in the merge operation for now.
Force-pushed from a60d504 to f70fa31
Modifications:
- Validate WAN and split-brain merge policies at node-engine creation time to fail fast at node start time.
- Print a warning log in the execution phase of MergeOperation to catch unreasonable merge-policy configs that cannot be caught at start-up time. An example: you have WAN replication between two clusters, and the source cluster has per-entry-stats enabled while the target cluster doesn't. In this case, MergeOperation catches the anomaly and prints it as a warning.
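The resulting two-phase behaviour can be sketched as: throw during node-engine creation, but only log during merge execution, since at merge time the offending policy may have arrived from a remote WAN cluster. A self-contained sketch with made-up names, not the actual Hazelcast code:

```java
import java.util.ArrayList;
import java.util.List;

public class MergePolicyValidation {
    static final List<String> WARNINGS = new ArrayList<>();

    static boolean policyNeedsStats(String policy) {
        return policy.endsWith("HitsMergePolicy") || policy.endsWith("LatestAccessMergePolicy");
    }

    // Start-up phase: fail fast so a node never comes up with a broken local config.
    static void validateAtStartup(boolean statsEnabled, String policy) {
        if (!statsEnabled && policyNeedsStats(policy)) {
            throw new IllegalStateException("Invalid merge policy at startup: " + policy);
        }
    }

    // Merge-operation phase: throwing here would trigger endless WAN retries,
    // so warn and continue instead of failing.
    static void checkAtMergeTime(boolean statsEnabled, String policy) {
        if (!statsEnabled && policyNeedsStats(policy)) {
            WARNINGS.add("Merge policy " + policy
                    + " relies on per-entry stats, which are disabled here");
        }
    }

    public static void main(String[] args) {
        checkAtMergeTime(false, "com.hazelcast.spi.merge.HitsMergePolicy");
        System.out.println("merge continued with " + WARNINGS.size() + " warning(s)");
    }
}
```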
closes #18804