New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fail fast at startup for invalid WAN merge policy #18875
Merged
Merged
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we testing the case if this call fails? If I'm right, this fails in the WAN target cluster if it has stats disabled while the used merge policy relies on the stats. Can the split-brain healing get here with an invalid config? As I get it, with the recent changes we fail-fast on startup. Or can a RU config change drive us here and fail it during split-brain healing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added PR description to explain this.
You are right that this check can fail after a RU config change but not sure what can be the other expected behavior in such a case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we can get here. I don't think we do any checks for dynamically added configs, so you can add invalid config and get here after split-brain. Don't know about the other questions.
I'm also curious about a related topic. Is this operation failed or retried after this throws an exception? Wondering if we ever give up, both in split-brain healing and WAN, or do we keep filling the logs and retrying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahmetmircik which RU config is this? And what does this check cover? I forgot.
I'm wondering if it's something we can simply avoid checking for now, and just check WAN merge policies at startup, not runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It covers the scenario i mentioned in PR description: You have WAN replication between 2 clusters, even though source cluster has per-entry-stats enabled, target cluster doesn't have it enabled. In this case having same merge-policy in both sides doesn't make sense(since it can require per-entry-stats-enabled on both sides) hence we catch this anomaly. I see this as an edge case but still possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that's the RU case that Zoltan was mentioning? Someone having this kind of setup and doing RU from 4.2.0 to 4.2.1 would start getting failures? In this case we assume that the 4.2.0 policy would simply always choose one of the two entries, because one of the values is always -1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about SB, but WAN expects the replicated event to succeed. We keep sending it until we get an ACK. So the clusters will start playing a neverending ping-pong if this check fails. Well, if whatever fails on the target. The only solution I can think of is extending the WAN protocol with a feature to respond with "hey, the merge policy you try to use for DS XXX doesn't work for me, stop replicating that DS".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Complex :) I'm leaning towards simply removing this check. For now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I didn't mean to do this change in a patch release :) We'll fail here (or soon after the check) and the situation is the same. So I'd opt for the most accurate error message. Maybe we can optionally delay the NAK to prevent the source cluster to overwhelm the target and leave time for the ops to intervene. Need to check how it can be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm for removing the check but adding a log message.