Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert conversion to Lite member during Split-Brain healing #12691

Merged
merged 2 commits into from Apr 6, 2018

Conversation

ahmetmircik
Copy link
Member

@ahmetmircik ahmetmircik commented Mar 22, 2018

ee counterpart: https://github.com/hazelcast/hazelcast-enterprise/pull/2021

  • reverts conversion to lite member
  • disposes stores upon merge
  • unifies hd merge code with heap merge codes, ee side merge code will be removed

closes https://github.com/hazelcast/hazelcast-mono/issues/1427
closes #12405

@ahmetmircik ahmetmircik added this to the 3.10 milestone Mar 22, 2018
@ahmetmircik ahmetmircik force-pushed the fix/3.10/icmpFix branch 4 times, most recently from 12485c5 to 39f80fc Compare March 28, 2018 20:48
@ahmetmircik ahmetmircik changed the title WIP Revert conversion to lite member during heal Mar 29, 2018
@ahmetmircik ahmetmircik force-pushed the fix/3.10/icmpFix branch 3 times, most recently from 3fa85e6 to 8ab074c Compare March 29, 2018 10:30
@@ -106,19 +100,6 @@ private void disposeTasks(Collection<Runnable>... tasks) {
}
}

private void tryToPromoteLocalLiteMember() {
if (wasLiteMember) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wasLiteMember field can be deleted too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -228,7 +227,7 @@ public DistributedObject createDistributedObject(String cacheNameWithPrefix) {
cacheConfig.setManagerPrefix(HazelcastCacheManager.CACHE_MANAGER_PREFIX);
}

CacheMergePolicyProvider mergePolicyProvider = splitBrainHandlerService.getMergePolicyProvider();
CacheMergePolicyProvider mergePolicyProvider = new CacheMergePolicyProvider(nodeEngine);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a single CacheMergePolicyProvider instance per HZ node, which is used for the config check and the split-brain merging. The provider class caches the created merge policy instances, so with a single instance we know for sure that the merge policy instance, which survived the config check is used in the split-brain process later. It also saves us a bunch of Class.for() calls and garbage.

That's why the legacy merge policy providers and all other data structures use the single instance of SplitBrainMergeProvider from the NodeEngine.

Same applies for IMap and ReplicatedMap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically the services or map service context should store the created merge policy provider and make it accessible to the merge process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can define a LegacySplitBrainMergePolicyProvider interface with a Object getMergePolicy(String mergePolicyName) method. Then we can store a single instance in the AbstractSplitBrainHandlerService and make it easily accessible for the config checks and merge runnables.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it to one-instance-per-service.

@ahmetmircik ahmetmircik force-pushed the fix/3.10/icmpFix branch 3 times, most recently from 5c11902 to 4c5d483 Compare April 2, 2018 08:00
Copy link
Contributor

@mdogan mdogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just scanned lite-member related parts. 👍

// this method is overridden on ee
protected CacheSplitBrainHandlerService newSplitBrainHandlerService(NodeEngine nodeEngine) {
return new CacheSplitBrainHandlerService(nodeEngine, configs, segments);
public CacheMergePolicyProvider getMergePolicyProvider() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: This can be package private if the test is moved to the same package. Also, if the field is package-private, we can remove the getter altogether.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kept this as is, it is being used by different packages on ee side


if (!isDiscardPolicy(mergePolicy)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we will collect even the stores that will be discarded and discard them later in the runnable. Any special reason for doing so? Can we revert to the old version on this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this checks to split-brain-handler-service.

}

private void asyncDestroyStores(final int partitionID, final Collection<Store> stores) {
operationExecutor.execute(new PartitionSpecificRunnable() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it happen that the partition is migrating at this point (e.g. because more nodes are joining the cluster)? If so, can you use the InvocationUtil#executeLocallyWithRetry?

* Like {@link #destroy()} but does not touch state on other services
* like lock service or event journal service.
*/
void destroyInternals();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's getting a bit cofusing - we have destroy (destroys the internals and the connected services), destroyInternals (destroys only the map internals), reset (I'm not sure how this one fits in, it also resets some data but it keeps some other like indexes and connected services like locks and journal) and clearPartition (just clears everything, also connected services).
I don't see a simple way of improving this. Maybe a task for a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't agree more 💯

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed for this PR to pass the tests but filed an improvement issue, this problem is a bit tricky. #12779

@Donnerbart Donnerbart changed the title Revert conversion to lite member during heal Revert conversion to Lite member during Split-Brain healing Apr 3, 2018
@ahmetmircik
Copy link
Member Author

run-lab-run

@pveentjer
Copy link
Contributor

@ahmetmircik can you rebase?

}

void asyncDestroyStores(final Collection<Store> stores, final int partitionID) {
operationExecutor.execute(new PartitionSpecificRunnable() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the partition be migrating? Do we need to retry? See - #12691 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to retry, PartitionSpecificRunnable can independently run on local without doing any checks like is-migration-in-progress.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Thanks!

}

protected void onMerge(String dataStructureName) {
// override to take action on prepare of merging data structures
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: this is executed on merge. You can fix this in a separate PR.

}

void asyncDestroyStores(final Collection<Store> stores, final int partitionID) {
operationExecutor.execute(new PartitionSpecificRunnable() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Thanks!

@ahmetmircik ahmetmircik merged commit 8e2e01f into hazelcast:master Apr 6, 2018
@ahmetmircik ahmetmircik deleted the fix/3.10/icmpFix branch April 6, 2018 07:41
@mmedenjak mmedenjak added the Source: Internal PR or issue was opened by an employee label Apr 13, 2020
@hz-devops-test
Copy link

The job Hazelcast-pr-compiler of your PR failed. (Hazelcast internal details: build log, artifacts).
Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log file
--------------------------
---------SUMMARY----------
--------------------------
[ERROR] COMPILATION ERROR : 
--------------------------
[ERROR] error: Source option 6 is no longer supported. Use 7 or later.
--------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project hazelcast: Compilation failure: Compilation failure: 
--------------------------
---------ERRORS-----------
--------------------------
[ERROR] COMPILATION ERROR : 
--------------------------
[ERROR] error: Source option 6 is no longer supported. Use 7 or later.
--------------------------
[ERROR] error: Target option 6 is no longer supported. Use 7 or later.
--------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project hazelcast: Compilation failure: Compilation failure: 
--------------------------
[ERROR] error: Source option 6 is no longer supported. Use 7 or later.
--------------------------
[ERROR] error: Target option 6 is no longer supported. Use 7 or later.
--------------------------
[ERROR] -> [Help 1]
--------------------------
[ERROR] 
--------------------------
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
--------------------------
[ERROR] 
--------------------------
[ERROR] For more information about the errors and possible solutions, please read the following articles:
--------------------------
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
--------------------------
[ERROR] 
--------------------------
[ERROR] After correcting the problems, you can resume the build with the command
--------------------------
[ERROR]   mvn  -rf :hazelcast
--------------------------

@hz-devops-test
Copy link

The job Hazelcast-pr-EE-compiler of your PR failed. (Hazelcast internal details: build log, artifacts).
Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log file
--------------------------
---------SUMMARY----------
--------------------------
[ERROR] COMPILATION ERROR : 
--------------------------
[ERROR] error: Source option 6 is no longer supported. Use 7 or later.
--------------------------
---------ERRORS-----------
--------------------------
[ERROR] COMPILATION ERROR : 
--------------------------
[ERROR] error: Source option 6 is no longer supported. Use 7 or later.
--------------------------
[ERROR] error: Target option 6 is no longer supported. Use 7 or later.
--------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project hazelcast: Compilation failure: Compilation failure: 
--------------------------
[ERROR] error: Source option 6 is no longer supported. Use 7 or later.
--------------------------
[ERROR] error: Target option 6 is no longer supported. Use 7 or later.
--------------------------
[ERROR] -> [Help 1]
--------------------------
[ERROR] 
--------------------------
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
--------------------------
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
--------------------------
[ERROR] 
--------------------------
[ERROR] For more information about the errors and possible solutions, please read the following articles:
--------------------------
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
--------------------------
[ERROR] 
--------------------------
[ERROR] After correcting the problems, you can resume the build with the command
--------------------------
[ERROR]   mvn  -rf :hazelcast
--------------------------

@hz-devops-test
Copy link

The job Hazelcast-pr-builder of your PR failed. (Hazelcast internal details: build log, artifacts).
Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log file
---------ERRORS-----------
--------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project hazelcast: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test failed: java.lang.NoClassDefFoundError: java/sql/Timestamp: java.sql.Timestamp -> [Help 1]
--------------------------
[ERROR] 
--------------------------
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
--------------------------
[ERROR] 
--------------------------
[ERROR] For more information about the errors and possible solutions, please read the following articles:
--------------------------
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
--------------------------
[ERROR] 
--------------------------
[ERROR] After correcting the problems, you can resume the build with the command
--------------------------
[ERROR]   mvn  -rf :hazelcast
--------------------------

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Source: Internal PR or issue was opened by an employee Team: Core Type: Defect
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ICMP block test started to report data loss.
7 participants