Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Offload replica sync to async executor #20813

Merged

Conversation

vbekiaris
Copy link
Collaborator

@vbekiaris vbekiaris commented Feb 22, 2022

PartitionReplicaSyncRequestOffloadable would block the priority
generic op thread while waiting for merkle tree comparison to occur,
leading to deadlocks.

Also restores node shutdown sequence of persistence engine as it was prior to c284b61

@vbekiaris
Copy link
Collaborator Author

run-ee-tests

@hz-devops-test
Copy link

The job Hazelcast-pr-builder-ee-tests of your PR failed. (Hazelcast internal details: build log, artifacts).
Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log file
--------------------------
---------SUMMARY----------
--------------------------
[ERROR] Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=2048m; support was removed in 8.0
--------------------------
--------------------------
-------TEST FAILURE-------
--------------------------
[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   SerializedObjectsCompatibilityTest.testObjectsAreDeserializedInCurrentVersion_whenEESerializationService:122->assertObjectsAreDeserialized:157 Failed to deserialize classes:
com.hazelcast.config.MapConfig
com.hazelcast.internal.partition.operation.PartitionReplicaSyncResponse
com.hazelcast.internal.partition.operation.MigrationOperation
com.hazelcast.jet.impl.execution.init.ExecutionPlan
com.hazelcast.internal.partition.operation.MigrationRequestOperation
com.hazelcast.jet.core.DAG

[ERROR] SerializedObjectsCompatibilityTest.testObjectsAreDeserializedInCurrentVersion_whenOSSerializationService:100->assertObjectsAreDeserialized:157 Failed to deserialize classes:
com.hazelcast.config.MapConfig
com.hazelcast.internal.dynamicconfig.DynamicConfigPreJoinOperation
com.hazelcast.internal.partition.operation.PartitionReplicaSyncResponse
com.hazelcast.internal.dynamicconfig.AddDynamicConfigOperation
com.hazelcast.internal.partition.operation.MigrationOperation
com.hazelcast.internal.partition.operation.MigrationRequestOperation

[ERROR] Errors:
[ERROR] JobCancelledAfterUpgradeTest.when_jobSubmittedTo5_0_then_cancelledAfterUpgradeTo5_1:23->when_jobSubmittedTo5_0_then_cancelledAfterUpgradeTo5_1:23 ? VersionMismatch
[INFO]
[ERROR] Tests run: 10114, Failures: 2, Errors: 1, Skipped: 73
[INFO]

[ERROR] There are test failures.

@vbekiaris vbekiaris marked this pull request as ready for review February 24, 2022 15:50
@vbekiaris vbekiaris changed the title [WIP] Offload replica sync to async executor Offload replica sync to async executor Feb 24, 2022
@vbekiaris
Copy link
Collaborator Author

(1) https://github.com/hazelcast/hazelcast-enterprise/pull/4815 includes a test for the priority thread blocking issue.
(2) Wrong sequence of service shutdown results in ConcurrentConveyorExceptions being repeatedly thrown while gracefully shutting down, as migrations attempt to clear map record stores while persistence engine is already shut down. hzCmd test http://jenkins.hazelcast.com/view/hot-restart/job/hot-x1/ (green runs 34 through 36 are executed on this branch).

Sample stack trace of failing migration due to persistence engine being shut down:

com.hazelcast.internal.util.concurrent.ConcurrentConveyorException: Queue drainer has already left
        at com.hazelcast.internal.util.concurrent.ConcurrentConveyor.checkDrainerGone(ConcurrentConveyor.java:367)
        at com.hazelcast.internal.hotrestart.impl.ConcurrentHotRestartStore.submitAndProceedWhenAllowed(ConcurrentHotRestartStore.java:147)
        at com.hazelcast.internal.hotrestart.impl.ConcurrentHotRestartStore.clear(ConcurrentHotRestartStore.java:92)
        at com.hazelcast.map.impl.recordstore.HotRestartHDStorageImpl.clear(HotRestartHDStorageImpl.java:84)
        at com.hazelcast.map.impl.recordstore.DefaultRecordStore.clearStorage(DefaultRecordStore.java:1465)
        at com.hazelcast.map.impl.recordstore.DefaultRecordStore.destroyStorageAfterClear(DefaultRecordStore.java:1456)
        at com.hazelcast.map.impl.recordstore.DefaultRecordStore.clearPartition(DefaultRecordStore.java:1426)
        at com.hazelcast.map.impl.MapServiceContextImpl.removeRecordStoresFromPartitionMatchingWith(MapServiceContextImpl.java:352)
        at com.hazelcast.map.impl.MapMigrationAwareService.removeRecordStoresHavingLesserBackupCountThan(MapMigrationAwareService.java:294)
        at com.hazelcast.map.impl.MapMigrationAwareService.commitMigration(MapMigrationAwareService.java:234)
        at com.hazelcast.map.impl.EnterpriseMapMigrationAwareService.commitMigration(EnterpriseMapMigrationAwareService.java:32)
        at com.hazelcast.spi.impl.CountingMigrationAwareService.commitMigration(CountingMigrationAwareService.java:86)
        at com.hazelcast.map.impl.MapService.commitMigration(MapService.java:179)
        at com.hazelcast.internal.partition.operation.FinalizeMigrationOperation.finishMigration(FinalizeMigrationOperation.java:225)

PartitionReplicaSyncRequestOffloadable would block the priority
generic op thread while waiting for merkle tree comparison to occur,
leading to deadlocks.
NodeExtension#shutdown should be called after graceful-shutdown-aware
services are already shutdown. Otherwise persistence is shut down
before data services, resulting in exceptions during migrations
@vbekiaris vbekiaris force-pushed the experiments/5.1.z/offload-to-async branch from 41afa65 to a7e983d Compare March 2, 2022 13:34
Copy link
Contributor

@ufukyilmaz ufukyilmaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for the fix.

@vbekiaris vbekiaris merged commit d48db96 into hazelcast:5.1.z Mar 3, 2022
@vbekiaris
Copy link
Collaborator Author

kudos @arodionov for the finding & reproducer.
thanks @ahmetmircik @ufukyilmaz for the reviews.
I will prepare back/fwd-ports for 5.0.z & main branch

@vbekiaris vbekiaris deleted the experiments/5.1.z/offload-to-async branch March 3, 2022 07:02
@vbekiaris vbekiaris added this to the 5.1.1 milestone Mar 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants