Restore shutdown sequence & offload replica sync #20883
PartitionReplicaSyncRequestOffloadable would block the priority generic op thread while waiting for the merkle tree comparison to occur, leading to deadlocks.

NodeExtension#shutdown should be called after graceful-shutdown-aware services have already been shut down. Otherwise persistence is shut down before data services, resulting in exceptions during migrations.
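As a rough illustration of the shutdown ordering described above, here is a minimal sketch. All names in it (`ShutdownOrderSketch`, `Stoppable`, `shutdownNode`, the service labels) are hypothetical stand-ins and not Hazelcast classes. It only shows the intended order: data services first, the node extension (which owns persistence) last.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; not Hazelcast's real types.
final class ShutdownOrderSketch {

    interface Stoppable {
        void shutdown(List<String> log);
    }

    // Shuts down graceful-shutdown-aware services first, then the extension,
    // so persistence is still available while data services finish migrations.
    static List<String> shutdownNode(List<Stoppable> gracefulServices, Stoppable extension) {
        List<String> log = new ArrayList<>();
        for (Stoppable service : gracefulServices) {
            service.shutdown(log);
        }
        extension.shutdown(log);
        return log;
    }

    // Records the order in which the assumed components are stopped.
    static List<String> demo() {
        Stoppable mapService = log -> log.add("map-service");
        Stoppable partitionService = log -> log.add("partition-service");
        Stoppable nodeExtension = log -> log.add("node-extension");
        return shutdownNode(Arrays.asList(mapService, partitionService), nodeExtension);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```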
@vbekiaris When |
In the non-offloaded case, this merkle tree comparison runs on partition threads, and blocking them may not be as critical as blocking In the offload-enabled case, the task that was expected to offload the partition sync was running on a priority/generic thread, which is more prone to deadlock if we don't offload the main task. See that we set this offload tasks' |
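To make the offloading idea concrete, here is a minimal sketch using plain `java.util.concurrent` executors. It is not the Hazelcast operation framework: the single-threaded `priority` executor stands in for a priority generic op thread, the `offload` executor for wherever the slow comparison is moved to, and `OffloadSketch` and `priorityThreadStaysFree` are hypothetical names. The request only schedules the comparison and returns, so the priority thread can pick up other work while the comparison is still pending.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; executors stand in for Hazelcast operation threads.
final class OffloadSketch {

    // Returns true if the "priority" thread could run another task
    // while the offloaded comparison was still pending.
    static boolean priorityThreadStaysFree() {
        ExecutorService priority = Executors.newSingleThreadExecutor();
        ExecutorService offload = Executors.newSingleThreadExecutor();
        CountDownLatch comparisonMayFinish = new CountDownLatch(1);
        try {
            // The request schedules the slow comparison elsewhere and returns at once.
            Callable<CompletableFuture<String>> syncRequest = () ->
                    CompletableFuture.supplyAsync(() -> {
                        try {
                            comparisonMayFinish.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                        return "merkle-diff";
                    }, offload);
            CompletableFuture<String> comparison =
                    priority.submit(syncRequest).get(1, TimeUnit.SECONDS);

            // The comparison is still blocked, yet the priority thread is free.
            String other = priority.submit(() -> "other-op").get(1, TimeUnit.SECONDS);
            boolean free = "other-op".equals(other) && !comparison.isDone();

            comparisonMayFinish.countDown();
            return free && "merkle-diff".equals(comparison.get(1, TimeUnit.SECONDS));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            priority.shutdownNow();
            offload.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(priorityThreadStaysFree());
    }
}
```

If the request instead waited for the comparison on the `priority` thread itself, the second submission could not run until the comparison completed, which is the kind of blocking this change avoids.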
Thanks for the backports/forwardports.
What about checking ALLOW_OFFLOAD before creating migration operations, to decide which migration mechanism to follow? That way we don't fall into this known issue unexpectedly later. If ALLOW_OFFLOAD is false, the service will not use merkle tree comparison for migrations.
run-lab-run |
We already control whether we instantiate the offloadable or the non-offloadable partition replica sync response for the anti-entropy mechanism using this property: hazelcast/hazelcast/src/main/java/com/hazelcast/internal/partition/impl/PartitionReplicaManager.java, lines 249 to 251 in 750c504.
Or maybe I misunderstood your suggestion?
I meant adding a new if here: https://github.com/hazelcast/hazelcast-enterprise/blob/463f8919c379882fcd3e6578041bcd9fce1bf34e/hazelcast-enterprise/src/main/java/com/hazelcast/map/impl/EnterpriseMapMigrationAwareService.java#L85 so that when ALLOW_OFFLOAD is false, we directly call super, maybe with a log message saying that we don't use the merkle tree.
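A minimal sketch of the suggested guard, under loud assumptions: `MigrationGuardSketch`, `BaseMigrationService`, `MerkleAwareMigrationService`, `prepareReplicationOperation()` returning a string, and the `allowOffload` flag are all hypothetical and only mirror the shape of the idea, not the real EnterpriseMapMigrationAwareService code.

```java
// Hypothetical illustration of the proposed guard; not the real Hazelcast classes.
final class MigrationGuardSketch {

    static class BaseMigrationService {
        // Stand-in for the default (non-merkle) migration mechanism.
        String prepareReplicationOperation() {
            return "full-copy";
        }
    }

    static class MerkleAwareMigrationService extends BaseMigrationService {
        private final boolean allowOffload; // assumed to mirror ALLOW_OFFLOAD

        MerkleAwareMigrationService(boolean allowOffload) {
            this.allowOffload = allowOffload;
        }

        @Override
        String prepareReplicationOperation() {
            if (!allowOffload) {
                // Offloading is disabled: log and fall back to the parent behaviour.
                System.out.println("Offload disabled, not using merkle tree for migration");
                return super.prepareReplicationOperation();
            }
            return "merkle-tree-diff";
        }
    }

    public static void main(String[] args) {
        System.out.println(new MerkleAwareMigrationService(true).prepareReplicationOperation());
        System.out.println(new MerkleAwareMigrationService(false).prepareReplicationOperation());
    }
}
```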
I see, this is already covered here because we call
@vbekiaris thanks for the explanation, now I see that case is already covered.
thanks @ufukyilmaz & @ahmetmircik!
Forward-port of #20813 to main branch