ClusterShardRegion unable to start 'Shard 66 not allocated' #6407

Closed · wesselkranenborg opened this issue Feb 15, 2023 · 13 comments

@wesselkranenborg (Contributor)

Version Information
Version of Akka.NET? 1.4.49
Which Akka.NET Modules? Akka.Cluster.Sharding

Describe the bug
When I deploy a new version of our application, sometimes a shard is randomly unable to start up. The error we get is this:

```
[23/02/15-12:55:39.7299][akka.tcp://system-state-service@10.0.8.100:12552/system/sharding/heartbeat-shardCoordinator/singleton/coordinator][akka://system-state-service/system/sharding/heartbeat-shardCoordinator/singleton/coordinator][0039]: Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeDeallocated] with sequence number [1049] for persistenceId [/system/sharding/heartbeat-shardCoordinator/singleton/coordinator]
```

[image: screenshot of the exception details]

To Reproduce
If we clear our journal/snapshot table, the shard is able to start up again. But if we deploy another version of the application (with the same codebase), it might work, or a random other shard might hit the same issue.

Expected behavior
When an error happens while replaying shard state (from Akka.Persistence), the shard should always be able to start up.

Actual behavior
The shard never starts up.

Sidenote: is this issue mitigated in 1.5.0 by splitting the data for remember-entities and the coordinator? If so, I could check whether this bug still hits us in 1.5.0.

@wesselkranenborg changed the title from "ClusterShardRegion of 1.5 Alpha version unable to restart" to "ClusterShardRegion unable to start 'Shard 66 not allocated'" on Feb 15, 2023
@wesselkranenborg (Contributor, Author) commented Feb 15, 2023

After digging a bit deeper into the logs, I also found this:

```
[23/02/15-15:11:23.4677][akka.tcp://system-state-service@10.0.8.113:12552/system/sharding/heartbeat-shard/66][akka://system-state-service/system/sharding/heartbeat-shard/66][0021]: Acquiring lease LeaseSettings(system-state-service-shard-heartbeat-shard-66, system-state-service@10.0.8.113:12552, TimeoutSettings(00:00:12, 00:02:00, 00:00:15))
[23/02/15-15:11:23.4679][akka.tcp://system-state-service@10.0.8.113:12552/system/sharding/heartbeat-shard/66][akka://system-state-service/system/sharding/heartbeat-shard/66][0020]: Failed to get lease for shard type [heartbeat-shard] id [66]. Retry in 00:00:05
```

Might that be related? I don't see a connection between the lease and restoring shard-region persistence state, but that might be my limited knowledge.
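For context, here's roughly where that lease comes from in configuration - a minimal sketch, assuming the standard sharding keys and the Kubernetes lease provider as an example (not necessarily our exact setup):

```csharp
// Hedged sketch, not our exact config: sharding acquires a per-shard lease when
// akka.cluster.sharding.use-lease points at a lease provider's config path.
// lease-retry-interval = 5s lines up with the "Retry in 00:00:05" log above.
using Akka.Configuration;

var leaseConfig = ConfigurationFactory.ParseString(@"
    akka.cluster.sharding {
        use-lease = ""akka.coordination.lease.kubernetes""  # example provider path
        lease-retry-interval = 5s
    }");
```

Disabling the lease (as tried below) amounts to clearing use-lease again.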

@wesselkranenborg (Contributor, Author) commented Feb 15, 2023

I have now disabled Akka.Coordination.Lease on our shard regions and I keep getting the same error (now on shard 127; which shard is affected is completely random).

If I clean the Akka.Persistence journal entries with the following prefixes, the shards start up again: "$sharding", "$system$sharding".
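For anyone doing this cleanup by hand against Azure Table Storage (we're on Akka.Persistence.Azure), a rough sketch with Azure.Data.Tables - the connection string and table name are placeholders, and this wipes shard coordinator state, so use with care:

```csharp
// Rough sketch, not a supported tool: deletes journal rows whose PartitionKey
// starts with the sharding prefixes. "AkkaJournal" is a placeholder table name.
using Azure.Data.Tables;

var client = new TableClient("<connection-string>", "AkkaJournal");
foreach (var prefix in new[] { "$sharding", "$system$sharding" })
{
    // Range query approximating "starts with": prefix <= PartitionKey < prefix + '~'
    var filter = $"PartitionKey ge '{prefix}' and PartitionKey lt '{prefix}~'";
    await foreach (TableEntity entity in client.QueryAsync<TableEntity>(filter))
        await client.DeleteEntityAsync(entity.PartitionKey, entity.RowKey);
}
```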

If I then restart a node, it might happen again, but it might also just work.

Is it maybe related to this one? #5604 looks familiar, but that should have been fixed in 1.4.39 and we're using 1.4.49.

@Aaronontheweb (Member)

@wesselkranenborg I think this issue is likely best fixed by Akka.NET v1.5, where the entire persistence / storage engine has been rewritten in order to be more scalable.

@wesselkranenborg (Contributor, Author) commented Feb 16, 2023

@Aaronontheweb: I know that's the case, but I was hoping we could fix it, as we currently have this issue in our production cluster too. We see this happening since we upgraded the Akka.Hosting and Akka.Management packages to the 1.0.0 GA versions (and of course we upgraded the Akka.NET package with them); we ran for months without this issue, and now we hit it on almost every deployment.

Is switching to StateStoreMode = StateStoreMode.DData an option to work around this? To me it looks like a persistence replay issue and corrupted state.
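For concreteness, the switch in question - a minimal sketch showing both the Akka.Hosting ShardOptions property and the raw HOCON equivalent (key names per the Akka.NET docs; illustrative, not our exact setup):

```csharp
// Sketch only: move the shard coordinator's state off Akka.Persistence and
// into Distributed Data, shown two ways.
using Akka.Cluster.Hosting;   // ShardOptions
using Akka.Cluster.Sharding;  // StateStoreMode
using Akka.Configuration;

// Via Akka.Hosting:
var shardOptions = new ShardOptions { StateStoreMode = StateStoreMode.DData };

// HOCON equivalent when configuring sharding directly:
var hocon = ConfigurationFactory.ParseString(
    "akka.cluster.sharding.state-store-mode = ddata");
```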

Or is there anything else we can do until Akka.NET 1.5 is released (saw that the plan is to do that within a few weeks 🥳🎉)? My suspicion/gut feeling is that Akka.Coordination.Lease on the shard-region exacerbates the issue.

@Aaronontheweb (Member)

> We see this happening since we upgraded the Akka.Hosting and Akka.Management packages to the 1.0.0 GA versions (and of course we upgraded the Akka.NET package with them); we ran for months without this issue, and now we hit it on almost every deployment.

I'd be surprised if those two issues are correlated - the Akka.Hosting bits don't really touch the internals of sharding much, since it's all still the same Akka.NET v1.4* code under the surface.

> Is switching to StateStoreMode = StateStoreMode.DData an option to work around this? To me it looks like a persistence replay issue and corrupted state.

If you're not using remember-entities, then yes, that's an effective work-around for this issue. Doing R-E with DData is a bit of a non-starter in 1.4 due to how non-performant the underlying LMDB storage is - in v1.5 we split the workload up so shard allocation can be tracked via DData and R-E can be tracked separately via Akka.Persistence (best of both worlds.)
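For reference, a sketch of what that v1.5 split looks like in configuration - HOCON keys per the Akka.NET v1.5 docs; treat the exact values as an assumption for your setup:

```csharp
// Sketch of the v1.5 split: allocation state via DData, remember-entities
// via a separate event-sourced store backed by Akka.Persistence.
using Akka.Configuration;

var v15Config = ConfigurationFactory.ParseString(@"
    akka.cluster.sharding {
        state-store-mode = ddata                # shard allocation tracked in Distributed Data
        remember-entities-store = eventsourced  # R-E tracked via Akka.Persistence
    }");
```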

> Or is there anything else we can do until Akka.NET 1.5 is released (saw that the plan is to do that within a few weeks 🥳🎉)? My suspicion/gut feeling is that Akka.Coordination.Lease on the shard-region exacerbates the issue.

Due on Feb 28th.

@Aaronontheweb (Member)

As a work-around for persistence we have https://github.com/petabridge/Akka.Cluster.Sharding.RepairTool, but I'll need our CI to get updated for it so we can pull in the latest Akka.NET v1.4.

@wesselkranenborg (Contributor, Author) commented Feb 16, 2023

> As a work-around for persistence we have https://github.com/petabridge/Akka.Cluster.Sharding.RepairTool, but I'll need our CI to get updated for it so we can pull in the latest Akka.NET v1.4.

I do know that one, but then we're also hitting this bug: petabridge/Akka.Persistence.Azure#130. For now we manually delete the journal/snapshot data when it occurs.

@Aaronontheweb (Member)

Ugh. Yeah that's a problem still.

@wesselkranenborg (Contributor, Author)

But it happens quite often after a deployment, so we have to be very careful after each one. And is our remember-entities state also cleared when we clean all the /system/sharding records from the snapshot/journal?

@wesselkranenborg (Contributor, Author)

> I'd be surprised if those two issues are correlated - the Akka.Hosting bits don't really touch the internals of sharding much [...]
>
> If you're not using remember-entities, then yes, that's an effective work-around for this issue. [...] in v1.5 we split the workload up so shard allocation can be tracked via DData and R-E can be tracked separately via Akka.Persistence (best of both worlds.)
>
> Due on Feb 28th.

I do know about the 1.5 changes - they would help us a lot! I'll try 1.5 in our INT environment to see if that indeed stabilises our cluster after deployments.

@Aaronontheweb (Member)

> And is our remember-entities state also cleared when we clean all the /system/sharding records from the snapshot/journal?

Yes it is, as it's all part of the same data set.

@wesselkranenborg (Contributor, Author)

Thanks, good to know that. This might have some undesired side effects, but it's the best we can do until 1.5, I guess.

@wesselkranenborg (Contributor, Author)

After the upgrade to 1.5 we never faced this issue again, so I will close this; upgrading to 1.5 seems to be the solution for these errors.
