ClusterShardRegion unable to start 'Shard 66 not allocated' #6407

Closed · wesselkranenborg opened this issue Feb 15, 2023 · 13 comments

@wesselkranenborg (Contributor)

Version Information
Version of Akka.NET? 1.4.49
Which Akka.NET Modules? Akka.Cluster.Sharding

Describe the bug
When I deploy a new version of our application, sometimes a shard is randomly unable to start up. The error we get is this:

```
[23/02/15-12:55:39.7299][akka.tcp://system-state-service@10.0.8.100:12552/system/sharding/heartbeat-shardCoordinator/singleton/coordinator][akka://system-state-service/system/sharding/heartbeat-shardCoordinator/singleton/coordinator][0039]: Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeDeallocated] with sequence number [1049] for persistenceId [/system/sharding/heartbeat-shardCoordinator/singleton/coordinator]
```

[image: screenshot of the exception details]

To Reproduce
If we clear our journal/snapshot table, the shard is able to start up again. But if we deploy another version of the application (with the same codebase), it might work, or a random other shard might hit the same issue.

Expected behavior
When an error happens while replaying shard state (from Akka.Persistence), the shard should always be able to start up.

Actual behavior
The shard never starts up.

Sidenote: is this issue mitigated in 1.5.0 by splitting the data for remember-entities and the coordinator? If so, I could check whether this bug still hits us in 1.5.0.

@wesselkranenborg changed the title from "ClusterShardRegion of 1.5 Alpha version unable to restart" to "ClusterShardRegion unable to start 'Shard 66 not allocated'" on Feb 15, 2023
@wesselkranenborg (Contributor, Author) commented Feb 15, 2023

After digging a bit deeper into the logs, I also found this:

```
[23/02/15-15:11:23.4677][akka.tcp://system-state-service@10.0.8.113:12552/system/sharding/heartbeat-shard/66][akka://system-state-service/system/sharding/heartbeat-shard/66][0021]: Acquiring lease LeaseSettings(system-state-service-shard-heartbeat-shard-66, system-state-service@10.0.8.113:12552, TimeoutSettings(00:00:12, 00:02:00, 00:00:15))
[23/02/15-15:11:23.4679][akka.tcp://system-state-service@10.0.8.113:12552/system/sharding/heartbeat-shard/66][akka://system-state-service/system/sharding/heartbeat-shard/66][0020]: Failed to get lease for shard type [heartbeat-shard] id [66]. Retry in 00:00:05
```

Might that be related? I don't see a connection between the lease and restoring shard-region persistence state, but that might be my limited knowledge.
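For context, here's roughly where that lease comes from in configuration - a minimal sketch, assuming the standard sharding keys and the Kubernetes lease provider as an example (not necessarily our exact setup):

```csharp
// Hedged sketch, not our exact config: sharding acquires a per-shard lease when
// akka.cluster.sharding.use-lease points at a lease provider's config path.
// lease-retry-interval = 5s lines up with the "Retry in 00:00:05" log above.
using Akka.Configuration;

var leaseConfig = ConfigurationFactory.ParseString(@"
    akka.cluster.sharding {
        use-lease = ""akka.coordination.lease.kubernetes""  # example provider path
        lease-retry-interval = 5s
    }");
```

Disabling the lease (as tried below) amounts to clearing use-lease again.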

@wesselkranenborg (Contributor, Author) commented Feb 15, 2023

I have now disabled Akka.Coordination.Lease on our shard regions and I keep getting the same error (now on shard 127; which shard is affected is completely random).

If I clean the Akka.Persistence journal entries with the following prefixes, the shards start up again: "$sharding", "$system$sharding".
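For anyone doing this cleanup by hand against Azure Table Storage (we're on Akka.Persistence.Azure), a rough sketch with Azure.Data.Tables - the connection string and table name are placeholders, and this wipes shard coordinator state, so use with care:

```csharp
// Rough sketch, not a supported tool: deletes journal rows whose PartitionKey
// starts with the sharding prefixes. "AkkaJournal" is a placeholder table name.
using Azure.Data.Tables;

var client = new TableClient("<connection-string>", "AkkaJournal");
foreach (var prefix in new[] { "$sharding", "$system$sharding" })
{
    // Range query approximating "starts with": prefix <= PartitionKey < prefix + '~'
    var filter = $"PartitionKey ge '{prefix}' and PartitionKey lt '{prefix}~'";
    await foreach (TableEntity entity in client.QueryAsync<TableEntity>(filter))
        await client.DeleteEntityAsync(entity.PartitionKey, entity.RowKey);
}
```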

If I then restart a node, it might happen again, but it might also just work.

Is it maybe related to this one? #5604 looks familiar, but that should have been fixed in 1.4.39 and we're using 1.4.49.

@Aaronontheweb (Member)

@wesselkranenborg I think this issue is likely best fixed by Akka.NET v1.5, where the entire persistence / storage engine has been rewritten in order to be more scalable.

@wesselkranenborg (Contributor, Author) commented Feb 16, 2023

@Aaronontheweb: I know that's the case, but I was hoping we could fix it, as we currently have this issue in our production cluster too. We see this happening since we upgraded the Akka.Hosting and Akka.Management packages to the 1.0.0 GA versions (and of course we upgraded the Akka.NET package with them); we ran for months without this issue, and now we hit it on almost every deployment.

Is switching to StateStoreMode = StateStoreMode.DData an option to work around this? To me it looks like a persistence replay issue and corrupted state.
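For concreteness, the switch in question - a minimal sketch showing both the Akka.Hosting ShardOptions property and the raw HOCON equivalent (key names per the Akka.NET docs; illustrative, not our exact setup):

```csharp
// Sketch only: move the shard coordinator's state off Akka.Persistence and
// into Distributed Data, shown two ways.
using Akka.Cluster.Hosting;   // ShardOptions
using Akka.Cluster.Sharding;  // StateStoreMode
using Akka.Configuration;

// Via Akka.Hosting:
var shardOptions = new ShardOptions { StateStoreMode = StateStoreMode.DData };

// HOCON equivalent when configuring sharding directly:
var hocon = ConfigurationFactory.ParseString(
    "akka.cluster.sharding.state-store-mode = ddata");
```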

Or is there anything else we can do until Akka.NET 1.5 is released (saw that the plan is to do that within a few weeks 🥳🎉)? My suspicion/gut feeling is that Akka.Coordination.Lease on the shard-region exacerbates the issue.

@Aaronontheweb (Member)

> We see this happening since we upgraded the Akka.Hosting and Akka.Management packages to the 1.0.0 GA versions (and of course we upgraded the Akka.NET package with them); we ran for months without this issue, and now we hit it on almost every deployment.

I'd be surprised if those two issues are correlated - the Akka.Hosting bits don't really touch the internals of sharding much, since it's all still the same Akka.NET v1.4* code under the surface.

> Is switching to StateStoreMode = StateStoreMode.DData an option to work around this? To me it looks like a persistence replay issue and corrupted state.

If you're not using remember-entities, then yes, that's an effective work-around for this issue. Doing R-E with DData is a bit of a non-starter in 1.4 due to how non-performant the underlying LMDB storage is - in v1.5 we split the workload up so shard allocation can be tracked via DData and R-E can be tracked separately via Akka.Persistence (best of both worlds.)
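For reference, a sketch of what that v1.5 split looks like in configuration - HOCON keys per the Akka.NET v1.5 docs; treat the exact values as an assumption for your setup:

```csharp
// Sketch of the v1.5 split: allocation state via DData, remember-entities
// via a separate event-sourced store backed by Akka.Persistence.
using Akka.Configuration;

var v15Config = ConfigurationFactory.ParseString(@"
    akka.cluster.sharding {
        state-store-mode = ddata                # shard allocation tracked in Distributed Data
        remember-entities-store = eventsourced  # R-E tracked via Akka.Persistence
    }");
```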

> Or is there anything else we can do until Akka.NET 1.5 is released (saw that the plan is to do that within a few weeks 🥳🎉)? My suspicion/gut feeling is that Akka.Coordination.Lease on the shard-region exacerbates the issue.

Due on Feb 28th.

@Aaronontheweb (Member)

As a work-around for persistence we have https://github.com/petabridge/Akka.Cluster.Sharding.RepairTool, but I'll need our CI to get updated for it so we can pull in the latest Akka.NET v1.4.

@wesselkranenborg (Contributor, Author) commented Feb 16, 2023

> As a work-around for persistence we have https://github.com/petabridge/Akka.Cluster.Sharding.RepairTool, but I'll need our CI to get updated for it so we can pull in the latest Akka.NET v1.4.

I do know that one, but then we're also hitting this bug: petabridge/Akka.Persistence.Azure#130. For now we manually delete the journal/snapshot data when it occurs.

@Aaronontheweb (Member)

Ugh. Yeah that's a problem still.

@wesselkranenborg (Contributor, Author)

But it happens quite often after a deployment, so we have to be very careful after each one. And is our remember-entities state also cleared when we clean all the /system/sharding records from the snapshot/journal?

@wesselkranenborg (Contributor, Author)

> I'd be surprised if those two issues are correlated - the Akka.Hosting bits don't really touch the internals of sharding much [...]
>
> If you're not using remember-entities, then yes, that's an effective work-around for this issue. [...] in v1.5 we split the workload up so shard allocation can be tracked via DData and R-E can be tracked separately via Akka.Persistence (best of both worlds.)
>
> Due on Feb 28th.

I do know about the 1.5 changes - they would help us a lot! I'll try 1.5 in our INT environment to see if that indeed stabilises our cluster after deployments.

@Aaronontheweb (Member)

> And is our remember-entities state also cleared when we clean all the /system/sharding records from the snapshot/journal?

Yes it is, as it's all part of the same data set.

@wesselkranenborg (Contributor, Author)

Thanks, good to know that. This might have some undesired side effects, but it's the best we can do until 1.5, I guess.

@wesselkranenborg (Contributor, Author)

After the upgrade to 1.5 we never faced this issue again, so I will close this; upgrading to 1.5 seems to be the solution for these errors.
