ClusterShardRegion unable to start 'Shard 66 not allocated' #6407
Comments
After digging a bit deeper in the logs, I also found this log:
Might that be related? I don't see a relation between the lease and restoring the shard region's persistence state, but that might be my limited knowledge.
I have now disabled Akka.Coordination.Lease on our shard regions and I keep getting the same error (now only on Shard 127; which shard it hits is completely random). If I clean the Akka.Persistence journal entries with the following prefixes, the shards start up again: "$sharding", "$system$sharding". If I then restart a node it might happen again, but it might also just work. Is it maybe related to this one? #5604 It looks familiar, but that should be fixed in 1.4.39 and we're using 1.4.49.
@wesselkranenborg I think this issue is likely best fixed by Akka.NET v1.5, where the entire persistence / storage engine has been rewritten in order to be more scalable.
@Aaronontheweb: I know that is indeed the case, but I was hoping we could fix it, as we currently have this issue in our production cluster as well. We have seen this happening since we upgraded the Akka.Hosting and Akka.Management packages to the 1.0.0 GA versions (and of course we upgraded the Akka.NET package with them). We had been running for months without this issue, and now we hit it on almost every deployment. Is switching to Or is there anything else we can do until Akka.NET 1.5 is released (I saw that the plan is to do that within a few weeks 🥳🎉)? My suspicion/gut feeling is that Akka.Coordination.Lease on the shard region exacerbates the issue.
I'd be surprised if those two issues are correlated - the Akka.Hosting bits don't really touch the internals of sharding much, since it's all still the same Akka.NET v1.4* code under the surface.
If you're not using remember-entities, then yes that's an effective work-around for this issue. Doing R-E with DData is a bit of a non-starter in 1.4 due to how non-performant the underlying LMDB storage is - in v1.5 we split the workload up so shard allocation can be tracked via DData and R-E can be tracked separately via Akka.Persistence (best of both worlds.)
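For readers following along, the work-around described above might look roughly like the HOCON fragment below. This is a minimal sketch, assuming the standard `akka.cluster.sharding` settings in Akka.NET v1.4 and that remember-entities is not needed; verify the keys against the reference configuration of the version you run.

```hocon
akka.cluster.sharding {
  # Assumption: track shard allocation via DData instead of Akka.Persistence,
  # so the coordinator no longer replays its state from the journal.
  state-store-mode = ddata

  # Per the comment above, this is only an effective work-around in 1.4
  # when remember-entities is not in use.
  remember-entities = off
}
```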
Due on Feb 28th.
As a work-around for persistence we have https://github.com/petabridge/Akka.Cluster.Sharding.RepairTool but I'll need our CI to get updated for it so we can pull in the latest Akka.NET v1.4
I know this one indeed but then this bug is also what we are hitting: petabridge/Akka.Persistence.Azure#130. We now manually delete the journal/snapshot data when it's occurring.
Ugh. Yeah that's a problem still.
But it happens quite often after a deployment, so we have to be very careful after each one. And is our remember-entities data also cleared when we clean all the /system/sharding records from the snapshot/journal?
I do indeed know about the 1.5 changes. That would help us a lot! I'll try 1.5 in our INT environment to see if it stabilises our cluster after deployments.
Yes it is, as it's all part of the same data set.
Thanks, good to know that. This might have some undesired side effects but is the best we can do until 1.5 I guess.
After the upgrade to 1.5 we never faced this issue again, so I will close this, as upgrading to 1.5 seems to be the solution for these errors.
Version Information
Version of Akka.NET? 1.4.49
Which Akka.NET Modules? Akka.Cluster.Sharding
Describe the bug
When I deploy a new version of our application, a random shard is sometimes unable to start up. The error we get is this:
To Reproduce
If we clear our journal/snapshot table, the shard is able to start up again. But if we deploy another version of the application (with the same codebase), it might work, or a different random shard might hit the same issue.
Expected behavior
When an error happens while replaying shard state (from Akka.Persistence), the shard should always be able to start up.
Actual behavior
The shard never starts up.
Sidenote: is this issue mitigated in 1.5.0 by splitting the data for remember_entities and coordinator? If so, I could try if this bug is still hitting us in 1.5.0.