Sporadic failures of Watch/Ask with remote actors (e.g. Singleton) #7075

SchiessMax · 2024-01-29T13:21:35Z

Version Information
Version of Akka.NET?
Which Akka.NET Modules?

"Akka.DependencyInjection" Version="1.5.15"
"Akka.Cluster" Version="1.5.15"
"Akka.Cluster.Sharding" Version="1.5.15"
"Akka.Cluster.Hosting" Version="1.5.15"
"Akka.Hosting" Version="1.5.15"

Describe the bug
We use a cluster singleton as a directory service where other actors can register themselves. These subscribers are created when a websocket connects to our microservice, which runs in a Kubernetes cluster. The directory service makes it easy for us to access the actors without knowing on which node the websocket was opened (we access them by an ID).
Of course the singleton can also terminate, e.g. when the entire node dies. In our example we have three nodes, where the subscriber actors are randomly distributed and the singleton also lives on one of these nodes (as far as I know, singletons live on the oldest node).
To avoid "zombie" actors we watch the singleton in our subscribers and resubscribe on the termination event.
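
A minimal sketch of this watch-and-resubscribe pattern, not the code from our service. The message types `Subscribe`, `Subscribed` and `DirectoryLost` and the class `SubscriberActor` are hypothetical names chosen for illustration; `directoryProxy` is assumed to be the singleton proxy on the local node.

```csharp
using System;
using Akka.Actor;

// Hypothetical messages, not part of Akka.NET.
public sealed record Subscribe(string Id, IActorRef Subscriber);
public sealed record Subscribed(IActorRef Directory);

public sealed class DirectoryLost
{
    public static readonly DirectoryLost Instance = new DirectoryLost();
    private DirectoryLost() { }
}

public sealed class SubscriberActor : ReceiveActor
{
    private readonly string _id;
    private readonly IActorRef _directoryProxy;

    public SubscriberActor(string id, IActorRef directoryProxy)
    {
        _id = id;
        _directoryProxy = directoryProxy;

        // The reply carries the singleton's own reference (its Self),
        // which is a remote reference if the singleton runs on another node.
        Receive<Subscribed>(msg =>
            Context.WatchWith(msg.Directory, DirectoryLost.Instance));

        // Delivered instead of Terminated when the watched instance stops.
        Receive<DirectoryLost>(_ => SendSubscribe());
    }

    protected override void PreStart() => SendSubscribe();

    private void SendSubscribe() =>
        _directoryProxy.Tell(new Subscribe(_id, Self));
}
```

The sketch watches the reference returned in the reply rather than the proxy, because the proxy itself stays alive when the singleton moves to another node.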
Now the problem is the following: occasionally there are still zombie actors that are not known to the directory service. I added logging to the subscriber and directory actors and restarted the microservice instances many times (deleting the oldest service, or deleting all services so they are recreated; we use Kubernetes, so this is easy to do). The logging makes it pretty clear that the zombies have one of the following causes. Sometimes the Ask for the subscription response returns neither a subscription response nor a timeout exception. I also noticed that sometimes there is no Terminated message, even though the subscribe was successful and the actor started watching the singleton.

Is there something that we are doing wrong with the cluster singleton? Shouldn't there always be a Terminated message, even if there is a network failure, or one of the nodes leaves the cluster for whatever reason and terminates?
Also, I am really surprised by the behavior of Ask. I assumed that there should always be an exception after the timeout, no matter if the target actor of the ask is a local or remote actor and no matter if it is alive, dead, or it dies while the ask is still waiting for a response.
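
For reference, this is roughly the Ask behavior I was relying on, as a sketch with an explicit timeout. `Subscribe`, `Subscribed` and `DirectoryClient.TrySubscribeAsync` are hypothetical names; `Ask<T>` with a `TimeSpan` and `AskTimeoutException` are the Akka.NET pieces being exercised.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;
using Akka.Actor;

// Hypothetical messages, not part of Akka.NET.
public sealed record Subscribe(string Id, IActorRef Subscriber);
public sealed record Subscribed(IActorRef Directory);

public static class DirectoryClient
{
    // Returns the reply, or null if no reply arrived within the timeout.
    public static async Task<Subscribed?> TrySubscribeAsync(
        IActorRef directoryProxy, string id, IActorRef subscriber, TimeSpan timeout)
    {
        try
        {
            // Passing the timeout explicitly avoids depending on the
            // configured default ask timeout.
            return await directoryProxy.Ask<Subscribed>(
                new Subscribe(id, subscriber), timeout);
        }
        catch (AskTimeoutException)
        {
            // Expected when the target is dead, unreachable, or never replies.
            return null;
        }
    }
}
```

With this shape, a missing reply should surface as `null` after the timeout regardless of whether the target is local or remote.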

Help would be really appreciated. Maybe we just don't use the singleton correctly. Our current workaround is to resend the subscribe message periodically, but it looks wrong when a tool like "Watch" exists.
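
The periodic-resend workaround looks roughly like the sketch below. It assumes the directory treats a repeated `Subscribe` for the same ID as idempotent; `Subscribe`, `ResubscribeTick` and `ResubscribingActor` are hypothetical names, and the interval is passed in by the caller.

```csharp
using System;
using Akka.Actor;

// Hypothetical messages, not part of Akka.NET.
public sealed record Subscribe(string Id, IActorRef Subscriber);

public sealed class ResubscribeTick
{
    public static readonly ResubscribeTick Instance = new ResubscribeTick();
    private ResubscribeTick() { }
}

public sealed class ResubscribingActor : ReceiveActor, IWithTimers
{
    private readonly string _id;
    private readonly IActorRef _directoryProxy;
    private readonly TimeSpan _interval;

    // Set by Akka.NET before PreStart because the actor implements IWithTimers.
    public ITimerScheduler Timers { get; set; } = null!;

    public ResubscribingActor(string id, IActorRef directoryProxy, TimeSpan interval)
    {
        _id = id;
        _directoryProxy = directoryProxy;
        _interval = interval;

        Receive<ResubscribeTick>(_ => SendSubscribe());
    }

    protected override void PreStart()
    {
        SendSubscribe();
        // The timer is cancelled automatically when the actor stops.
        Timers.StartPeriodicTimer("resubscribe", ResubscribeTick.Instance, _interval);
    }

    private void SendSubscribe() =>
        _directoryProxy.Tell(new Subscribe(_id, Self));
}
```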

To Reproduce
Steps to reproduce the behavior:

1. Create three Akka nodes that form a cluster and configure a default ask timeout.
2. Create a singleton that can live on any of the nodes, and a proxy for it on all nodes.
3. The singleton should allow other actors to subscribe and reply with a message that contains its "Self" reference, which is a remote actor reference if the subscriber lives on one of the other nodes.
4. The subscriber should watch the returned actor reference of the singleton with a custom message.
5. The subscriber should also use the ask pattern for the subscribe message to the singleton.
6. Repeatedly terminate nodes.
7. Occasionally there is no termination message to the subscriber when the singleton dies.
8. Occasionally ask on the remote actor returns neither a value nor a timeout exception.
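
The first step above mentions a default ask timeout. As far as I know, Akka.NET's reference configuration leaves `akka.actor.ask-timeout` at `infinite`, so an `Ask` call without an explicit timeout argument can wait forever unless the setting is overridden. A config fragment along these lines would set it (the `5s` value is only an example, not the value from our setup):

```hocon
akka {
  actor {
    # Default timeout used by Ask when no timeout argument is passed.
    ask-timeout = 5s
  }
}
```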

Links to working reproductions on Github / Gitlab are very much appreciated

Expected behavior
I expected that Watch and Ask always return messages, even with remote actors.

Actual behavior
There seem to be some corner cases where Watch and Ask don't return messages (or in case of Ask a Timeout exception) for remote actors.

Environment
Are you running on Linux? Windows? Docker? Which version of .NET?

Kubernetes with Linux Docker images and the .NET 7.0 runtime

SchiessMax · 2024-05-07T11:54:02Z

AkkaRemoteSampleProject.zip
I tried to reproduce the behavior with a simple project that is similar to our production setup.
Whatever I tried, everything behaved correctly. I used the newest version, 1.5.20, but even with 1.5.15 it seemed to work.
We reworked our code a few times, so it could also have been an implementation error on our part.
I will close this issue as I can't reproduce it anymore. I'll attach the project in case anyone wants to take a look at it.

Aaronontheweb added akka-remote akka-actor potential bug labels Jan 31, 2024

SchiessMax closed this as completed May 7, 2024
