Sporadic failures of Watch/Ask with remote actors (e.g. Singleton) #7075

Closed
SchiessMax opened this issue Jan 29, 2024 · 1 comment

SchiessMax commented Jan 29, 2024

Version Information
Version of Akka.NET?
Which Akka.NET Modules?

"Akka.DependencyInjection" Version="1.5.15"
"Akka.Cluster" Version="1.5.15"
"Akka.Cluster.Sharding" Version="1.5.15"
"Akka.Cluster.Hosting" Version="1.5.15"
"Akka.Hosting" Version="1.5.15"

Describe the bug
We use a cluster singleton as a directory service where other actors can register themselves. These subscribers are created when a websocket connects to our microservice, which runs in a Kubernetes cluster. The directory service makes it easy for us to reach the actors by ID without knowing on which node the websocket was opened.
Of course the singleton can also terminate, e.g. when the entire node dies. In our example we have three nodes; the subscriber actors are randomly distributed across them and the singleton lives on one of these nodes (as far as I know, singletons run on the oldest node).
To avoid "zombie" actors, we watch the singleton in our subscribers and resubscribe on the termination event.
The problem is this: occasionally there are still zombie actors that the directory service does not know about. I added logging to the subscriber and the directory actor and restarted the microservice instances many times (deleting the oldest instance, or deleting all instances so they are recreated, which is easy with Kubernetes). The logs make it fairly clear that the zombies have one of the following causes: sometimes the Ask for the subscription response returns neither a subscription response nor a timeout exception. I also noticed that sometimes there is no Terminated message, even though the subscribe was successful and the actor had started watching the singleton.

Is there something that we are doing wrong with the cluster singleton? Shouldn't there always be a Terminated message, even if there is a network failure, or if one of the nodes leaves the cluster for whatever reason and terminates?
Also, I am really surprised by the behavior of Ask. I assumed that there should always be an exception after the timeout, no matter if the target actor of the ask is a local or remote actor and no matter if it is alive, dead, or it dies while the ask is still waiting for a response.

Help would be really appreciated. Maybe we just don't use the singleton correctly. Our current workaround is to resend the subscribe message periodically, but it looks wrong when a tool like "Watch" exists.
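
For reference, the periodic-resubscribe workaround could look roughly like the sketch below. It is not our production code: the message types `Subscribe` and `ResubscribeTick`, the class name, the timer key, and the 30-second interval are all made up for illustration, and it assumes the directory treats a repeated `Subscribe` for the same ID as idempotent.

```csharp
using System;
using Akka.Actor;

// Hypothetical message types; the real ones are not shown in this issue.
public sealed record Subscribe(string Id, IActorRef Subscriber);

public sealed class ResubscribeTick
{
    public static readonly ResubscribeTick Instance = new();
    private ResubscribeTick() { }
}

public sealed class ResubscribingSubscriber : ReceiveActor, IWithTimers
{
    private readonly string _id;
    private readonly IActorRef _directoryProxy;

    // Populated by Akka.NET because the actor implements IWithTimers.
    public ITimerScheduler Timers { get; set; } = null!;

    public ResubscribingSubscriber(string id, IActorRef directoryProxy)
    {
        _id = id;
        _directoryProxy = directoryProxy;

        // Every tick re-sends the registration, so a lost reply or a missed
        // Terminated message is repaired on the next tick at the latest.
        Receive<ResubscribeTick>(_ => SendSubscribe());
    }

    protected override void PreStart()
    {
        SendSubscribe(); // register immediately on start

        // The interval is an arbitrary assumption for this sketch.
        Timers.StartPeriodicTimer(
            "resubscribe", ResubscribeTick.Instance, TimeSpan.FromSeconds(30));
    }

    private void SendSubscribe() =>
        _directoryProxy.Tell(new Subscribe(_id, Self));
}
```

The timer is cancelled automatically when the actor stops, so no cleanup is needed in `PostStop`.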

To Reproduce
Steps to reproduce the behavior:

  1. Create three Akka.NET nodes that form a cluster and configure a default ask timeout
  2. Create a singleton that can live on any of the nodes and a proxy for it on all nodes
  3. The singleton should allow other actors to subscribe and should reply with a message that contains its "Self" reference, which is a remote actor reference if the subscriber lives on one of the other nodes
  4. The subscriber should watch the returned actor reference of the singleton with a custom message
  5. The subscriber should also use the ask pattern for the subscribe message to the singleton
  6. Repeatedly terminate nodes
  7. Occasionally there is no termination message to the subscriber when the singleton dies
  8. Occasionally Ask on the remote actor returns neither a value nor a timeout exception
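
Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration, not our actual code: all message and class names (`Subscribe`, `Subscribed`, `SubscribeFailed`, `DirectoryTerminated`, `DirectoryActor`, `SubscriberActor`) and the 5-second timeout are assumptions, and the cluster, singleton manager, and proxy setup from steps 1 and 2 is omitted. `directoryProxy` is assumed to be the singleton proxy on the local node.

```csharp
using System;
using System.Collections.Generic;
using Akka.Actor;

// Hypothetical message types, not taken from the real project.
public sealed record Subscribe(string Id, IActorRef Subscriber);
public sealed record Subscribed(IActorRef Directory);
public sealed record SubscribeFailed(Exception Cause);

public sealed class DirectoryTerminated
{
    public static readonly DirectoryTerminated Instance = new();
    private DirectoryTerminated() { }
}

// Step 3: the singleton registers subscribers and replies with its own Self.
public sealed class DirectoryActor : ReceiveActor
{
    private readonly Dictionary<string, IActorRef> _subscribers = new();

    public DirectoryActor()
    {
        Receive<Subscribe>(msg =>
        {
            _subscribers[msg.Id] = msg.Subscriber;
            Sender.Tell(new Subscribed(Self));
        });
    }
}

// Steps 4 and 5: the subscriber asks for a subscription and watches the result.
public sealed class SubscriberActor : ReceiveActor
{
    private readonly string _id;
    private readonly IActorRef _directoryProxy;

    public SubscriberActor(string id, IActorRef directoryProxy)
    {
        _id = id;
        _directoryProxy = directoryProxy;

        // Watch the concrete singleton instance with a custom message.
        Receive<Subscribed>(msg =>
            Context.WatchWith(msg.Directory, DirectoryTerminated.Instance));

        // Retry when the Ask failed (e.g. timed out) or the singleton died.
        Receive<SubscribeFailed>(_ => TrySubscribe());
        Receive<DirectoryTerminated>(_ => TrySubscribe());
    }

    protected override void PreStart() => TrySubscribe();

    private void TrySubscribe()
    {
        // Pipe the Ask result back to Self so that both the reply and any
        // failure are handled as ordinary messages inside the actor.
        _directoryProxy
            .Ask<Subscribed>(new Subscribe(_id, Self), TimeSpan.FromSeconds(5))
            .PipeTo(Self, failure: ex => new SubscribeFailed(ex));
    }
}
```

With this shape, the expectation described in this issue is that every `TrySubscribe` call eventually produces either a `Subscribed` or a `SubscribeFailed` message, and that every successful watch eventually produces a `DirectoryTerminated` message when the singleton goes away.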

Expected behavior
I expected that Watch and Ask always return messages, even with remote actors.

Actual behavior
There seem to be some corner cases where Watch and Ask don't return messages (or, in the case of Ask, a timeout exception) for remote actors.

Environment
Are you running on Linux? Windows? Docker? Which version of .NET?

  • Kubernetes with Linux Docker images and the .NET 7.0 runtime

@SchiessMax
Author

AkkaRemoteSampleProject.zip
I tried to reproduce the behavior with a simple project that is similar to our production setup.
Whatever I tried, everything behaved correctly. I used the newest version, 1.5.20, but even with 1.5.15 it seemed to work.
We reworked our code a few times, so it could also have been an implementation error on our part.
I will close this issue as I can't reproduce it anymore. I'll attach the project in case anyone wants to take a look at it.
