Error in SplitBrainResolver.PreStart when using ChannelTaskScheduler for internal-dispatcher #5962

Closed
SchiessMax opened this issue May 25, 2022 · 10 comments

Comments

@SchiessMax

Version Information
Version of Akka.NET? 1.4.38
Which Akka.NET Modules? Akka.Cluster

Describe the bug
When the Akka.Cluster.SBR.SplitBrainResolver is used with the ChannelTaskScheduler configured as the internal dispatcher, an error is thrown in the PreStart method. The error is thrown when Cluster.Get(Context.System) is called.
I think this might be related to #5861

The error that is logged:

ValueFactory attempted to access the Value property of this instance.
      [akka://CasActorSystem/system/cluster/core/daemon/downingProvider#1885816536]: Akka.Actor.ActorInitializationException: Exception during creation
       ---> System.InvalidOperationException: ValueFactory attempted to access the Value property of this instance.
         at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
         at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
         at System.Lazy`1.CreateValue()
         at System.Lazy`1.get_Value()
         at Akka.Actor.Internal.ActorSystemImpl.TryGetExtension[T](T& extension)
         at Akka.Actor.Internal.ActorSystemImpl.GetExtension[T]()
         at Akka.Actor.ActorSystemWithExtensions.WithExtension[T,TI](ActorSystem system)
         at Akka.Cluster.Cluster.Get(ActorSystem system)
         at Akka.Cluster.SBR.SplitBrainResolver.PreStart()
         at Akka.Actor.ActorBase.AroundPreStart()
         at Akka.Actor.ActorCell.<>c__DisplayClass176_0.<Create>b__0()
         at Akka.Actor.ActorCell.UseThreadContext(Action action)
         at Akka.Actor.ActorCell.Create(Exception failure)
         --- End of inner exception stack trace ---
         at Akka.Actor.ActorCell.Create(Exception failure)
         at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState) 
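
The InvalidOperationException at the top of this trace is raised by System.Lazy<T> itself, when a value factory reads the Value of the same instance it is still initializing on the same thread. A minimal sketch of that guard, independent of Akka.NET, is shown here. The class name LazyReentrancyDemo and the string payload are hypothetical, and the sketch does not reproduce how the extension factory comes to be re-entered inside Akka.NET.

```csharp
using System;

public static class LazyReentrancyDemo
{
    // Returns a description of what happened when the lazy value was read.
    public static string Run()
    {
        // Hypothetical stand-in for a lazily created extension.
        Lazy<string> extension = null!;

        // The factory reads extension.Value while it is itself still running,
        // which is the re-entrant access that Lazy<T> refuses to allow.
        extension = new Lazy<string>(() => "cluster:" + extension.Value);

        try
        {
            return extension.Value;
        }
        catch (InvalidOperationException ex)
        {
            // Lazy<T> reports the recursion instead of deadlocking or overflowing the stack.
            return ex.GetType().Name + ": " + ex.Message;
        }
    }
}
```

Calling LazyReentrancyDemo.Run() returns a string that starts with the exception type name, followed by the runtime's message for this condition.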

To Reproduce
Steps to reproduce the behavior:

  1. Create an ActorSystem with the following configuration:
akka {
    actor {
        provider = cluster
        default-dispatcher = {
            executor = "channel-executor"
            channel-executor.priority = "normal"
        }
        # Adding this part in combination with the SplitBrainResolverProvider causes the error
        internal-dispatcher = {
            executor = "channel-executor"
            channel-executor.priority = "high"
        }
    }
    remote {
        dot-netty.tcp {
            port = 5508
            hostname = 127.0.0.1
        }
        default-remote-dispatcher {
            executor = "channel-executor"
            channel-executor.priority = "high"
        }
        backoff-remote-dispatcher {
            executor = "channel-executor"
            channel-executor.priority = "low"
        }
    }
    cluster {
        seed-nodes = ["akka.tcp://ActorSystem@127.0.0.1:5508"]
        downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
    }
}

Expected behavior
The node should start up without errors.

Actual behavior
An error is thrown in the SplitBrainResolver class

Environment
Are you running on Linux? Windows? Docker? Which version of .NET?
=> Error occurs in Linux, Windows and Docker. .NET Version: net6.0

@abeltsios

Hello, we have reached the same conclusion (I was about to post the same bug here).

Some more context:

The split brain resolver calls _cluster = Cluster.Get(Context.System); so that it can subscribe to cluster events during the PreStart phase.

When called from SplitBrainResolver, that leads to an exception being thrown in the ActorSystem's extension initialization code (i.e., Lazy.Value fails).
As such, the SBR actor never initializes (as it is unsupervised).

The exception thrown is

Cause: [akka://stx-pandora/system/cluster/core/daemon/downingProvider#2113384032]: Akka.Actor.ActorInitializationException: Exception during creation
 ---> System.InvalidOperationException: ValueFactory attempted to access the Value property of this instance.
   at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
   at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
   at System.Lazy`1.CreateValue()
   at System.Lazy`1.get_Value()
   at Akka.Actor.Internal.ActorSystemImpl.TryGetExtension[T](T& extension)
   at Akka.Actor.Internal.ActorSystemImpl.GetExtension[T]()
   at Akka.Actor.ActorSystemWithExtensions.WithExtension[T,TI](ActorSystem system)
   at Akka.Cluster.Cluster.Get(ActorSystem system)
   at Akka.Cluster.SBR.SplitBrainResolver.PreStart()
   at Akka.Actor.ActorBase.AroundPreStart()
   at Akka.Actor.ActorCell.<>c__DisplayClass176_0.<Create>b__0()
   at Akka.Actor.ActorCell.UseThreadContext(Action action)
   at Akka.Actor.ActorCell.Create(Exception failure)
   --- End of inner exception stack trace ---
   at Akka.Actor.ActorCell.Create(Exception failure)
   at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState)

After some quick debugging, a simple change like the one below seems to resolve the issue (sorry for the git diff dump):

diff --git a/src/core/Akka.Cluster/SBR/SplitBrainResolver.cs b/src/core/Akka.Cluster/SBR/SplitBrainResolver.cs
index 12c203e2c..1a528e0f7 100644
--- a/src/core/Akka.Cluster/SBR/SplitBrainResolver.cs
+++ b/src/core/Akka.Cluster/SBR/SplitBrainResolver.cs
@@ -39,13 +39,21 @@ public static Props Props2(TimeSpan stableAfter, DowningStrategy strategy)
             return Props.Create(() => new SplitBrainResolver(stableAfter, strategy));
         }

+        protected override bool Receive(object message)
+        {
+            if (_cluster == null)
+            {
+                _cluster = Cluster.Get(Context.System);
+                _cluster.Subscribe(Self, InitialStateAsEvents, typeof(IClusterDomainEvent));
+            }
+            return base.Receive(message);
+        }
+
         // re-subscribe when restart
         protected override void PreStart()
         {
-            _cluster = Cluster.Get(Context.System);
-            _cluster.Subscribe(Self, InitialStateAsEvents, typeof(IClusterDomainEvent));
-
-            base.PreStart();
+            Self.Tell(new object());
+            base.PreStart();
         }

@Aaronontheweb
Member

So #5861 (comment) was my attempted fix for this issue as Akka.NET v1.4.37 also has this bug, but it would occur in the constructor and blow up the cluster supervisor.

> As such SBR actor never initializes (as it is unsupervised).

All actors are supervised - and in this case the SBR is supervised by the default supervision strategy (restart.)

The worst case scenario right now is the SplitBrainResolver should just restart, whereas before cluster initialization itself could potentially fail (because throwing in the constructor blows up the parent instantiating the actor too, which in this case was one of the core clustering actors.) While this issue is disconcerting I don't think it's going to have much production impact.

@SchiessMax
Author

SchiessMax commented May 25, 2022

Ok, so the SplitBrainResolver should still work. I will try it out, and if it does work I will just have to live with the error log for now.

@Aaronontheweb
Member

It should definitely still work - but we'll fix this, either by breaking the SBR public API and passing in Cluster directly or by implementing some of @Zetanova 's suggestions here: #5447

@Aaronontheweb Aaronontheweb added this to the 1.4.39 milestone May 25, 2022
@Zetanova
Contributor

I have multiple PRs out for Cluster and ClusterDaemon.

I am still on 1.4.29-beta637735681605651069. In that version one PR got merged, and it was later pulled back because of a minor, already foreseen issue with legacy DI.

The main issue is simply the current extension management via Lazy<>.
See: #5447

I will try to implement an ExtensionManager soon.

@abeltsios

abeltsios commented May 25, 2022

> > As such SBR actor never initializes (as it is unsupervised).
>
> All actors are supervised - and in this case the SBR is supervised by the default supervision strategy (restart)

Pardon my phrasing!
Nothing triggers a restart, so anyone who reproduces this and then adds or removes nodes in the cluster will observe several dead-lettered messages.

My understanding from the logs below is that the SBR doesn't restart.

remoting (akka://akkasys) - Starting remoting
remoting (akka://akkasys) - Remoting started; listening on addresses : [akka.tcp://akkasys@hostname:4061]
remoting (akka://akkasys) - Remoting now listens on addresses: [akka.tcp://akkasys@hostname:4061]
Cluster (akka://akkasys) - Cluster Node [akka.tcp://akkasys@hostname:4061] - Starting up...
akka://akkasys/system/cluster/core/daemon/downingProvider - ValueFactory attempted to access the Value property of this instance.
[akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128]: Akka.Actor.ActorInitializationException: Exception during creation
---> System.InvalidOperationException: ValueFactory attempted to access the Value property of this instance.
	at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
	at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
	at System.Lazy`1.CreateValue()
	at System.Lazy`1.get_Value()
	at Akka.Actor.Internal.ActorSystemImpl.TryGetExtension[T](T& extension)
	at Akka.Actor.Internal.ActorSystemImpl.GetExtension[T]()
	at Akka.Actor.ActorSystemWithExtensions.WithExtension[T,TI](ActorSystem system)
	at Akka.Cluster.Cluster.Get(ActorSystem system)
	at Akka.Cluster.SBR.SplitBrainResolver.PreStart()
	at Akka.Actor.ActorBase.AroundPreStart()
	at Akka.Actor.ActorCell.<>c__DisplayClass176_0.<Create>b__0()
	at Akka.Actor.ActorCell.UseThreadContext(Action action)
	at Akka.Actor.ActorCell.Create(Exception failure)
	--- End of inner exception stack trace ---
	at Akka.Actor.ActorCell.Create(Exception failure)
	at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState)
Cluster (akka://akkasys) - Cluster Node [akka.tcp://akkasys@hostname:4061] - Started up successfully
akka://akkasys/system/cluster/core/daemon/downingProvider - Message [TimerMsg] to [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] was not delivered. [1] dead letters encountered. If this is not an expected behavior then [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
akka://akkasys/system/cluster/core/daemon/downingProvider - Message [TimerMsg] to [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] was not delivered. [2] dead letters encountered. If this is not an expected behavior then [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
akka://akkasys/system/cluster/core/daemon/downingProvider - Message [TimerMsg] to [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] was not delivered. [3] dead letters encountered. If this is not an expected behavior then [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
akka://akkasys/system/cluster/core/daemon/downingProvider - Message [TimerMsg] to [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] was not delivered. [4] dead letters encountered. If this is not an expected behavior then [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
akka://akkasys/system/cluster/core/daemon/downingProvider - Message [TimerMsg] to [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] was not delivered. [5] dead letters encountered. If this is not an expected behavior then [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
akka://akkasys/system/cluster/core/daemon/downingProvider - Message [TimerMsg] to [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] was not delivered. [6] dead letters encountered. If this is not an expected behavior then [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
akka://akkasys/system/cluster/core/daemon/downingProvider - Message [TimerMsg] to [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] was not delivered. [7] dead letters encountered. If this is not an expected behavior then [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
akka://akkasys/system/cluster/core/daemon/downingProvider - Message [TimerMsg] to [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] was not delivered. [8] dead letters encountered. If this is not an expected behavior then [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
akka://akkasys/system/cluster/core/daemon/downingProvider - Message [TimerMsg] to [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] was not delivered. [9] dead letters encountered. If this is not an expected behavior then [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
akka://akkasys/system/cluster/core/daemon/downingProvider - Message [TimerMsg] to [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] was not delivered. [10] dead letters encountered, no more dead letters will be logged in next [00:05:00]. If this is not an expected behavior then [akka://akkasys/system/cluster/core/daemon/downingProvider#2124238128] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

Should the above happen if the SBR had restarted?

@Aaronontheweb
Member

Aaronontheweb commented May 26, 2022

Well I learned something new today

Throwing an exception in an actor's PreStart block will not allow the actor to be restarted: https://replit.com/@Aaronontheweb/ActorsPreStart#main.cs

I'm trying to imagine if that's by design or not (it might be since the child actor can't really complete its startup sequence if it fails there.)

Either way, that means this bug is more severe than I thought. We'll get this patched right away.
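
The behavior described above can be sketched without the linked repro. This is a minimal sketch, assuming the Akka NuGet package is referenced and that the top-level actor is supervised by the default supervisor strategy. The names FailsInPreStart and PreStartFailureDemo are hypothetical, and this is not necessarily the code behind the link.

```csharp
using System;
using Akka.Actor;

// Hypothetical actor that always fails while starting.
public sealed class FailsInPreStart : UntypedActor
{
    protected override void PreStart()
    {
        throw new InvalidOperationException("simulated startup failure");
    }

    protected override void OnReceive(object message)
    {
        // Never reached if the actor does not survive PreStart.
        Sender.Tell(message);
    }
}

public static class PreStartFailureDemo
{
    // Returns true if the actor was terminated rather than restarted.
    public static bool ActorWasStopped()
    {
        var system = ActorSystem.Create("prestart-demo");
        try
        {
            var actor = system.ActorOf(Props.Create(() => new FailsInPreStart()), "fails");

            // Watch the actor from an inbox so a Terminated signal can be observed.
            var inbox = Inbox.Create(system);
            inbox.Watch(actor);

            try
            {
                return inbox.Receive(TimeSpan.FromSeconds(5)) is Terminated;
            }
            catch (Exception)
            {
                // No Terminated signal arrived within the timeout.
                return false;
            }
        }
        finally
        {
            system.Terminate().Wait();
        }
    }
}
```

A failure in PreStart during creation surfaces to the supervisor as an ActorInitializationException, as the traces in this thread show. A Terminated signal here means the supervisor stopped the actor instead of restarting it, which matches the dead letters logged for the downingProvider.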

Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue May 26, 2022
Arkatufus added a commit to Arkatufus/akka.net that referenced this issue May 26, 2022
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue May 26, 2022
@Aaronontheweb
Member

Have a fix for this in #5965

@Aaronontheweb
Member

Verifiable reproduction by @Arkatufus here #5966

Arkatufus pushed a commit that referenced this issue May 26, 2022
* pass `Akka.Cluster.Cluster` into `IDowningProvider` directly

close #5962

* enhanced `StartupWithOneThreadSpec` to include SBR

* C#-ified HOCON

* added #5962 repro

* fixed `AutoDown`

* fixed other `IDowningProvider` instances

* updated Akka.Cluster API approvals
@Aaronontheweb
Member

Aaronontheweb commented May 26, 2022

Resolved via #5965
