Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster startup hard block #5515

Open
wants to merge 6 commits into
base: dev
Choose a base branch
from

Conversation

Zetanova
Copy link
Contributor

Fixes #5498
Fix startup with missing config for channel-executor

Changes

The constructor thread of cluster startup got reused by the default TaskScheduler used by the ChannelExecutor.

@@ -140,7 +140,11 @@ public Cluster(ActorSystemImpl system)
_readView = new ClusterReadView(this);

// force the underlying system to start
_clusterCore = GetClusterCoreRef().Result;
// and hard block the current thread
var clusterCoreTask = Task.Run(GetClusterCoreRef);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any flags we should consider adding here? i.e. would it be better to do Task.Factory.StartNew with LongRunning and DenyChildAttach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_clusterCore = GetClusterCoreRef().Result was always locking and I think its still is.

I think what happend is that after the Ask got improved the current thread gets used for the actorcell dispatcher.
I don't know how/where exactly, but the child-actors of cluster-daemon are using it.

´Task.Run(GetClusterCoreRef)´ should force the ask on a different thread,
this is the only thing that matters, that it the current thread does not leak.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what happend is that after the Ask got improved the current thread gets used for the actorcell dispatcher.

Interesting. Would there be other implications if this is true?

cc: @Aaronontheweb

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actor system extensions implementation and the locking cluster constructor are both just anti-patterns
and we need to rework it in the future.
#5447

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe its just the "ConfigureAwait(false)" of

return await _clusterDaemons.Ask<IActorRef>(new InternalClusterAction.GetClusterCoreRef(this), timeout).ConfigureAwait(false);

The Cluster Extension is called by the ClusterActorRefProvider and it is created in the ActorSystem Startup.
The thread is the same as of the ActorSystem creator and with the normal ForkJoinExecutor they will never mix.
But with the ChannelExecutor that uses the normal ThreadPool of dotnet, the awaiting thread can be used for an ActorCell.

var task = Task.Run(...);
task.Wait();

I hope that simply resolve the problem.

task.Result blocked at .net 4.5 for sure,
I don't know why/how the thread gets now reused.

@Zetanova Zetanova mentioned this pull request Jan 27, 2022
2 tasks
@Aaronontheweb
Copy link
Member

I have not had time to review this - hopefully I'll be able to come up for air next week.

{
case "internal-dispatcher":
case "default-remote-dispatcher":
Priority = TaskSchedulerPriority.High;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These default values should be baked in the default HOCON config, not in code; i.e. they have to be transparent to the user, user should only read the config file to figure out the default value of a setting, not dig through the source code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are the default in the config.
But the testkit does not load the default config

@@ -120,8 +119,26 @@ internal sealed class ChannelExecutorConfigurator : ExecutorServiceConfigurator
{
public ChannelExecutorConfigurator(Config config, IDispatcherPrerequisites prerequisites) : base(config, prerequisites)
{
var cfg = config.GetConfig("channel-executor");
Priority = (TaskSchedulerPriority)Enum.Parse(typeof(TaskSchedulerPriority), cfg.GetString("priority", "normal"), true);
var priorityName = config.GetString("channel-executor.priority", "None") ?? "None";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you're accessing ActorSystem.Settings.Config directly, always assume that a Config object passed into a method can be null. Need to check for null config instance here.

@@ -120,8 +119,26 @@ internal sealed class ChannelExecutorConfigurator : ExecutorServiceConfigurator
{
public ChannelExecutorConfigurator(Config config, IDispatcherPrerequisites prerequisites) : base(config, prerequisites)
{
var cfg = config.GetConfig("channel-executor");
Priority = (TaskSchedulerPriority)Enum.Parse(typeof(TaskSchedulerPriority), cfg.GetString("priority", "normal"), true);
var priorityName = config.GetString("channel-executor.priority", "None") ?? "None";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes does not have anything to do with making cluster startup to block, is it? Can you remove these and move it to a new PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is required to use the channel-executor in the spec.

I added one spec that simple starts the cluster with the channel-executor and discovered this problems.

The spec itself will not trigger the original failure, because the testkit is using a SynchronisationContext
for the akka system startup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but this PR can easily be broken into 2 PR, one depending to the other.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be a problem anymore, fixed in #5568

config.GetInt("parallelism-min"),
config.GetDouble("parallelism-factor", 1.0D), // the scalar-based factor to scale the threadpool size to
config.GetInt("parallelism-max"));
config?.GetInt("parallelism-min", 4) ?? 4,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here, these should be placed in a different PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error when using ChannelTaskScheduler for internal-dispatcher in akka cluster node
4 participants