Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HashedWheelTimer startup crash on .NET 6+ #7174

Merged

Conversation

Aaronontheweb
Copy link
Member

@Aaronontheweb Aaronontheweb commented Apr 26, 2024

Changes

More issues with the PeriodicTimer - this time it's a bit of a spooky heisenbug. Added a spec that can reproduce it but it must be run continuously in order to catch it.

Originally reproduced at petabridge/TurboMqtt#55

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

More issues with the `PeriodicTimer` - this time it's a bit of a spooky heisenbug. Added a spec that can reproduce it but it must be run continuously in order to catch it.
Copy link
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe our problems were related to the combination of Interlocked and volatile both trying to access the same field across multiple cores - but I don't think I can prove it other than to say that since I made this change to the HashedWheelTimerScheduler I've run this test continuously for over 30 minutes (it runs about once a second) without any errors. Previously, it would fail in about 5 minutes.

I'll keep the thing running for a while longer and report back if the problem isn't fixed, but I think this should resolve it.

src/core/Akka/Actor/Scheduler/HashedWheelTimerScheduler.cs Outdated Show resolved Hide resolved
@Aaronontheweb Aaronontheweb added the NO MERGE Don't merge. label Apr 26, 2024
@Aaronontheweb
Copy link
Member Author

Aaronontheweb commented Apr 26, 2024

Took a while but was able to make this crash even without the volatile - I'll go ahead and add that back then. I think what I'll do instead is just remove this else statement and have it loop back to the top via a goto.

edit: came up with a safer, less complicated solution

Copy link
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this change by doing only one volatile read at the start of the if statement - been running my test case at a rate of once per second for about 2 hours and haven't seen the problem occur since.

// </copyright>
// -----------------------------------------------------------------------

#nullable enable
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enabled nullable for just this test - on my quest to ensure that we start doing that generally everywhere.

@@ -147,25 +147,27 @@ private static int NormalizeTicksPerWheel(int ticksPerWheel)

private void Start()
{
if (_workerState == WORKER_STATE_STARTED)
// only read the worker state once so it can't be a moving target for else-branch
var workerStateRead = _workerState;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key fix - cache the value once so we're not doing a volatile read on each branch of the if..else - that's what's lead to the problems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Aaronontheweb Aaronontheweb removed the NO MERGE Don't merge. label Apr 26, 2024
@Aaronontheweb
Copy link
Member Author

The solution is good for merge now

Copy link
Contributor

@Arkatufus Arkatufus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -147,25 +147,27 @@ private static int NormalizeTicksPerWheel(int ticksPerWheel)

private void Start()
{
if (_workerState == WORKER_STATE_STARTED)
// only read the worker state once so it can't be a moving target for else-branch
var workerStateRead = _workerState;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@to11mtm
Copy link
Member

to11mtm commented Apr 26, 2024 via email

@Aaronontheweb
Copy link
Member Author

I believe our problems were related to the combination of Interlocked and volatile both trying to access the same field across multiple cores

this was wrong, btw - looks like it was the concurrent read / write access to volatile causing us to jump into the else statement that did it. Had nothing to do with Interlocked.

@Aaronontheweb
Copy link
Member Author

 Failed Akka.Cluster.Metrics.Tests.MetricValuesSpec.NodeMetrics_MetricValues_should_extract_expected_metrics_for_load_balancing [7 s]
  Error Message:
   System.NullReferenceException : Object reference not set to an instance of an object.
  Stack Trace:
     at Akka.Cluster.Metrics.Tests.MetricValuesSpec.NodeMetrics_MetricValues_should_extract_expected_metrics_for_load_balancing() in D:\a\1\s\src\contrib\cluster\Akka.Cluster.Metrics.Tests\MetricSpec.cs:line 324
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)
  Passed Akka.Cluster.Metrics.Tests.EWMASpec.EWMA_should_calculate_correct_ewma_for_normal_d

Not related to my changes - I have an open PR for investigating this one at home.

@Aaronontheweb Aaronontheweb merged commit 1ef1869 into akkadotnet:dev Apr 26, 2024
8 of 12 checks passed
@Aaronontheweb Aaronontheweb deleted the fix-HashedWheelTimer-failure branch April 26, 2024 22:29
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this pull request Apr 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants