[PERF] .NET6 Performance Regression #5385

Closed
Aaronontheweb opened this issue Nov 15, 2021 · 31 comments

@Aaronontheweb (Member) commented Nov 15, 2021

Version Information
Version of Akka.NET? v1.4.28
Which Akka.NET Modules? Akka, Akka.Remote

Describe the performance issue
Our RemotePingPong benchmark has been the standard we've used for roughly seven years to measure throughput over a single Akka.Remote connection between two ActorSystem instances. It's a crucial benchmark because it measures the biggest bottleneck in Akka.Remote networks: the end-to-end response time over a single connection.

Over the lifespan of .NET Core since 2017, we've seen steady improvements in these benchmark numbers each time a new version of the .NET runtime is released, usually as a result of improvements to the underlying threading / concurrency / IO primitives in the runtime itself.

With the release of .NET 6, however, we've noticed that while overall throughput by some measures remains higher than on .NET 5 for those same reasons, there are steady, reproducible, long-lasting drops in total throughput that occur only on .NET 6.

Data and Specs
Here are the RemotePingPong numbers from my local development machine, a Gen 1 8-core Ryzen, on .NET Core 3.1:

[image: RemotePingPong benchmark results on .NET Core 3.1]

Edit: updated the .NET Core 3.1 benchmark numbers to include the settings from #5386

And here are the equivalent numbers for this same benchmark on .NET 6:

[image: RemotePingPong benchmark results on .NET 6]

I've been able to reproduce this consistently - a sustained drop in throughput that lasts for roughly 30s. We've also noticed this in the Akka.NET test suite since merging in #5373 - the number of failures in the test suite has grown and has started to include tests that historically have not been racy. We've also observed this separately in the Phobos repository which we also upgraded to use the .NET 6 SDK.

There is definitely something amiss here with how Akka.NET runs on top of .NET 6.

Expected behavior
A consistent level of performance across all benchmarks.

Actual behavior
Intermittent lag, declines in throughput, and unexplained novel race conditions.

Environment
.NET 6, Windows

Additional context
There is some speculation from other members of the Akka.NET team that the issue could be related to some of the .NET ThreadPool and thread-injection changes made in .NET 6.

@to11mtm (Member) commented Nov 16, 2021

@Aaronontheweb I would suggest getting some more heuristics on the threads, i.e. set up multisampling (reading the number of ThreadPool threads should be cheap overall) and figure out the average thread count, the max thread count, and the thread count sustained over 95% of the run. I can hack up a snippet to help if you'd like.

If I had to guess overall, I'd wager it's an issue with the changes to the ThreadPool hill-climbing algorithm in the core runtime, MAYBE with some interplay with ThreadLocal if we are creating and destroying threads more frequently (not sure TBH, would need to dig).

It may or may not be worth trying to run this with the actors running under the DedicatedThreadPool (DTP) wherever possible, in both environments; perf may be lower in both cases, but that would at least help isolate whether it -is- the .NET threadpool itself or something else creeping in.

@Aaronontheweb (Member Author)

I can hack up a snippet to help if you'd like.

That would be very helpful!

It may or may not be worth it to try running this with the actors running under DTP wherever possible under both environments; perf may be lower in both cases but that would at least help to isolate whether it -is- the .NET threadpool itself or something else creeping in.

That's a good idea. This should be easy to configure.

@Zetanova noticed anything like this on .NET 6 with the ChannelDispatcher?

@to11mtm (Member) commented Nov 17, 2021

I can hack up a snippet to help if you'd like.

That would be very helpful!

Here you go: https://gist.github.com/to11mtm/1c3f5137a207d59d5f3e61bb198aeeae . Note I haven't exactly -tested- this given the way I'm abusing CTS, and it's not -quite- thread safe (i.e. you'd need to make changes to be able to observe values while it is monitoring). The only thing you might want to tweak is adding a method to pre-allocate the sample list and minimize the chance of a GC happening:

public void PresetContainer(TimeSpan interval, TimeSpan worstCase)
{
    // Pre-size the sample list so it never grows (and allocates) mid-run.
    ThreadStats = new List<int>((int)(worstCase / interval) + 1);
}

There are plenty of other opportunities for cleanup/niceness there, but it's a good start if you want to get fancier (e.g. track when the timer fires noticeably off its interval and fails to maintain the sampling rate, which is an indication of complete system overpressure since we are on a dedicated thread in this loop).
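For readers without the gist handy, the general shape of such a sampler is roughly the following. This is a self-contained sketch, not the gist itself - the type and member names are made up, and like the gist it is not strictly thread-safe while sampling:

// Sketch of a dedicated-thread ThreadPool sampler (illustrative names only).
// ThreadPool.ThreadCount requires .NET Core 3.0+.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public sealed class ThreadCountSampler
{
    private readonly List<int> _samples = new List<int>();
    private volatile bool _running = true;

    public Thread Start(TimeSpan interval)
    {
        var sampler = new Thread(() =>
        {
            while (_running)
            {
                _samples.Add(ThreadPool.ThreadCount); // pool-owned worker + IO threads
                Thread.Sleep(interval);
            }
        }) { IsBackground = true };
        sampler.Start();
        return sampler;
    }

    // Not thread-safe to call while sampling is still in flight - stop first.
    public (double Average, int Max, int P95) Stop()
    {
        _running = false;
        var sorted = _samples.OrderBy(x => x).ToArray();
        return (sorted.Average(),
                sorted[sorted.Length - 1],
                sorted[(int)(sorted.Length * 0.95)]);
    }
}

You'd start it on its own background thread just before kicking off the benchmark and call Stop() once the run completes.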

@Aaronontheweb (Member Author)

Ah thanks, so do I just launch that in the background along with the benchmark to have it gather samples concurrently in its own thread?

@Zetanova (Contributor)

Sorry, I don't have experience with .NET 6 yet; I always wait a few months after a release.

For a sequential workload the thread count is too high. The optimal count would be:

Akka.Remote:
2x IO threads for the network
1-3x threads for processing and serialization

Akka.Actor:
8x (the logical processor count) for ActorDispatcher instances
0x - no async Tasks inside the actors

This makes a theoretical max total for your system of 11-13 threads per node.

If more threads/tasks than this theoretical max total are scheduled, the execution sequence will rely entirely on the external OS thread scheduler and the .NET TaskScheduler algorithms. Those algorithms are "superior" to anything we could write, but they work on a subset of the information - one property that is critical for us is missing.

That's why I made https://github.com/Zetanova/Akka.Experimental.ChannelTaskScheduler: to get control of the sequence of execution and to avoid overloading the TaskScheduler with tasks of mixed priority and tasks that could be delayed for "minutes".

The actor system workload has a unique task-scheduling property: the ability to indefinitely delay the execution of ActorCells. If the whole workload - more than the physically possible concurrent count - is queued inside a normal TaskScheduler at once, this property is simply lost. The TaskScheduler algorithm will do its best to execute all tasks as fast as possible, but as mentioned above it is working with a subset of the information.

Using System.Threading.Channels is not a requirement, but it solves major problems:

  1. thread-safe dequeue and enqueue operations in a performant way (and it's managed)
  2. it can use async task operations and utilize the regular .NET TaskScheduler (no dedicated thread management required)
  3. prioritization of work items (system and cluster work items) - sketched below
  4. simple implementation
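To illustrate point 3, a minimal sketch of priority draining with two channels might look like this. It is illustrative only - not the ChannelTaskScheduler code - and the type name is invented:

// Two unbounded channels; system/cluster work is always drained before user work.
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class TwoLevelWorkQueue
{
    private readonly Channel<Action> _high = Channel.CreateUnbounded<Action>();
    private readonly Channel<Action> _normal = Channel.CreateUnbounded<Action>();

    public void Enqueue(Action work, bool highPriority = false) =>
        (highPriority ? _high : _normal).Writer.TryWrite(work);

    // Runs as a plain async loop on the regular .NET ThreadPool (point 2 above).
    public async Task RunAsync()
    {
        while (true)
        {
            // Drain all high-priority items first (point 3 above).
            while (_high.Reader.TryRead(out var systemWork))
                systemWork();

            if (_normal.Reader.TryRead(out var userWork))
            {
                userWork();
                continue;
            }

            // Nothing ready on either channel: wait without blocking a thread.
            await Task.WhenAny(
                _high.Reader.WaitToReadAsync().AsTask(),
                _normal.Reader.WaitToReadAsync().AsTask());
        }
    }
}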

@Aaronontheweb
Long story short:

  1. You can try to run the RemotePingPong benchmark with the ChannelDispatcher.
    I am confident that this issue will simply disappear.

  2. You can make a simple test benchmark with a 2-node cluster where
    Node A sends a very high amount of work to Node B and
    Node B executes a fake Task.Delay(msg.WorkSize).PipeTo(Sender, new WorkCompleted(msg.Id)) (a rough sketch of such a handler follows below).
    The test should run over a fixed period (3-5 minutes); the completed work items over this period should be counted
    and used as the score/result.
    What I expect is that the current/old scheduling system will produce a lot of delayed cluster work
    or even a node disassociation.
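A minimal, hypothetical sketch of that Node B handler - the Work/WorkCompleted message types and the FakeWorker name are made up for illustration, and it replies via a captured Sender rather than PipeTo:

// Illustrative only; message types and actor name are invented for this sketch.
using System;
using System.Threading.Tasks;
using Akka.Actor;

public sealed record Work(Guid Id, TimeSpan WorkSize);
public sealed record WorkCompleted(Guid Id);

public sealed class FakeWorker : ReceiveActor
{
    public FakeWorker()
    {
        ReceiveAsync<Work>(async msg =>
        {
            var sender = Sender;                    // capture before the await
            await Task.Delay(msg.WorkSize);         // fake the unit of work
            sender.Tell(new WorkCompleted(msg.Id)); // Node A counts these for the score
        });
    }
}

Node A would then simply count WorkCompleted replies over the fixed test window.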

@Aaronontheweb (Member Author)

You can try to run the RemotePingPong benchmark with the ChannelDispatcher
I am confident that this issue will simply disappear.

I'll give that a shot @Zetanova and report back on here.

@Aaronontheweb (Member Author)

Filed a CoreCLR issue here: dotnet/runtime#62967

@Arkatufus (Contributor)

These are my findings from trying to run Akka.Tests.Actor.Scheduler.DefaultScheduler_ActionScheduler_Schedule_Tests.ScheduleRepeatedly_in_milliseconds_Tests_and_verify_the_interval:

  • Under very limited resources (1 virtual CPU and 3 GB of memory), HashedWheelTimerScheduler always lags 500-800 ms on every platform, in both the old and the new code.
  • Under 2 virtual CPUs, netcoreapp3.1 passed but net6.0 failed for both code variants.
  • Under 3 virtual CPUs, both netcoreapp3.1 and net6.0 pass for both code variants.

It looks like the spec failure is actually inherent in the HashedWheelTimerScheduler and only appears when it is resource-starved; the new net6.0 thread pool is simply more resource-hungry than the previous thread pool implementation.

@Aaronontheweb (Member Author)

Related PR: #5441

@Zetanova (Contributor)

I could fix the test itself by priming it; see:
4d4e9eb

@KieranBond commented May 5, 2022

@Aaronontheweb is there any update on this? We've been holding off upgrading to .NET 6 due to this issue, but as it's coming to EOL we are starting to run out of time a bit.

@Aaronontheweb (Member Author)

@KieranBond in all of our large scale testing we've been doing on https://github.com/petabridge/AkkaDotNet.LargeNetworkTests I haven't seen any issues even in a 200+ node cluster, but we're running using the channel-executor there which will become the new default in Akka.NET v1.5. The channel-executor does come with a bit of a throughput penalty but it's 5x more efficient when it comes to CPU utilization.
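For anyone following along, enabling the channel executor is roughly a one-line dispatcher override. This is a sketch; the "channel-executor" HOCON key is taken from the Akka.NET channel-executor docs and may differ slightly between versions:

// Sketch only: switching the default dispatcher onto the channel executor via HOCON.
using Akka.Actor;
using Akka.Configuration;

class ChannelExecutorSample
{
    static void Main()
    {
        var config = ConfigurationFactory.ParseString(@"
            akka.actor.default-dispatcher {
                executor = channel-executor
            }");

        var system = ActorSystem.Create("benchmark", config);

        // ... create actors / run the workload as usual ...

        system.Terminate().Wait();
    }
}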

I'm running a comparison today against .NET Core 3.1, now that akkadotnet/Akka.Management#563 is fixed - that's what stopped me from working on this last night on my Twitch stream, per https://twitter.com/Aaronontheweb/status/1522024859571822595

@Aaronontheweb (Member Author)

@KieranBond I just completed my experiment today, and I am somewhat baffled by the results. Going to repost my experiment notes here. This is all using the channel-executor.

.NET Core 3.1

We decided to re-run some of our experiments on .NET Core 3.1, to account for differences in the .NET ThreadPool implementation.

Experiment 9: No DistributedPubSub, Akka.Persistence Sharding, 200 Nodes, .NET Core 3.1

The cluster remained stable at 200 nodes:

[image: cluster node count / stability graph at 200 nodes]

Processing roughly 5000 msg/s.

Total thread count was elevated on .NET Core 3.1: roughly 71 reported threads per process, rather than the 66 observed on .NET 6:

[image: per-process thread count graph]

Also interesting: CPU utilization and memory usage were both significantly lower on .NET Core 3.1:

[image: CPU and memory utilization graphs]

Memory usage is normally around 80% in .NET 6 and CPU utilization around 14.7%.


The memory usage is the most shocking difference here - a difference of about 20GB worth of usage across the cluster.

I'm going to retest with DistributedPubSub enabled as this will dial up the message and network traffic quite a bit.

@Aaronontheweb (Member Author)

The message processing rates for this cluster are in line with what we've been seeing on .NET 6 - here's the graph from my last .NET 6 experiment (without DistributedPubSub enabled):

[image: message processing rate graph from the last .NET 6 experiment]

@Aaronontheweb (Member Author)

I might have been premature in my comments here regarding .NET Core 3.1 / .NET 6 - I have a bunch more data that I'm going to publish from running several more iterations of this experiment on the same AKS cluster.

@davidfowl

cc @kouvel as an FYI

@Aaronontheweb (Member Author) commented May 6, 2022

Some throughput benchmarks from this morning - one set run with the normal Akka.NET v1.4 defaults (dedicated thread pool) and one with the proposed v1.5 defaults (System.Threading.Channels over the .NET ThreadPool).

All benchmarks run using https://github.com/akkadotnet/akka.net/tree/dev/src/benchmark/RemotePingPong

Akka.NET V1.4 Defaults (Dedicated Thread Pool)

.NET 6

OSVersion: Microsoft Windows NT 10.0.19044.0
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True
Thread count: 112

Num clients, Total [msg], Msgs/sec, Total [ms], Start Threads, End Threads
1, 200000, 130040, 1538.57, 112, 139
5, 1000000, 292227, 3422.35, 147, 161
10, 2000000, 314912, 6351.00, 169, 169
15, 3000000, 299581, 10014.01, 177, 162
20, 4000000, 217877, 18359.81, 170, 145
25, 5000000, 199761, 25030.86, 154, 140
30, 6000000, 313186, 19158.96, 148, 140

.NET Core 3.1

OSVersion: Microsoft Windows NT 6.2.9200.0
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True
Thread count: 112

Num clients, Total [msg], Msgs/sec, Total [ms], Start Threads, End Threads
1, 200000, 134862, 1483.99, 112, 135
5, 1000000, 290361, 3444.48, 146, 163
10, 2000000, 298776, 6694.68, 171, 171
15, 3000000, 304724, 9845.54, 179, 163
20, 4000000, 304554, 13134.47, 171, 153
25, 5000000, 298597, 16745.21, 161, 140
30, 6000000, 301084, 19928.24, 149, 137

In both cases, when we hit the client count = 20, 25 iterations the thread count drops quite a bit (I'm assuming due to hill-climbing) - but on .NET Core 3.1 there's really no performance loss, whereas on .NET 6 throughput drops from ~300k msg/s to ~200k msg/s and stays there for a period of time. We've been able to reproduce this regularly. Worth noting that I'm running all of these benchmarks on a Gen 1 Ryzen machine, in case that makes any difference.

Akka.NET V1.5 Proposed Defaults (System.Threading.Channels over .NET ThreadPool)

I wanted to re-run these numbers without our DedicatedThreadPool in the mix, since that's a factor in our Akka.NET v1.4 defaults. In v1.5 we're considering moving to the System.Threading.Channels-based dispatcher by default, which only uses the built-in .NET ThreadPool: #5908. We're primarily motivated to do that because idle CPU consumption is significantly lower under this configuration - about 1/5 of our current defaults.

.NET 6

OSVersion: Microsoft Windows NT 10.0.19044.0
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True
Thread count: 79

Num clients, Total [msg], Msgs/sec, Total [ms], Start Threads, End Threads
1, 200000, 127633, 1567.73, 79, 95
5, 1000000, 179566, 5569.03, 114, 115
10, 2000000, 178923, 11178.54, 123, 118
15, 3000000, 140621, 21334.81, 130, 112
20, 4000000, 174902, 22870.84, 120, 103
25, 5000000, 177766, 28127.82, 113, 102
30, 6000000, 177180, 33864.27, 112, 99

It's not as easy to notice, but it's just as consistent in this test: once the .NET 6 threadpool pares down the number of threads at Num clients = 15, there is a noticeable drop in throughput. I was able to repeat this three times before posting these figures.

.NET Core 3.1

OSVersion: Microsoft Windows NT 6.2.9200.0
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: True
Thread count: 79

Num clients, Total [msg], Msgs/sec, Total [ms], Start Threads, End Threads
1, 200000, 107643, 1858.75, 79, 95
5, 1000000, 155812, 6418.24, 114, 114
10, 2000000, 156250, 12800.50, 123, 127
15, 3000000, 156921, 19118.55, 135, 115
20, 4000000, 156937, 25488.62, 123, 104
25, 5000000, 156529, 31943.29, 120, 103
30, 6000000, 156871, 38248.70, 117, 87

Conclusion

Across both configurations of our RemotePingPong benchmark there is a reproducible, consistent drop in performance once the .NET 6 threadpool begins scaling down the number of total threads in-process. The .NET Core 3.1 threadpool also decreases thread counts around those intervals, but with no significant loss in performance.

@KieranBond

Thanks for getting back so quickly and investigating some more, @Aaronontheweb.
Any suggestions on your end? Is there a plan moving forward to tackle this?

@Aaronontheweb (Member Author)

@KieranBond so here's the good news, I think - the perf loss on .NET 6 happens when the .NET ThreadPool is trimming threads. We're running a single Akka.Remote connection under max load here (it's hitting the EndpointReader bottleneck, which is a flow-control issue we're tasked with fixing in v1.5), and the thread pool will eventually start pruning under-utilized threads per its hill-climbing algorithm, depending on the load. I suspect this will only happen:

  1. Shortly after process startup or
  2. Shortly after a significant drop in load.

I don't think you should have any major performance issues running .NET 6 in a long-lived application. I ran .NET 6 for hours inside a not-extremely-busy, but continuously busy large Akka.NET cluster without seeing any changes in the thread count in either direction.

[image: thread count over time in the large-scale .NET 6 cluster]

On a 16 vCPU machine, the thread count stayed at around 66 once the cluster was bigger than ~40 nodes and then stayed that way all throughout the entire deployment, eventually 200 nodes.

@KieranBond commented May 6, 2022

Another question - the experiments you're performing are obviously testing very high throughput, but do you think the results carry over to less busy systems? i.e. is this performance issue worth worrying about if your system is not maximizing Akka throughput and is hitting around the 30k msg/s mark?

@KieranBond commented May 6, 2022

Another question - the experiments you're performing are obviously testing very high throughput, but do you think the results carry over to less busy systems? i.e. is this performance issue worth worrying about if your system is not maximizing Akka throughput and is hitting around the 30k msg/s mark?

Think you've just answered my question as I asked it!

@Aaronontheweb (Member Author) commented May 6, 2022

@KieranBond glad to help! While I think there is a real issue with the .NET ThreadPool here, I'm going to close this issue because:

  1. The RemotePingPong benchmark is an outlier - I've never seen a production setup that goes from instantiating a new process -> 100% peak Akka.Remoting load in a matter of seconds - the horizontally scalable nature of Akka.Remote usually mitigates that as the work gets spread out among the individual ActorSystems;
  2. The issue only occurs under temporary conditions that don't happen often, especially on systems that are consistently busy with at least some work to do even if it's lightweight (as my data from large-scale testing shows, thread counts don't change much in prod when there's work to do.)
  3. I don't want to give the impression that this is something that's going to have adverse effects on production users, now that I understand under what conditions this issue occurs. It seems like something I'm more likely to run into in a benchmark than in a production usage scenario.

@Aaronontheweb (Member Author)

Adding some additional benchmarks to Akka.NET to attempt to measure this. The most interesting reading is here so far: #6127 (comment)

@Aaronontheweb (Member Author)

A new theory regarding this is that the root cause might actually be GC changes introduced in .NET 6, rather than the thread pool.

I started some experiments with server-mode GC disabled to see if I could reproduce the perf drop - I can, and it's even more noticeable between .NET Core 3.1 and .NET 6.

These runs use the latest v1.5 bits.

.NET Core 3.1

OSVersion: Microsoft Windows NT 6.2.9200.0
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: False
Thread count: 97

Num clients, Total [msg], Msgs/sec, Total [ms], Start Threads, End Threads
1, 200000, 110620, 1808.04, 97, 106
5, 1000000, 148523, 6733.10, 115, 130
10, 2000000, 141343, 14150.04, 138, 128
15, 3000000, 150595, 19921.44, 139, 111
20, 4000000, 149500, 26756.90, 119, 106
25, 5000000, 148140, 33752.48, 114, 104
30, 6000000, 148858, 40307.67, 113, 102

.NET 6

OSVersion: Microsoft Windows NT 10.0.19044.0
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: False
Thread count: 97

Num clients, Total [msg], Msgs/sec, Total [ms], Start Threads, End Threads
1, 200000, 103040, 1941.88, 97, 112
5, 1000000, 164312, 6086.53, 120, 127
10, 2000000, 160129, 12490.72, 135, 127
15, 3000000, 144572, 20751.23, 135, 113
20, 4000000, 134256, 29794.30, 121, 108
25, 5000000, 133483, 37458.73, 116, 109
30, 6000000, 139099, 43135.17, 117, 107

@Aaronontheweb (Member Author)

Changed RemotePingPong to use manual GC:

private static async Task Start(uint timesToRun)
{         
    for (var i = 0; i < timesToRun; i++)
    {
        var redCount = 0;
        var bestThroughput = 0L;
        foreach (var throughput in GetClientSettings())
        {
            GC.Collect(); // before we start
            var result1 = await Benchmark(throughput, repeat, bestThroughput, redCount);
            bestThroughput = result1.Item2;
            redCount = result1.Item3;
            GC.Collect(); // after we finish for good measure
        }
    }

    Console.ForegroundColor = ConsoleColor.Gray;
    Console.WriteLine("Done..");
}

And I kept server and concurrent GC both disabled across both sets of benchmarks.

<PropertyGroup>
    <ServerGarbageCollection>false</ServerGarbageCollection>
    <ConcurrentGarbageCollection>false</ConcurrentGarbageCollection>
</PropertyGroup>
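As a quick sanity check that these settings actually took effect (presumably what drives the "Is Server GC" line in the benchmark output), the active GC mode can be read back at runtime:

// Verify at runtime which GC mode the process actually got.
using System;
using System.Runtime;

class GcModeCheck
{
    static void Main()
    {
        Console.WriteLine($"Is Server GC: {GCSettings.IsServerGC}");
        // Batch generally indicates concurrent (background) GC is disabled.
        Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");
    }
}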

.NET Core 3.1

OSVersion: Microsoft Windows NT 6.2.9200.0
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: False
Thread count: 96

Num clients, Total [msg], Msgs/sec, Total [ms], Start Threads, End Threads
1, 200000, 109111, 1833.03, 96, 124
5, 1000000, 148611, 6729.97, 132, 132
10, 2000000, 149533, 13375.97, 140, 130
15, 3000000, 155513, 19291.87, 139, 110
20, 4000000, 158103, 25300.11, 118, 105
25, 5000000, 158705, 31505.55, 115, 104
30, 6000000, 158112, 37948.17, 112, 100

.NET 6

OSVersion: Microsoft Windows NT 10.0.19044.0
ProcessorCount: 16
ClockSpeed: 0 MHZ
Actor Count: 32
Messages sent/received per client: 200000 (2e5)
Is Server GC: False
Thread count: 96

Num clients, Total [msg], Msgs/sec, Total [ms], Start Threads, End Threads
1, 200000, 117995, 1695.31, 96, 121
5, 1000000, 158053, 6327.16, 129, 129
10, 2000000, 159034, 12576.84, 137, 128
15, 3000000, 168209, 17835.17, 137, 112
20, 4000000, 151631, 26380.99, 120, 108
25, 5000000, 154962, 32266.80, 116, 106
30, 6000000, 157588, 38074.26, 115, 104

Still see a perf drop around clients=20, but it's not nearly as pronounced.

@Zetanova (Contributor)

@Aaronontheweb This is one of the reasons I wanted to remove the unique 'useless' ActorPathFactory instance for every ActorPath - to reduce the instance count; it's highly likely that those instances make it into Gen 1 or Gen 2. But you reverted the PR because of an untested binary-compatibility issue with the old DI system.
Can I make the PR again against 1.5?

@Aaronontheweb (Member Author)

@Zetanova oh yeah, you can definitely submit that for v1.5 - the old DI is no longer supported and has already been removed.

@Aaronontheweb (Member Author)

Going to reopen this because actual end-users have been reporting issues here over the past month.

Aaronontheweb reopened this Oct 19, 2022
@ismaelhamed (Member)

Any improvement in .NET 7, now that it's out?

@Aaronontheweb (Member Author)

@ismaelhamed I actually have an update on this for .NET 6! I'll publish after my morning meeting.

@Aaronontheweb (Member Author)

So I believe this issue was identified and resolved in May of this year (2022): dotnet/runtime#68881 - the fix was released in .NET runtime 6.0.6.

The original bug in .NET 6 basically caused thread pause / SpinWait to increase its spin iterations over time, making them more expensive until they could be re-sampled. It looks like the CosmosDb team ran into this bug as well.

After working with the CoreCLR team and producing some detailed GC / ThreadPool / CPU sampling metrics in PerfView, that issue was identified as the likely cause of the reproducible .NET 6 performance drop.

I upgraded my local environment to the latest versions of the .NET 6 SDK & runtime, and the performance regression is gone:

[image: RemotePingPong results after upgrading to the patched .NET 6 runtime]

Fix

Upgrade your .NET 6 runtimes to at least 6.0.6 - preferably go all the way to the latest version (6.0.11).
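To confirm which runtime a given process actually picked up, something like this works:

// Quick check of the runtime version the process is actually running on.
using System;
using System.Runtime.InteropServices;

class RuntimeVersionCheck
{
    static void Main()
    {
        Console.WriteLine(RuntimeInformation.FrameworkDescription); // e.g. ".NET 6.0.11"
        Console.WriteLine(Environment.Version);                     // should be >= 6.0.6
    }
}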
