-
Hello 👋 I recently discovered the CollectionsMarshal class and decided to write a simple benchmark using Dictionary to investigate the impact on performance. The benchmark is straightforward:
For example, below is the benchmark for both a class and a struct with a single int property:

    [Params(100_000)]
    public int Length { get; set; }

    public class Class1
    {
        public int IntProp1 { get; set; }
        public void DoWork(int i) => IntProp1 += i + i;
    }

    public struct Struct1
    {
        public int IntProp1 { get; set; }
        public void DoWork(int i) => IntProp1 += i + i;
    }

    Dictionary<int, Class1> _class1 = [];
    Dictionary<int, Struct1> _struct1 = [];

    // Copy the struct out, mutate the copy, write it back (two lookups per item).
    [Benchmark]
    public Dictionary<int, Struct1> Struct1()
    {
        for (int i = 0; i < Length; i++)
        {
            var obj = _struct1[i];
            obj.DoWork(i);
            _struct1[i] = obj;
        }
        return _struct1;
    }

    // Mutate the struct in place through a ref (one lookup per item).
    [Benchmark]
    public Dictionary<int, Struct1> Struct1CollectionsMarshal()
    {
        for (int i = 0; i < Length; i++)
        {
            ref Struct1 obj = ref CollectionsMarshal.GetValueRefOrNullRef(_struct1, i);
            obj.DoWork(i);
        }
        return _struct1;
    }

    // Reference type: the dictionary stores a reference, so no write-back is needed.
    [Benchmark]
    public Dictionary<int, Class1> Class1()
    {
        for (int i = 0; i < Length; i++)
        {
            _class1[i].DoWork(i);
        }
        return _class1;
    }

When I wrote the code, I expected the following performance order: structures using
My assumption was that this difference could be due to inlining, but I also verified the results with inlining disabled.
So, my question is, how is that possible? Why are plain structures, especially those with 25 int properties (which is 100 bytes), faster than classes?
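For readers who have not used the API yet, here is a minimal sketch of the two struct access patterns the benchmark compares. `Counter` and `RefAccessSketch` are hypothetical names chosen for illustration, not code from the benchmark project. The null-ref guard is an addition: the benchmark can skip it only because every key is known to exist.

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical stand-in for the Struct1 type from the post.
public struct Counter
{
    public int Value { get; set; }
    public void DoWork(int i) => Value += i + i;
}

public static class RefAccessSketch
{
    // Copy-out / write-back: two dictionary lookups and a struct copy each way.
    public static void UpdateByCopy(Dictionary<int, Counter> dict, int key)
    {
        var copy = dict[key];   // lookup 1: copies the struct out of the dictionary
        copy.DoWork(key);       // mutates only the local copy
        dict[key] = copy;       // lookup 2: copies the struct back in
    }

    // In-place update through a ref: one lookup, no struct copies.
    // Returns false when the key is missing; in that case the returned ref
    // is a null ref and must not be dereferenced.
    public static bool UpdateInPlace(Dictionary<int, Counter> dict, int key)
    {
        ref Counter slot = ref CollectionsMarshal.GetValueRefOrNullRef(dict, key);
        if (Unsafe.IsNullRef(ref slot))
            return false;
        slot.DoWork(key);
        return true;
    }
}
```

Note that the ref is only valid as long as no items are added to or removed from the dictionary while it is in use.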
-
Hmm, interesting. If I run the code from your report in my own console app project (with a single int property), the Struct1 benchmark is consistently and significantly slower than the Class1 benchmark. Example benchmark run:
and executing as x86 on the same CPU:
-
I also ran the speed-up-the-dictionary project you linked to. It took a while, but again the Struct1..25 benchmarks were much slower than the Class1..25 benchmarks. I wonder what is going on here. Could you perhaps compare your environment and CPU to what I posted?
(The Struct20 bench result looks to be spoiled. I guess some background activity or something got in the way...)
-
Seeing that you used two mobile CPUs that agreed with each other, while I used a desktop CPU, I also ran your benchmarks on a mobile CPU (it's a rather old and low-end CPU, but it's the only x64-based mobile CPU I had at hand). And my cheap, low-end, ancient mobile CPU agrees with your mobile CPUs, too. Mostly: Struct20 and Struct25 are slower than Class20 and Class25.

BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 10 (10.0.19045.4170/22H2/2022Update)
Intel Core i5-3210M CPU 2.50GHz (Ivy Bridge), 1 CPU, 4 logical and 2 physical cores
[Host] : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX
Job-HIQTOA : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX
Toolchain=.NET 8.0
| Method | Categories | Length | Mean | Error | StdDev | Code Size | Allocated |
|-------------------------------------- |------------------------------------ |------- |-----------:|---------:|---------:|----------:|----------:|
| Class1 | Class | 100000 | 2,337.3 us | 11.37 us | 10.64 us | 72 B | 2 B |
| Class5 | Class | 100000 | 2,658.7 us | 6.16 us | 4.81 us | 84 B | 2 B |
| Class10 | Class | 100000 | 3,248.2 us | 6.21 us | 5.50 us | 99 B | 2 B |
| Class15 | Class | 100000 | 3,777.9 us | 13.90 us | 10.85 us | 76 B | 2 B |
| Class20 | Class | 100000 | 3,987.8 us | 28.96 us | 24.18 us | 76 B | 3 B |
| Class25 | Class | 100000 | 4,416.4 us | 9.42 us | 8.81 us | 76 B | 1 B |
...
| Struct1 | Struct | 100000 | 1,445.4 us | 4.39 us | 3.66 us | 89 B | 1 B |
| Struct5 | Struct | 100000 | 1,799.8 us | 2.61 us | 2.32 us | 149 B | - |
| Struct10 | Struct | 100000 | 2,315.2 us | 2.74 us | 2.43 us | 229 B | - |
| Struct15 | Struct | 100000 | 2,867.9 us | 9.41 us | 8.81 us | 346 B | 2 B |
| Struct20 | Struct | 100000 | 4,507.9 us | 2.51 us | 1.96 us | 317 B | 3 B |
| Struct25 | Struct | 100000 | 4,842.2 us | 9.78 us | 8.67 us | 360 B | 3 B |

It is most curious that the results are flipped on my 7800X3D compared to three different mobile CPUs (with the exception of Struct20/Class20 and Struct25/Class25 on my ancient mobile CPU). I don't really know what is going on here. But could this perhaps be related to the cache sizes or caching strategies employed by the CPUs? Perhaps the JIT compiler produces native code for the struct-based benchmarks that is better suited to small-ish caches, but that advantage fades as the structs grow on CPUs with rather small caches (the i5-3210M I used has only 3 MB of L3 cache), and it also brings no benefit over the code generated for the class-based benchmarks on CPUs with comparatively large caches?

Just to be clear: I am only speculating here, taking shots in the dark. I am no expert in how different CPU architectures behave when executing the same code on the same data. I might very well be barking up the wrong tree, and the cause of the diverging results between the mobile CPUs and the 7800X3D may be something entirely different.
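As a rough back-of-the-envelope check on the cache-size idea, the raw value payload for `Length = 100000` can be computed directly. This is only a sketch: `Payload25` and `PayloadSizeSketch` are hypothetical names, `Payload25` stands in for Struct25 (25 ints), and the figure deliberately ignores the per-entry bookkeeping that `Dictionary` stores next to each value as well as object headers for class instances, so real working sets are larger.

```csharp
using System.Runtime.CompilerServices;

// Hypothetical stand-in for Struct25: 25 ints laid out sequentially.
public struct Payload25
{
    public int P01, P02, P03, P04, P05, P06, P07, P08, P09, P10,
               P11, P12, P13, P14, P15, P16, P17, P18, P19, P20,
               P21, P22, P23, P24, P25;
}

public static class PayloadSizeSketch
{
    // Raw bytes occupied by `count` values of T, ignoring all container overhead.
    public static long PayloadBytes<T>(int count) where T : struct
        => (long)Unsafe.SizeOf<T>() * count;
}
```

`PayloadSizeSketch.PayloadBytes<int>(100_000)` gives 400,000 bytes for a single-int struct, while `PayloadSizeSketch.PayloadBytes<Payload25>(100_000)` gives 10,000,000 bytes, which is already more than the 3 MB of L3 cache on the i5-3210M. Whether that explains the measurements is a separate question; the numbers only show the scale involved.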
-
It should certainly be related to memory locality. You can try to randomize the objects allocated:

    Class1[] objects = new Class1[Length];
    for (int i = 0; i < Length; i++)
        objects[i] = new Class1();
    Random.Shared.Shuffle(objects); // Randomize which object ends up at which index
    for (int i = 0; i < Length; i++)
        dict1[i] = objects[i]; // Still keep the dictionary keys sequential

I'd expect the performance of the class version to be impacted more in this case.
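The suggestion above can be wrapped into a small self-contained sketch so that the sequential and the shuffled layout are built by the same code path. `Item` and `LocalitySketch` are hypothetical names; `Item` stands in for Class1. This makes no claim about the resulting timings, it only sets up the two layouts for comparison.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the Class1 type from the benchmark.
public sealed class Item
{
    public int Value { get; set; }
    public void DoWork(int i) => Value += i + i;
}

public static class LocalitySketch
{
    // Builds a dictionary with sequential keys. When shuffle is true, the objects
    // are permuted first, so key order no longer matches allocation order.
    public static Dictionary<int, Item> Build(int length, bool shuffle)
    {
        var objects = new Item[length];
        for (int i = 0; i < length; i++)
            objects[i] = new Item();

        if (shuffle)
            Random.Shared.Shuffle(objects); // Random.Shuffle requires .NET 8 or later

        var dict = new Dictionary<int, Item>(length);
        for (int i = 0; i < length; i++)
            dict[i] = objects[i];
        return dict;
    }

    // Same access pattern as the Class1 benchmark: visit keys 0..Count-1 in order.
    // Returns the sum of all values so the work cannot be optimized away.
    public static long Touch(Dictionary<int, Item> dict)
    {
        long sum = 0;
        for (int i = 0; i < dict.Count; i++)
        {
            Item item = dict[i];
            item.DoWork(i);
            sum += item.Value;
        }
        return sum;
    }
}
```

To compare the two layouts, call `Touch` on `Build(Length, false)` and on `Build(Length, true)` from two separate `[Benchmark]` methods, building the dictionaries in the setup step so that only the access loop is measured.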
-
I've found the root cause of the problem. @elgonzo @huoyaoyuan you both were right - it is related to data locality and cache. :) I added hardware counters to the benchmark to see cache misses. Here are the results that I got:
Then I started wondering why there were so many cache misses for classes, and I realized that I was creating the instances of all the classes interleaved in a single loop, so instances of the same class were not allocated next to each other. This resulted in poor data locality and cache misses.

    for (var i = 0; i < Length; i++)
    {
        _class1.Add(i, new Class1());
        _class5.Add(i, new Class5());
        _class10.Add(i, new Class10());
        _class15.Add(i, new Class15());
        _class20.Add(i, new Class20());
        _class25.Add(i, new Class25());
        _struct1.Add(i, new Struct1());
        _struct5.Add(i, new Struct5());
        _struct10.Add(i, new Struct10());
        _struct15.Add(i, new Struct15());
        _struct20.Add(i, new Struct20());
        _struct25.Add(i, new Struct25());
    }

After fixing the code (looks awful, but anyway), the results look reliable.

    for (var i = 0; i < Length; i++) _class1.Add(i, new Class1());
    for (var i = 0; i < Length; i++) _class5.Add(i, new Class5());
    for (var i = 0; i < Length; i++) _class10.Add(i, new Class10());
    for (var i = 0; i < Length; i++) _class15.Add(i, new Class15());
    for (var i = 0; i < Length; i++) _class20.Add(i, new Class20());
    for (var i = 0; i < Length; i++) _class25.Add(i, new Class25());
    for (var i = 0; i < Length; i++) _struct1.Add(i, new Struct1());
    for (var i = 0; i < Length; i++) _struct5.Add(i, new Struct5());
    for (var i = 0; i < Length; i++) _struct10.Add(i, new Struct10());
    for (var i = 0; i < Length; i++) _struct15.Add(i, new Struct15());
    for (var i = 0; i < Length; i++) _struct20.Add(i, new Struct20());
    for (var i = 0; i < Length; i++) _struct25.Add(i, new Struct25());

Now, classes are faster than structs as they should be. Thank you @elgonzo @huoyaoyuan. :)