Why is Dictionary faster with structs than with classes? #100306

alexeyfv · 2024-03-26T19:53:57Z

Hello 👋

I recently discovered the CollectionsMarshal class and decided to write a simple benchmark using Dictionary to investigate the impact on performance.

The benchmark is straightforward:

1. Extract an object from the dictionary by key.
2. Execute DoWork to change the object.
3. Save the modified object to the dictionary.

For example, below is the benchmark for both a class and a struct with a single int property. I also created similar benchmarks for classes and structures up to 100 bytes. The code of the benchmark as well as the results can be found here.

[Params(100_000)]
public int Length { get; set; }

public class Class1
{
    public int IntProp1 { get; set; }
    public void DoWork(int i) => IntProp1 += i + i;
}

public struct Struct1
{
    public int IntProp1 { get; set; }
    public void DoWork(int i) => IntProp1 += i + i;
}

Dictionary<int, Class1> _class1 = [];
Dictionary<int, Struct1> _struct1 = [];

[Benchmark]
public Dictionary<int, Struct1> Struct1()
{
    for (int i = 0; i < Length; i++)
    {
        var obj = _struct1[i];
        obj.DoWork(i);
        _struct1[i] = obj;
    }

    return _struct1;
}

[Benchmark]
public Dictionary<int, Struct1> Struct1ColllectionsMarshal()
{
    for (int i = 0; i < Length; i++)
    {
        ref Struct1 obj = ref CollectionsMarshal.GetValueRefOrNullRef(_struct1, i);
        obj.DoWork(i);
    }

    return _struct1;
}

[Benchmark]
public Dictionary<int, Class1> Class1()
{
    for (int i = 0; i < Length; i++)
    {
        _class1[i].DoWork(i);
    }

    return _class1;
}

When I wrote the code, I expected the following performance order: structures using CollectionsMarshal would be the fastest, classes in 2nd place, and plain structures the slowest. However, the results are different:

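| No. of int props | Class      | Struct     | Struct + CollectionsMarshal |
|-----------------:|-----------:|-----------:|----------------------------:|
|                1 | 1,965.8 μs | 1,293.2 μs |                    867.3 μs |
|                5 | 2,129.9 μs | 1,502.9 μs |                    825.4 μs |
|               10 | 2,564.8 μs | 1,577.0 μs |                  1,132.3 μs |
|               15 | 3,906.2 μs | 2,439.7 μs |                  1,257.4 μs |
|               20 | 4,388.8 μs | 3,087.8 μs |                  1,539.7 μs |
|               25 | 4,664.4 μs | 3,415.5 μs |                  1,706.0 μs |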

My first assumption was that the difference was due to inlining, so I also ran the struct benchmarks with inlining disabled. Structs are still faster than classes:

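| No. of int props | Class      | Struct NoInlining |
|-----------------:|-----------:|------------------:|
|                1 | 1,965.8 μs |        1,268.5 μs |
|                5 | 2,129.9 μs |        1,825.5 μs |
|               10 | 2,564.8 μs |        2,261.0 μs |
|               15 | 3,906.2 μs |        3,142.6 μs |
|               20 | 4,388.8 μs |        3,122.1 μs |
|               25 | 4,664.4 μs |        3,402.6 μs |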

So, my question is, how is that possible? Why are plain structures, especially those with 25 int properties (which is 100 bytes), faster than classes?
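
For readers who want to try the two struct access patterns outside of BenchmarkDotNet, here is a minimal sketch. `Counter` and `AccessPatterns` are hypothetical names used only for illustration, standing in for `Struct1` and the benchmark class. Unlike the benchmark above, the in-place variant checks for a null ref, which matters whenever a key might be missing.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical stand-in for Struct1 from the question.
public struct Counter
{
    public int Value;
    public void DoWork(int i) => Value += i + i;
}

public static class AccessPatterns
{
    // Plain struct pattern: copy the value out, mutate the copy, store it back.
    // This performs two dictionary lookups and two struct copies.
    public static void CopyModifyStore(Dictionary<int, Counter> dict, int key)
    {
        var item = dict[key];
        item.DoWork(key);
        dict[key] = item;
    }

    // CollectionsMarshal pattern: mutate the stored value through a ref.
    // One lookup, no copies. Returns false if the key is not present.
    public static bool ModifyInPlace(Dictionary<int, Counter> dict, int key)
    {
        ref Counter item = ref CollectionsMarshal.GetValueRefOrNullRef(dict, key);
        if (Unsafe.IsNullRef(ref item))
            return false;
        item.DoWork(key);
        return true;
    }
}
```

The ref returned by `GetValueRefOrNullRef` should not be used after items are added to or removed from the dictionary, so it is best kept local to a short scope as shown.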

elgonzo · 2024-03-26T20:49:28Z

Hmm, interesting. If I execute the code from your report in my own console app project (with a single int property), the Struct1 benchmark is consistently and significantly slower than the Class1 benchmark. Example benchmark run:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3296/23H2/2023Update/SunValley3)
AMD Ryzen 7 7800X3D, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.101
  [Host]     : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method                     | Length | Mean     | Error   | StdDev  |
|--------------------------- |------- |---------:|--------:|--------:|
| Struct1                    | 100000 | 578.8 us | 1.27 us | 1.19 us |
| Struct1ColllectionsMarshal | 100000 | 292.4 us | 2.20 us | 2.06 us |
| Class1                     | 100000 | 306.2 us | 0.43 us | 0.34 us |

and executing as x86 on the same CPU:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3296/23H2/2023Update/SunValley3)
AMD Ryzen 7 7800X3D, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.101
  [Host] : .NET 8.0.1 (8.0.123.58001), X86 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Job=MediumRun  Toolchain=InProcessNoEmitToolchain  IterationCount=15
LaunchCount=2  WarmupCount=10

| Method                     | Length | Mean     | Error   | StdDev   |
|--------------------------- |------- |---------:|--------:|---------:|
| Struct1                    | 100000 | 996.4 us | 7.35 us | 10.78 us |
| Struct1ColllectionsMarshal | 100000 | 400.7 us | 3.63 us |  5.43 us |
| Class1                     | 100000 | 435.0 us | 1.89 us |  2.83 us |

elgonzo · 2024-03-26T21:14:40Z

I also ran the speed-up-the-dictionary project you linked to. It took a while, but again the Struct1..25 benchmarks were much slower than the Class1..25 benchmarks.

I wonder what is going on here. Could you perhaps compare your environment and CPU to what I posted?

BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 11 (10.0.22631.3296)
AMD Ryzen 7 7800X3D, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.101
  [Host]     : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2


| Method                                | Categories                          | Length | Mean       | Error    | StdDev   | Code Size | Allocated |
|-------------------------------------- |------------------------------------ |------- |-----------:|---------:|---------:|----------:|----------:|
| Class1                                | Class                               | 100000 |   377.4 us |  3.79 us |  3.55 us |      72 B |         - |
| Class5                                | Class                               | 100000 |   453.0 us |  3.99 us |  3.73 us |      84 B |         - |
| Class10                               | Class                               | 100000 |   517.1 us |  0.78 us |  0.65 us |      99 B |         - |
| Class15                               | Class                               | 100000 |   646.7 us |  5.99 us |  5.61 us |      76 B |         - |
| Class20                               | Class                               | 100000 |   814.2 us |  1.38 us |  1.15 us |      76 B |         - |
| Class25                               | Class                               | 100000 | 1,027.6 us |  2.66 us |  2.36 us |      76 B |       1 B |
| ClassCollectionsMarshall1             | ClassCollectionsMarshall            | 100000 |   378.6 us |  0.56 us |  0.53 us |      58 B |         - |
| ClassCollectionsMarshall5             | ClassCollectionsMarshall            | 100000 |   428.9 us |  3.91 us |  3.66 us |      70 B |         - |
| ClassCollectionsMarshall10            | ClassCollectionsMarshall            | 100000 |   516.9 us |  0.93 us |  0.77 us |      85 B |         - |
| ClassCollectionsMarshall15            | ClassCollectionsMarshall            | 100000 |   646.7 us |  1.21 us |  1.14 us |      62 B |         - |
| ClassCollectionsMarshall20            | ClassCollectionsMarshall            | 100000 |   808.0 us |  2.92 us |  2.44 us |      62 B |         - |
| ClassCollectionsMarshall25            | ClassCollectionsMarshall            | 100000 | 1,017.5 us |  1.81 us |  1.70 us |      62 B |       1 B |
| ClassCollectionsMarshall1NoInlining   | ClassCollectionsMarshallNoInlining  | 100000 |   421.5 us |  0.61 us |  0.57 us |      62 B |         - |
| ClassCollectionsMarshall5NoInlining   | ClassCollectionsMarshallNoInlining  | 100000 |   470.3 us |  4.05 us |  3.79 us |      62 B |         - |
| ClassCollectionsMarshall10NoInlining  | ClassCollectionsMarshallNoInlining  | 100000 |   556.1 us |  3.13 us |  2.93 us |      62 B |         - |
| ClassCollectionsMarshall15NoInlining  | ClassCollectionsMarshallNoInlining  | 100000 |   637.0 us |  1.18 us |  0.92 us |      62 B |         - |
| ClassCollectionsMarshall20NoInlining  | ClassCollectionsMarshallNoInlining  | 100000 |   802.6 us |  2.05 us |  1.92 us |      62 B |         - |
| ClassCollectionsMarshall25NoInlining  | ClassCollectionsMarshallNoInlining  | 100000 | 1,024.0 us |  8.06 us |  7.54 us |      62 B |       1 B |
| Class1NoInlining                      | ClassNoInlining                     | 100000 |   431.9 us |  0.66 us |  0.61 us |      76 B |         - |
| Class5NoInlining                      | ClassNoInlining                     | 100000 |   472.2 us |  4.77 us |  4.46 us |      76 B |         - |
| Class10NoInlining                     | ClassNoInlining                     | 100000 |   579.0 us |  1.51 us |  1.34 us |      76 B |         - |
| Class15NoInlining                     | ClassNoInlining                     | 100000 |   637.7 us |  0.85 us |  0.71 us |      76 B |         - |
| Class20NoInlining                     | ClassNoInlining                     | 100000 |   817.2 us |  9.42 us |  8.81 us |      76 B |         - |
| Class25NoInlining                     | ClassNoInlining                     | 100000 | 1,035.4 us |  8.06 us |  7.54 us |      76 B |       1 B |
| Struct1                               | Struct                              | 100000 |   585.2 us |  2.85 us |  2.67 us |      89 B |         - |
| Struct5                               | Struct                              | 100000 |   688.2 us |  3.90 us |  3.65 us |     149 B |         - |
| Struct10                              | Struct                              | 100000 |   846.2 us |  4.31 us |  4.03 us |     229 B |         - |
| Struct15                              | Struct                              | 100000 | 1,053.2 us |  2.31 us |  2.05 us |     346 B |       1 B |
| Struct20                              | Struct                              | 100000 | 3,113.9 us | 42.55 us | 39.81 us |     291 B |       2 B |
| Struct25                              | Struct                              | 100000 | 2,331.3 us |  6.82 us |  6.38 us |     306 B |       2 B |
| Struct1ColllectionsMarshal            | StructCollectionsMarshall           | 100000 |   293.2 us |  2.05 us |  1.92 us |      54 B |         - |
| Struct5ColllectionsMarshal            | StructCollectionsMarshall           | 100000 |   356.1 us |  1.86 us |  1.74 us |      66 B |         - |
| Struct10ColllectionsMarshal           | StructCollectionsMarshall           | 100000 |   433.5 us |  1.79 us |  1.68 us |      81 B |         - |
| Struct15ColllectionsMarshal           | StructCollectionsMarshall           | 100000 |   495.6 us |  1.61 us |  1.43 us |      96 B |         - |
| Struct20ColllectionsMarshal           | StructCollectionsMarshall           | 100000 |   628.3 us |  1.61 us |  1.51 us |      60 B |         - |
| Struct25ColllectionsMarshal           | StructCollectionsMarshall           | 100000 |   766.3 us |  1.51 us |  1.41 us |      60 B |         - |
| Struct1ColllectionsMarshalNoInlining  | StructCollectionsMarshallNoInlining | 100000 |   323.1 us |  0.68 us |  0.64 us |      60 B |         - |
| Struct5ColllectionsMarshalNoInlining  | StructCollectionsMarshallNoInlining | 100000 |   377.5 us |  1.93 us |  1.80 us |      60 B |         - |
| Struct10ColllectionsMarshalNoInlining | StructCollectionsMarshallNoInlining | 100000 |   467.9 us |  2.64 us |  2.47 us |      60 B |         - |
| Struct15ColllectionsMarshalNoInlining | StructCollectionsMarshallNoInlining | 100000 |   517.2 us |  1.48 us |  1.39 us |      60 B |         - |
| Struct20ColllectionsMarshalNoInlining | StructCollectionsMarshallNoInlining | 100000 |   628.1 us |  2.23 us |  2.09 us |      60 B |         - |
| Struct25ColllectionsMarshalNoInlining | StructCollectionsMarshallNoInlining | 100000 |   767.2 us |  1.67 us |  1.56 us |      60 B |         - |
| Struct1NoInlining                     | StructNoInlining                    | 100000 |   647.4 us |  1.21 us |  1.13 us |     114 B |         - |
| Struct5NoInlining                     | StructNoInlining                    | 100000 |   905.9 us |  4.71 us |  4.40 us |     180 B |         - |
| Struct10NoInlining                    | StructNoInlining                    | 100000 | 1,047.6 us |  2.14 us |  2.00 us |     220 B |       1 B |
| Struct15NoInlining                    | StructNoInlining                    | 100000 | 1,903.9 us |  5.29 us |  4.95 us |     261 B |       1 B |
| Struct20NoInlining                    | StructNoInlining                    | 100000 | 2,018.7 us |  7.45 us |  6.60 us |     291 B |       2 B |
| Struct25NoInlining                    | StructNoInlining                    | 100000 | 2,343.9 us |  8.58 us |  7.17 us |     306 B |       2 B |

(The Struct20 bench result looks to be spoiled. I guess some background activity or something got in the way...)

2 replies

alexeyfv Mar 27, 2024
Author

Yes, sure. Here are the results. I'm using a different CPU and .NET version. I'll try updating .NET and then rerun the benchmark.

BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 11 (10.0.22631.3296)
AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx, 1 CPU, 8 logical and 4 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2

| Method                                | Categories                          | Length | Mean       | Error    | StdDev   | Code Size | Allocated |
|-------------------------------------- |------------------------------------ |------- |-----------:|---------:|---------:|----------:|----------:|
| Class1                                | Class                               | 100000 | 1,965.8 μs |  5.86 μs |  5.48 μs |      72 B |       2 B |
| Class5                                | Class                               | 100000 | 2,129.9 μs |  3.43 μs |  2.86 μs |      84 B |       2 B |
| Class10                               | Class                               | 100000 | 2,564.8 μs | 10.76 μs |  8.98 μs |      99 B |       2 B |
| Class15                               | Class                               | 100000 | 3,906.2 μs |  4.08 μs |  3.81 μs |      76 B |       3 B |
| Class20                               | Class                               | 100000 | 4,388.8 μs |  2.37 μs |  2.22 μs |      76 B |       3 B |
| Class25                               | Class                               | 100000 | 4,664.4 μs |  4.73 μs |  4.42 μs |      76 B |       3 B |
| ClassCollectionsMarshall1             | ClassCollectionsMarshall            | 100000 | 1,894.8 μs |  2.52 μs |  2.36 μs |      58 B |       2 B |
| ClassCollectionsMarshall5             | ClassCollectionsMarshall            | 100000 | 2,116.2 μs |  4.29 μs |  3.35 μs |      70 B |       2 B |
| ClassCollectionsMarshall10            | ClassCollectionsMarshall            | 100000 | 2,569.4 μs |  2.36 μs |  1.84 μs |      85 B |       2 B |
| ClassCollectionsMarshall15            | ClassCollectionsMarshall            | 100000 | 3,770.6 μs |  4.28 μs |  3.58 μs |      62 B |       2 B |
| ClassCollectionsMarshall20            | ClassCollectionsMarshall            | 100000 | 4,188.3 μs | 32.02 μs | 28.39 μs |      62 B |       3 B |
| ClassCollectionsMarshall25            | ClassCollectionsMarshall            | 100000 | 4,691.6 μs |  5.27 μs |  4.67 μs |      62 B |       3 B |
| ClassCollectionsMarshall1NoInlining   | ClassCollectionsMarshallNoInlining  | 100000 | 1,927.9 μs |  9.36 μs |  8.30 μs |      62 B |       2 B |
| ClassCollectionsMarshall5NoInlining   | ClassCollectionsMarshallNoInlining  | 100000 | 2,608.4 μs | 11.49 μs | 10.18 μs |      62 B |       2 B |
| ClassCollectionsMarshall10NoInlining  | ClassCollectionsMarshallNoInlining  | 100000 | 2,859.6 μs |  3.16 μs |  2.47 μs |      62 B |       2 B |
| ClassCollectionsMarshall15NoInlining  | ClassCollectionsMarshallNoInlining  | 100000 | 3,824.1 μs |  7.08 μs |  5.92 μs |      62 B |       2 B |
| ClassCollectionsMarshall20NoInlining  | ClassCollectionsMarshallNoInlining  | 100000 | 4,160.8 μs | 14.92 μs | 13.23 μs |      62 B |       3 B |
| ClassCollectionsMarshall25NoInlining  | ClassCollectionsMarshallNoInlining  | 100000 | 4,669.4 μs | 11.98 μs | 10.00 μs |      62 B |       3 B |
| Class1NoInlining                      | ClassNoInlining                     | 100000 | 1,922.6 μs |  1.97 μs |  1.75 μs |      76 B |       2 B |
| Class5NoInlining                      | ClassNoInlining                     | 100000 | 2,579.2 μs | 24.97 μs | 23.36 μs |      76 B |       2 B |
| Class10NoInlining                     | ClassNoInlining                     | 100000 | 3,086.8 μs |  3.21 μs |  2.68 μs |      76 B |       2 B |
| Class15NoInlining                     | ClassNoInlining                     | 100000 | 3,930.5 μs |  9.01 μs |  7.53 μs |      76 B |       3 B |
| Class20NoInlining                     | ClassNoInlining                     | 100000 | 4,410.6 μs |  3.82 μs |  3.57 μs |      76 B |       3 B |
| Class25NoInlining                     | ClassNoInlining                     | 100000 | 4,756.2 μs |  3.84 μs |  3.40 μs |      76 B |       3 B |
| Struct1                               | Struct                              | 100000 | 1,293.2 μs | 24.72 μs | 23.12 μs |      89 B |       1 B |
| Struct5                               | Struct                              | 100000 | 1,502.9 μs | 27.21 μs | 25.45 μs |     149 B |       1 B |
| Struct10                              | Struct                              | 100000 | 1,577.0 μs |  6.96 μs |  6.51 μs |     229 B |       1 B |
| Struct15                              | Struct                              | 100000 | 2,439.7 μs | 16.20 μs | 15.15 μs |     346 B |       2 B |
| Struct20                              | Struct                              | 100000 | 3,087.8 μs | 32.46 μs | 27.11 μs |     317 B |       2 B |
| Struct25                              | Struct                              | 100000 | 3,415.5 μs | 27.10 μs | 25.35 μs |     360 B |       2 B |
| Struct1ColllectionsMarshal            | StructCollectionsMarshall           | 100000 |   867.3 μs |  6.38 μs |  5.96 μs |      54 B |         - |
| Struct5ColllectionsMarshal            | StructCollectionsMarshall           | 100000 |   825.4 μs |  0.94 μs |  0.88 μs |      66 B |         - |
| Struct10ColllectionsMarshal           | StructCollectionsMarshall           | 100000 | 1,132.3 μs |  3.42 μs |  3.20 μs |      81 B |       1 B |
| Struct15ColllectionsMarshal           | StructCollectionsMarshall           | 100000 | 1,257.4 μs |  1.16 μs |  0.97 μs |      96 B |       1 B |
| Struct20ColllectionsMarshal           | StructCollectionsMarshall           | 100000 | 1,539.7 μs | 22.26 μs | 20.82 μs |      60 B |       1 B |
| Struct25ColllectionsMarshal           | StructCollectionsMarshall           | 100000 | 1,706.0 μs |  4.31 μs |  3.60 μs |      60 B |       1 B |
| Struct1ColllectionsMarshalNoInlining  | StructCollectionsMarshallNoInlining | 100000 |   772.4 μs |  1.49 μs |  1.40 μs |      60 B |         - |
| Struct5ColllectionsMarshalNoInlining  | StructCollectionsMarshallNoInlining | 100000 |   922.0 μs |  1.57 μs |  1.47 μs |      60 B |         - |
| Struct10ColllectionsMarshalNoInlining | StructCollectionsMarshallNoInlining | 100000 | 1,179.4 μs |  1.32 μs |  1.17 μs |      60 B |       1 B |
| Struct15ColllectionsMarshalNoInlining | StructCollectionsMarshallNoInlining | 100000 | 1,328.8 μs |  7.39 μs |  6.55 μs |      60 B |       1 B |
| Struct20ColllectionsMarshalNoInlining | StructCollectionsMarshallNoInlining | 100000 | 1,541.5 μs |  5.28 μs |  4.68 μs |      60 B |       1 B |
| Struct25ColllectionsMarshalNoInlining | StructCollectionsMarshallNoInlining | 100000 | 1,738.5 μs |  1.71 μs |  1.51 μs |      60 B |       1 B |
| Struct1NoInlining                     | StructNoInlining                    | 100000 | 1,268.5 μs | 25.01 μs | 44.45 μs |     114 B |       1 B |
| Struct5NoInlining                     | StructNoInlining                    | 100000 | 1,825.5 μs |  3.43 μs |  2.86 μs |     180 B |       1 B |
| Struct10NoInlining                    | StructNoInlining                    | 100000 | 2,261.0 μs |  4.70 μs |  3.92 μs |     220 B |       2 B |
| Struct15NoInlining                    | StructNoInlining                    | 100000 | 3,142.6 μs |  7.57 μs |  6.71 μs |     261 B |       2 B |
| Struct20NoInlining                    | StructNoInlining                    | 100000 | 3,122.1 μs |  3.64 μs |  3.23 μs |     317 B |       2 B |
| Struct25NoInlining                    | StructNoInlining                    | 100000 | 3,402.6 μs | 26.94 μs | 25.20 μs |     360 B |       2 B |

alexeyfv Mar 27, 2024
Author

I ran the benchmark on my work laptop. It has the same .NET version, but different OS and CPU.

BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 10 (10.0.19045.4170/22H2/2022Update)
12th Gen Intel Core i7-12800H, 1 CPU, 20 logical and 14 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2

| Method                                | Categories                          | Length | Mean       | Error    | StdDev    | Median     | Code Size | Allocated |
|-------------------------------------- |------------------------------------ |------- |-----------:|---------:|----------:|-----------:|----------:|----------:|
| Class1                                | Class                               | 100000 |   592.0 μs | 12.05 μs |  35.35 μs |   603.2 μs |      72 B |         - |
| Class5                                | Class                               | 100000 |   924.5 μs | 32.03 μs |  93.44 μs |   930.3 μs |      84 B |         - |
| Class10                               | Class                               | 100000 | 1,029.1 μs | 36.13 μs | 105.38 μs |   972.5 μs |      99 B |       1 B |
| Class15                               | Class                               | 100000 | 1,154.1 μs | 11.50 μs |  10.20 μs | 1,149.4 μs |      76 B |       1 B |
| Class20                               | Class                               | 100000 | 1,292.4 μs | 18.63 μs |  16.51 μs | 1,284.6 μs |      76 B |       1 B |
| Class25                               | Class                               | 100000 | 1,455.6 μs | 20.26 μs |  17.96 μs | 1,446.7 μs |      76 B |       1 B |
| Struct1                               | Struct                              | 100000 |   520.4 μs |  8.84 μs |   8.27 μs |   518.4 μs |      89 B |         - |
| Struct5                               | Struct                              | 100000 |   693.4 μs | 10.30 μs |   9.63 μs |   691.2 μs |     149 B |         - |
| Struct10                              | Struct                              | 100000 |   808.2 μs |  7.71 μs |   7.21 μs |   807.2 μs |     229 B |         - |
| Struct15                              | Struct                              | 100000 | 1,017.5 μs |  4.44 μs |   4.16 μs | 1,016.7 μs |     346 B |       1 B |
| Struct20                              | Struct                              | 100000 | 1,454.1 μs |  4.69 μs |   3.66 μs | 1,455.1 μs |     317 B |       1 B |
| Struct25                              | Struct                              | 100000 | 1,771.5 μs | 23.33 μs |  21.82 μs | 1,772.7 μs |     360 B |       1 B |
| Struct1ColllectionsMarshal            | StructCollectionsMarshall           | 100000 |   249.0 μs |  2.77 μs |   2.59 μs |   250.1 μs |      54 B |         - |
| Struct5ColllectionsMarshal            | StructCollectionsMarshall           | 100000 |   340.1 μs |  0.64 μs |   0.54 μs |   340.1 μs |      66 B |         - |
| Struct10ColllectionsMarshal           | StructCollectionsMarshall           | 100000 |   465.6 μs |  1.14 μs |   1.01 μs |   465.3 μs |      81 B |         - |
| Struct15ColllectionsMarshal           | StructCollectionsMarshall           | 100000 |   581.2 μs | 11.37 μs |   8.88 μs |   581.7 μs |      96 B |         - |
| Struct20ColllectionsMarshal           | StructCollectionsMarshall           | 100000 |   774.5 μs | 15.17 μs |  25.35 μs |   769.7 μs |      60 B |         - |
| Struct25ColllectionsMarshal           | StructCollectionsMarshall           | 100000 |   948.3 μs | 18.19 μs |  17.87 μs |   951.6 μs |      60 B |         - |
| Struct1NoInlining                     | StructNoInlining                    | 100000 |   588.0 μs |  4.15 μs |   3.24 μs |   588.0 μs |     114 B |         - |
| Struct5NoInlining                     | StructNoInlining                    | 100000 |   931.3 μs | 13.45 μs |  12.58 μs |   932.3 μs |     180 B |         - |
| Struct10NoInlining                    | StructNoInlining                    | 100000 | 1,038.9 μs |  9.29 μs |   8.23 μs | 1,039.7 μs |     220 B |       1 B |
| Struct15NoInlining                    | StructNoInlining                    | 100000 | 2,049.9 μs |  5.27 μs |   4.11 μs | 2,049.6 μs |     261 B |       1 B |
| Struct20NoInlining                    | StructNoInlining                    | 100000 | 1,461.7 μs |  5.24 μs |   4.65 μs | 1,462.7 μs |     317 B |       1 B |
| Struct25NoInlining                    | StructNoInlining                    | 100000 | 1,777.3 μs |  9.14 μs |   8.10 μs | 1,779.4 μs |     360 B |       1 B |

elgonzo · 2024-03-27T14:12:09Z

Seeing that you used two mobile CPUs that agreed with each other, while I used a desktop CPU, I also ran your benchmarks on a mobile CPU (it's a rather old and low-end CPU, but it's the only x64-based mobile CPU I had at hand).

Well, my cheap, low-end, ancient mobile CPU agrees with your mobile CPUs too. Mostly: Struct20 and Struct25 are slower than Class20 and Class25.

BenchmarkDotNet v0.13.9+228a464e8be6c580ad9408e98f18813f6407fb5a, Windows 10 (10.0.19045.4170/22H2/2022Update)
Intel Core i5-3210M CPU 2.50GHz (Ivy Bridge), 1 CPU, 4 logical and 2 physical cores
  [Host]     : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX
  Job-HIQTOA : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX

Toolchain=.NET 8.0

| Method                                | Categories                          | Length | Mean       | Error    | StdDev   | Code Size | Allocated |
|-------------------------------------- |------------------------------------ |------- |-----------:|---------:|---------:|----------:|----------:|
| Class1                                | Class                               | 100000 | 2,337.3 us | 11.37 us | 10.64 us |      72 B |       2 B |
| Class5                                | Class                               | 100000 | 2,658.7 us |  6.16 us |  4.81 us |      84 B |       2 B |
| Class10                               | Class                               | 100000 | 3,248.2 us |  6.21 us |  5.50 us |      99 B |       2 B |
| Class15                               | Class                               | 100000 | 3,777.9 us | 13.90 us | 10.85 us |      76 B |       2 B |
| Class20                               | Class                               | 100000 | 3,987.8 us | 28.96 us | 24.18 us |      76 B |       3 B |
| Class25                               | Class                               | 100000 | 4,416.4 us |  9.42 us |  8.81 us |      76 B |       1 B |
...
| Struct1                               | Struct                              | 100000 | 1,445.4 us |  4.39 us |  3.66 us |      89 B |       1 B |
| Struct5                               | Struct                              | 100000 | 1,799.8 us |  2.61 us |  2.32 us |     149 B |         - |
| Struct10                              | Struct                              | 100000 | 2,315.2 us |  2.74 us |  2.43 us |     229 B |         - |
| Struct15                              | Struct                              | 100000 | 2,867.9 us |  9.41 us |  8.81 us |     346 B |       2 B |
| Struct20                              | Struct                              | 100000 | 4,507.9 us |  2.51 us |  1.96 us |     317 B |       3 B |
| Struct25                              | Struct                              | 100000 | 4,842.2 us |  9.78 us |  8.67 us |     360 B |       3 B |

It is most curious that the results are flipped on my 7800X3D compared to three different mobile CPUs (with the exception of Struct20/Class20 and Struct25/Class25 on my ancient mobile CPU).

I don't really know what is going on here. Could this be related to the cache sizes or caching strategies of the CPUs? Perhaps the JIT compiler produces native code for the struct-based benchmarks that suits small-ish caches, but that advantage shrinks as the structs grow on CPUs with rather small caches (the i5-3210M I used has only 3 MB of L3 cache), and it brings no advantage over the code generated for the class-based benchmarks on CPUs with comparatively large caches.

Another thought: if the dictionary uses array-like structures (contiguous memory segments) to store the values, then accessing it in the order the items were added, or with linearly increasing keys, might be more conducive to cache prefetching and so reduce the time spent on cache misses. That would benefit value types, since no additional dereference is needed to reach the actual instance... (Scratch that, randomizing the dictionary accesses doesn't change the outcome.)

Just to be clear: I am only speculating here, taking blind shots in the dark. I am no expert in the intricate details of how different CPU architectures behave while executing the same code on the same data. I might well be barking up the wrong tree, with my speculations being entirely misguided and the cause of the diverging results between the mobile CPUs and the 7800X3D being something else entirely.

huoyaoyuan · 2024-03-27T17:08:19Z

Collaborator

It is almost certainly related to memory locality. You can try randomizing the order of the allocated objects:

Class1[] objects = new Class1[Length];
for (int i = 0; i < Length; i++)
    objects[i] = new Class1();
Random.Shared.Shuffle(objects); // Shuffle so consecutive indexes no longer point to adjacent heap objects
for (int i = 0; i < Length; i++)
    dict1[i] = objects[i]; // Keys in the dictionary stay sequential

I'd expect the performance of class to be more impacted for this case.
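
A self-contained version of this experiment might look like the sketch below. `Item` and `LocalityDemo` are hypothetical names, the fixed seed is an arbitrary choice, and the `Stopwatch` timing is only a rough indication, not a substitute for a proper BenchmarkDotNet run. `Random.Shuffle` requires .NET 8 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical stand-in for Class1 from the question.
public sealed class Item
{
    public int IntProp1 { get; set; }
    public void DoWork(int i) => IntProp1 += i + i;
}

public static class LocalityDemo
{
    // Allocates the objects back to back, optionally shuffles them,
    // then stores them under sequential keys.
    public static Dictionary<int, Item> Build(int length, bool shuffle)
    {
        var objects = new Item[length];
        for (int i = 0; i < length; i++)
            objects[i] = new Item();

        if (shuffle)
            new Random(42).Shuffle(objects); // key i now maps to an arbitrary heap address

        var dict = new Dictionary<int, Item>(length);
        for (int i = 0; i < length; i++)
            dict[i] = objects[i];
        return dict;
    }

    // Same access pattern as the Class1 benchmark; returns a checksum.
    public static long Touch(Dictionary<int, Item> dict)
    {
        long sum = 0;
        for (int i = 0; i < dict.Count; i++)
        {
            var item = dict[i];
            item.DoWork(i);
            sum += item.IntProp1;
        }
        return sum;
    }

    // Rough wall-clock timing of one pass over the dictionary.
    public static TimeSpan Time(Dictionary<int, Item> dict)
    {
        var sw = Stopwatch.StartNew();
        Touch(dict);
        return sw.Elapsed;
    }
}
```

Comparing `LocalityDemo.Time(LocalityDemo.Build(100_000, false))` with the `shuffle: true` variant should show whether object placement alone changes the timing on a given machine.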

alexeyfv · 2024-03-27T21:33:51Z

Author

I've found the root cause of the problem. @elgonzo @huoyaoyuan you both were right - it is related to data locality and cache. :)

I added hardware counters to the benchmark to see cache misses. Here are the results that I got:

| Method   | Mean      | CacheMisses/Op |
|--------- |----------:|---------------:|
| Class1   |  26.89 ms |        771,546 |
| Class5   |  34.98 ms |      1,434,310 |
| Class10  |  46.31 ms |      2,177,061 |
| Class15  |  56.28 ms |      2,571,176 |
| Class20  |  63.04 ms |      3,002,823 |
| Class25  |  70.86 ms |      3,313,459 |
| Struct1  |  18.33 ms |         85,184 |
| Struct5  |  21.53 ms |        125,696 |
| Struct10 |  27.00 ms |        190,490 |
| Struct15 |  44.15 ms |        252,450 |
| Struct20 |  72.12 ms |        362,854 |
| Struct25 | 101.63 ms |        557,056 |
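
The post does not show how the counters were enabled. One way to get a CacheMisses/Op column, assuming BenchmarkDotNet with the BenchmarkDotNet.Diagnostics.Windows package and an elevated prompt on Windows, is the `HardwareCounters` attribute. The class and method names below are hypothetical.

```csharp
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;

// Assumption: hardware counters are collected on Windows and need
// the BenchmarkDotNet.Diagnostics.Windows package plus admin rights.
[HardwareCounters(HardwareCounter.CacheMisses)]
public class CacheMissBenchmarks // hypothetical class name
{
    public class Class1
    {
        public int IntProp1 { get; set; }
        public void DoWork(int i) => IntProp1 += i + i;
    }

    private readonly Dictionary<int, Class1> _class1 = new();

    [Params(100_000)]
    public int Length { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        // Fill the dictionary once before the measured iterations.
        for (var i = 0; i < Length; i++)
            _class1.Add(i, new Class1());
    }

    [Benchmark]
    public Dictionary<int, Class1> ClassLookup()
    {
        for (int i = 0; i < Length; i++)
            _class1[i].DoWork(i);
        return _class1;
    }
}
```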

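Then I started wondering why there were so many cache misses for classes, and I realized that I was creating the instances of all the classes in a single loop, one after another. As a result, instances of the same class were interleaved with instances of the other classes on the heap instead of sitting next to each other, which led to poor data locality and cache misses.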

for (var i = 0; i < Length; i++)
{
    _class1.Add(i, new Class1());
    _class5.Add(i, new Class5());
    _class10.Add(i, new Class10());
    _class15.Add(i, new Class15());
    _class20.Add(i, new Class20());
    _class25.Add(i, new Class25());

    _struct1.Add(i, new Struct1());
    _struct5.Add(i, new Struct5());
    _struct10.Add(i, new Struct10());
    _struct15.Add(i, new Struct15());
    _struct20.Add(i, new Struct20());
    _struct25.Add(i, new Struct25());
}

After fixing the code (looks awful, but anyway), the results look reliable.

for (var i = 0; i < Length; i++) _class1.Add(i, new Class1());
for (var i = 0; i < Length; i++) _class5.Add(i, new Class5());
for (var i = 0; i < Length; i++) _class10.Add(i, new Class10());
for (var i = 0; i < Length; i++) _class15.Add(i, new Class15());
for (var i = 0; i < Length; i++) _class20.Add(i, new Class20());
for (var i = 0; i < Length; i++) _class25.Add(i, new Class25());
for (var i = 0; i < Length; i++) _struct1.Add(i, new Struct1());
for (var i = 0; i < Length; i++) _struct5.Add(i, new Struct5());
for (var i = 0; i < Length; i++) _struct10.Add(i, new Struct10());
for (var i = 0; i < Length; i++) _struct15.Add(i, new Struct15());
for (var i = 0; i < Length; i++) _struct20.Add(i, new Struct20());
for (var i = 0; i < Length; i++) _struct25.Add(i, new Struct25());
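
The repeated loops could probably be folded into a small generic helper without changing the allocation order. This is only a sketch: `SetupHelpers.Fill` is a hypothetical name, not part of the original benchmark.

```csharp
using System.Collections.Generic;

public static class SetupHelpers // hypothetical helper, not in the original benchmark
{
    // Fills one dictionary completely before the caller moves on to the next,
    // so instances of the same type are allocated back to back.
    public static void Fill<T>(Dictionary<int, T> dict, int length) where T : new()
    {
        for (var i = 0; i < length; i++)
            dict.Add(i, new T());
    }
}
```

Each line of the setup would then become a call such as `SetupHelpers.Fill(_class1, Length);`, one call per dictionary.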

Now, classes are faster than structs as they should be. Thank you @elgonzo @huoyaoyuan. :)

| Method   | Mean     | CacheMisses/Op |
|--------- |---------:|---------------:|
| Class1   | 1.231 ms |          9,132 |
| Class5   | 1.418 ms |         13,337 |
| Class10  | 1.745 ms |         17,801 |
| Class15  | 2.343 ms |         22,379 |
| Class20  | 2.780 ms |         24,534 |
| Class25  | 3.373 ms |         30,493 |
| Struct1  | 1.616 ms |          5,003 |
| Struct5  | 2.029 ms |          9,210 |
| Struct10 | 2.552 ms |         15,894 |
| Struct15 | 4.288 ms |         26,898 |
| Struct20 | 5.743 ms |         32,496 |
| Struct25 | 6.988 ms |         40,474 |
