
Overhead match workload #2309

Closed
wants to merge 10 commits

Conversation

timcassell
Collaborator

@timcassell commented May 17, 2023

Matched overhead to workload return type and implementation.
Optimized Consumer.Consume<T>(in T value).
Use Consumer more liberally in benchmarks.
Cleaned up byref returns.
Fixed pointer returns in InProcessEmitToolchain.

Fixes #2305

Match overhead action implementation to workload action implementation.
Support more consume types.
Cleaned up byref returns.
@AndreyAkinshin
Member

@timcassell thanks for the PR! It will take some time for me to properly review and test it, but I will try to do it soon.

As for now, I have one high-level concern: if you need C# 11 for these changes, please do it in a separate PR. It's a significant upgrade for the source code base, and I would prefer to apply it independently from the overhead-related refactoring.

@timcassell
Collaborator Author

I just did that for raw string literal convenience. I didn't need it otherwise. Reverted.

@timcassell
Collaborator Author

timcassell commented May 17, 2023

I don't know if anyone actually ran into it, but I noticed the InProcessEmit tests were missing pointer returns, so I added that test and fixed it here, too (extra confidence that my changes work for both csproj gen and inprocess toolchains).

…n (ByRefLike and pointers), use Unsafe.SkipInit for everything else.

Check for pointer type when setting default local in InProcessEmitToolchain.

@AndreyAkinshin
Member

@timcassell just wanted to say that the review is still in progress. I tend to procrastinate on merging pull requests that touch toolchains. Since we do not have 100% test coverage, changes to these critical parts demand additional verification to prevent potential not-so-obvious bugs in corner cases.
Thank you for your patience!

@AndreyAkinshin
Member

AndreyAkinshin commented Jun 7, 2023

@timcassell, as I can see if the benchmark methods return int, we generate the following overhead method:

private System.Int32 __Overhead() // __ is to avoid possible name conflict
{
    System.Runtime.CompilerServices.Unsafe.SkipInit(out System.Int32 value);
    return value;
}

I don't think that it's a good idea to use SkipInit instead of return default. In my local experiments for Windows+Mono, such a SkipInit-based overhead method may actually consume more time than [Benchmark] public int Foo() => 0.

I would suggest restoring the original behavior (default for primitive types; null for objects).
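For clarity, the restored overhead bodies would presumably look something like this (an illustrative sketch of the two cases, not the actual generated source; the method names here are placeholders):

```csharp
// Primitive workload return type: return default.
private System.Int32 __Overhead() => default; // i.e. 0

// Reference workload return type: return null.
private System.Object __OverheadRef() => null;
```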

@timcassell
Collaborator Author

I don't think that it's a good idea to use SkipInit instead of return default. In my local experiments for Windows+Mono, such a SkipInit-based overhead method may actually consume more time than [Benchmark] public int Foo() => 0.

Interesting, I thought that would be inlined and eliminated as a noop. I guess the Mono JIT isn't that powerful.

I would suggest restoring the original behavior (default for primitive types; null for objects).

Should I do that for any type whose size is <= IntPtr.Size? Or <= 8 (would include double and long on 32-bit runtime)? Or only primitives and references?
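The three thresholds in question could be expressed roughly like this (illustrative helpers only, not BenchmarkDotNet code):

```csharp
using System;
using System.Runtime.CompilerServices;

static class OverheadSizePolicy
{
    // Option 1: anything that fits in a pointer-sized register.
    public static bool FitsInPointer<T>() => Unsafe.SizeOf<T>() <= IntPtr.Size;

    // Option 2: anything up to 8 bytes (includes long/double even on a 32-bit runtime).
    public static bool FitsInEightBytes<T>() => Unsafe.SizeOf<T>() <= 8;

    // Option 3: only primitives and reference types.
    public static bool IsPrimitiveOrReference<T>() => typeof(T).IsPrimitive || !typeof(T).IsValueType;
}
```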

@timcassell
Collaborator Author

timcassell commented Jun 7, 2023

@AndreyAkinshin I ran some benches on my Windows machine and got these results. Are you sure what you're seeing isn't just some noise in the system?

|                         Type |   Method |  Runtime |       Mean |
|----------------------------- |--------- |--------- |-----------:|
|     DefaultVsSkipInit<Int32> |  Default | .NET 7.0 |  1.0309 ns |
|     DefaultVsSkipInit<Int32> | SkipInit | .NET 7.0 |  1.1523 ns |
|                              |          |          |            |
|     DefaultVsSkipInit<Int64> |  Default | .NET 7.0 |  1.3675 ns |
|     DefaultVsSkipInit<Int64> | SkipInit | .NET 7.0 |  1.2110 ns |
|                              |          |          |            |
|  DefaultVsSkipInit<Struct16> |  Default | .NET 7.0 |  0.8585 ns |
|  DefaultVsSkipInit<Struct16> | SkipInit | .NET 7.0 |  0.2104 ns |
|                              |          |          |            |
|  DefaultVsSkipInit<Struct32> |  Default | .NET 7.0 |  1.7938 ns |
|  DefaultVsSkipInit<Struct32> | SkipInit | .NET 7.0 |  0.1544 ns |
|                              |          |          |            |
|  DefaultVsSkipInit<Struct64> |  Default | .NET 7.0 |  3.2387 ns |
|  DefaultVsSkipInit<Struct64> | SkipInit | .NET 7.0 |  0.1461 ns |
|                              |          |          |            |
| DefaultVsSkipInit<Struct128> |  Default | .NET 7.0 |  7.1040 ns |
| DefaultVsSkipInit<Struct128> | SkipInit | .NET 7.0 |  0.1735 ns |
|                              |          |          |            |
|     DefaultVsSkipInit<Int32> |  Default |     Mono |  3.2405 ns |
|     DefaultVsSkipInit<Int32> | SkipInit |     Mono |  2.5633 ns |
|                              |          |          |            |
|     DefaultVsSkipInit<Int64> |  Default |     Mono |  2.9140 ns |
|     DefaultVsSkipInit<Int64> | SkipInit |     Mono |  2.3334 ns |
|                              |          |          |            |
|  DefaultVsSkipInit<Struct16> |  Default |     Mono |  2.3247 ns |
|  DefaultVsSkipInit<Struct16> | SkipInit |     Mono |  3.1899 ns |
|                              |          |          |            |
|  DefaultVsSkipInit<Struct32> |  Default |     Mono |  3.3098 ns |
|  DefaultVsSkipInit<Struct32> | SkipInit |     Mono |  2.9090 ns |
|                              |          |          |            |
|  DefaultVsSkipInit<Struct64> |  Default |     Mono |  5.1525 ns |
|  DefaultVsSkipInit<Struct64> | SkipInit |     Mono |  3.3875 ns |
|                              |          |          |            |
| DefaultVsSkipInit<Struct128> |  Default |     Mono | 40.0935 ns |
| DefaultVsSkipInit<Struct128> | SkipInit |     Mono | 20.8666 ns |
Code

public struct Struct16
{
    public (long, long) field;
}

public struct Struct32
{
    public (Struct16, Struct16) field;
}

public struct Struct64
{
    public (Struct32, Struct32) field;
}

public struct Struct128
{
    public (Struct64, Struct64) field;
}

[GenericTypeArguments(typeof(int))]
[GenericTypeArguments(typeof(long))]
[GenericTypeArguments(typeof(Struct16))]
[GenericTypeArguments(typeof(Struct32))]
[GenericTypeArguments(typeof(Struct64))]
[GenericTypeArguments(typeof(Struct128))]
public class DefaultVsSkipInit<T>
{
    public volatile byte _byte;

    [Benchmark]
    public void Default()
    {
        Consume(default(T));
    }

    [Benchmark]
    public void SkipInit()
    {
        Unsafe.SkipInit(out T value);
        Consume(value);
    }

    private void Consume(in T value)
    {
        _byte = Unsafe.As<T, byte>(ref Unsafe.AsRef(value));
    }
}

[Edit] I ran benchmarks again with directly typed out types (instead of generic), and it still just looks like noise to me:

Details

|              Method |  Runtime |      Mean |
|-------------------- |--------- |----------:|
|          DefaultInt | .NET 7.0 | 1.3880 ns |
|         SkipInitInt | .NET 7.0 | 1.1040 ns |
|                     |          |           |
|         DefaultLong | .NET 7.0 | 0.8975 ns |
|        SkipInitLong | .NET 7.0 | 1.1761 ns |
|                     |          |           |
|       DefaultObject | .NET 7.0 | 0.8632 ns |
|      SkipInitObject | .NET 7.0 | 0.8647 ns |
|                     |          |           |
|     DefaultStruct16 | .NET 7.0 | 0.8540 ns |
|    SkipInitStruct16 | .NET 7.0 | 0.2093 ns |
|                     |          |           |
|     DefaultStruct32 | .NET 7.0 | 1.9321 ns |
|    SkipInitStruct32 | .NET 7.0 | 0.1235 ns |
|                     |          |           |
|  DefaultEmptyStruct | .NET 7.0 | 1.3040 ns |
| SkipInitEmptyStruct | .NET 7.0 | 0.4895 ns |
|                     |          |           |
|          DefaultInt |     Mono | 0.5304 ns |
|         SkipInitInt |     Mono | 0.4205 ns |
|                     |          |           |
|         DefaultLong |     Mono | 0.5642 ns |
|        SkipInitLong |     Mono | 0.4595 ns |
|                     |          |           |
|       DefaultObject |     Mono | 0.7144 ns |
|      SkipInitObject |     Mono | 0.3955 ns |
|                     |          |           |
|     DefaultStruct16 |     Mono | 1.3614 ns |
|    SkipInitStruct16 |     Mono | 0.4867 ns |
|                     |          |           |
|     DefaultStruct32 |     Mono | 1.2778 ns |
|    SkipInitStruct32 |     Mono | 0.9759 ns |
|                     |          |           |
|  DefaultEmptyStruct |     Mono | 0.6031 ns |
| SkipInitEmptyStruct |     Mono | 0.4708 ns |
public class DefaultVsSkipInit
{
    public volatile byte _byte;

    [Benchmark]
    public void DefaultInt()
    {
        Consume(default(int));
    }

    [Benchmark]
    public void SkipInitInt()
    {
        Unsafe.SkipInit(out int value);
        Consume(value);
    }

    [Benchmark]
    public void DefaultLong()
    {
        Consume(default(long));
    }

    [Benchmark]
    public void SkipInitLong()
    {
        Unsafe.SkipInit(out long value);
        Consume(value);
    }

    [Benchmark]
    public void DefaultObject()
    {
        Consume(default(object));
    }

    [Benchmark]
    public void SkipInitObject()
    {
        Unsafe.SkipInit(out object value);
        Consume(value);
    }

    [Benchmark]
    public void DefaultStruct16()
    {
        Consume(default(Struct16));
    }

    [Benchmark]
    public void SkipInitStruct16()
    {
        Unsafe.SkipInit(out Struct16 value);
        Consume(value);
    }

    [Benchmark]
    public void DefaultStruct32()
    {
        Consume(default(Struct32));
    }

    [Benchmark]
    public void SkipInitStruct32()
    {
        Unsafe.SkipInit(out Struct32 value);
        Consume(value);
    }

    [Benchmark]
    public void DefaultEmptyStruct()
    {
        Consume(default(EmptyStruct));
    }

    [Benchmark]
    public void SkipInitEmptyStruct()
    {
        Unsafe.SkipInit(out EmptyStruct value);
        Consume(value);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private void Consume<T>(in T value)
    {
        _byte = Unsafe.As<T, byte>(ref Unsafe.AsRef(value));
    }
}

And even with the noise, it looks like Unsafe.SkipInit is still slightly faster, even for int.

@AndreyAkinshin
Member

Hmm. This is strange. I performed the following integration experiment. First, I wrote this benchmark:

[Config(typeof(Config))]
public class Benchmarks
{
    private class Config : ManualConfig
    {
        public Config()
        {
            var job = Job.Default
                    .WithRuntime(MonoRuntime.Default)
                    .WithIterationCount(100)
                    .WithInvocationCount(160_000_000)
                ;
            AddJob(job);
        }
    }
    [Benchmark] public int Foo() => 0;
}

Also, I hacked the BenchmarkDotNet sources to force it to always perform 100 overhead iterations:

diff --git a/src/BenchmarkDotNet/Engines/EngineGeneralStage.cs b/src/BenchmarkDotNet/Engines/EngineGeneralStage.cs
--- a/src/BenchmarkDotNet/Engines/EngineGeneralStage.cs	(revision 56756271f63663c406b82077209a6e630d703b1d)
+++ b/src/BenchmarkDotNet/Engines/EngineGeneralStage.cs	(date 1686227871484)
@@ -32,7 +32,7 @@
         }
 
         public IReadOnlyList<Measurement> RunOverhead(long invokeCount, int unrollFactor)
-            => RunAuto(invokeCount, IterationMode.Overhead, unrollFactor);
+            => RunSpecific(invokeCount, IterationMode.Overhead, 100, unrollFactor);
 
         public IReadOnlyList<Measurement> RunWorkload(long invokeCount, int unrollFactor, bool forceSpecific = false)
             => Run(invokeCount, IterationMode.Workload, false, unrollFactor, forceSpecific);

Thus, we get 100 overhead measurements and 100 workload measurements.

Here are the corresponding density plots for the master branch:
[density plot: bdn-stable]

And here are the plots for your PR:
[density plot: bdn-pr2309]

As you can see, the left mode of the overhead distribution is approximately 1 CPU cycle larger than the left mode of the workload distribution.

Raw data:
bdn-stable.csv
bdn-pr2309.csv

@timcassell
Collaborator Author

@AndreyAkinshin How are you generating those plots? I want to try that on my machine.

@AndreyAkinshin
Member

First, I generated csv from summary:

var summary = BenchmarkRunner.Run<Benchmarks>();
var builder = new StringBuilder();
Thread.CurrentThread.CurrentCulture = DefaultCultureInfo.Instance;
builder.AppendLine("duration,type");
foreach (var measurement in summary.Reports.First().AllMeasurements.Where(m => m.IsOverhead() && m.IterationStage == IterationStage.Actual))
    builder.AppendLine((measurement.Nanoseconds / measurement.Operations).ToString("N4") + ",overhead");
foreach (var measurement in summary.Reports.First().AllMeasurements.Where(m => m.IsWorkload() && m.IterationStage == IterationStage.Actual))
    builder.AppendLine((measurement.Nanoseconds / measurement.Operations).ToString("N4") + ",workload");
File.WriteAllText("bdn.csv", builder.ToString());

(Yes, we could introduce a new exporter or reuse an existing one, but this was the fastest way for me to obtain the first results.)

Next, I used the following R script to generate plots:

# Run `install.packages("ggplot2")` to install the package for the first time
library(ggplot2)

df <- read.csv("bdn.csv")
ggplot(df, aes(duration, col = type)) +
  geom_density(bw = "SJ") +
  geom_rug() +
  labs(
    x = "Duration, ns",
    y = "Density",
    col = "Method"
  )

@timcassell
Collaborator Author

timcassell commented Jun 8, 2023

Very interesting. When I run with those forced iteration and invocation counts, I'm seeing results similar to yours. Strange that my heuristic benchmarks were producing different results. Anyway, I'll do some more testing with these forced invoke counts to see which types should use default vs SkipInit, and update this PR accordingly.

[Edit] Oh, it looks like it's because I was consuming it directly in my benchmark instead of returning the value. It had nothing to do with the invoke counts. I'm not really sure why that causes the discrepancy, though.

@timcassell
Collaborator Author

Well, it looks like default is almost universally faster than Unsafe.SkipInit when just using it to return the value, even up to a struct size of 1024 bytes.

Results

// Mono

|              Method |       Mean |     Error |    StdDev |     Median |
|-------------------- |-----------:|----------:|----------:|-----------:|
|    ReturnDefaultInt |   3.392 ns | 0.0287 ns | 0.0268 ns |   3.397 ns |
|   ReturnSkipInitInt |   3.993 ns | 0.0254 ns | 0.0225 ns |   3.988 ns |

|         DefaultLong |   3.504 ns | 0.0093 ns | 0.0082 ns |   3.507 ns |
|        SkipInitLong |   4.262 ns | 0.0849 ns | 0.1934 ns |   4.224 ns |

|       DefaultObject |   5.468 ns | 0.0371 ns | 0.0329 ns |   5.472 ns |
|      SkipInitObject |   6.319 ns | 0.0840 ns | 0.0786 ns |   6.325 ns |

|     DefaultStruct16 |  14.526 ns | 0.1066 ns | 0.0890 ns |  14.551 ns |
|    SkipInitStruct16 |  15.296 ns | 0.3015 ns | 0.2821 ns |  15.180 ns |

|     DefaultStruct32 |  15.909 ns | 0.0884 ns | 0.0827 ns |  15.973 ns |
|    SkipInitStruct32 |  17.183 ns | 0.0700 ns | 0.0584 ns |  17.192 ns |

|     DefaultStruct64 |  23.040 ns | 0.1999 ns | 0.1870 ns |  22.989 ns |
|    SkipInitStruct64 |  25.419 ns | 0.0884 ns | 0.0690 ns |  25.442 ns |

|    DefaultStruct128 | 117.386 ns | 1.0833 ns | 0.9046 ns | 117.527 ns |
|   SkipInitStruct128 | 140.688 ns | 1.7634 ns | 1.6494 ns | 140.357 ns |

|  DefaultEmptyStruct |  10.989 ns | 0.0484 ns | 0.0429 ns |  10.995 ns |
| SkipInitEmptyStruct |  10.888 ns | 0.0538 ns | 0.0449 ns |  10.890 ns |

// .Net 7

|              Method |      Mean |     Error |    StdDev |    Median |
|-------------------- |----------:|----------:|----------:|----------:|
|    ReturnDefaultInt |  2.262 ns | 0.0103 ns | 0.0086 ns |  2.263 ns |
|   ReturnSkipInitInt |  2.209 ns | 0.0139 ns | 0.0116 ns |  2.211 ns |

|         DefaultLong |  2.168 ns | 0.0414 ns | 0.0580 ns |  2.141 ns |
|        SkipInitLong |  2.186 ns | 0.0271 ns | 0.0226 ns |  2.181 ns |

|       DefaultObject |  4.535 ns | 0.0781 ns | 0.0693 ns |  4.538 ns |
|      SkipInitObject |  4.767 ns | 0.0915 ns | 0.0940 ns |  4.760 ns |

|     DefaultStruct16 |  3.080 ns | 0.0606 ns | 0.1441 ns |  3.012 ns |
|    SkipInitStruct16 |  3.312 ns | 0.0639 ns | 0.0937 ns |  3.315 ns |

|     DefaultStruct32 |  3.323 ns | 0.0202 ns | 0.0169 ns |  3.322 ns |
|    SkipInitStruct32 | 10.750 ns | 0.1241 ns | 0.1100 ns | 10.722 ns |

|     DefaultStruct64 |  4.113 ns | 0.0551 ns | 0.0489 ns |  4.107 ns |
|    SkipInitStruct64 | 13.704 ns | 0.1115 ns | 0.0989 ns | 13.688 ns |

|    DefaultStruct128 |  5.124 ns | 0.0334 ns | 0.0296 ns |  5.127 ns |
|   SkipInitStruct128 | 16.788 ns | 0.3314 ns | 0.7062 ns | 16.597 ns |

|  DefaultEmptyStruct |  2.935 ns | 0.0227 ns | 0.0201 ns |  2.930 ns |
| SkipInitEmptyStruct |  2.740 ns | 0.0278 ns | 0.0260 ns |  2.728 ns |

I even tried adding SkipLocalsInitAttribute, and it did get slightly faster, but still not quite as fast as default.

|                   Method |     Mean |    Error |   StdDev |
|------------------------- |---------:|---------:|---------:|
|        DefaultStruct1024 | 21.17 ns | 0.361 ns | 0.338 ns |
|       SkipInitStruct1024 | 48.10 ns | 0.922 ns | 0.906 ns |
| SkipLocalsInitStruct1024 | 33.75 ns | 0.629 ns | 0.618 ns |
|          Struct1024Field | 32.06 ns | 0.401 ns | 0.375 ns |

So, I went ahead and removed Unsafe.SkipInit in favor of default for all types.


@AndreyAkinshin
Member

Ok, now we should discuss the original problem that I described in #2305. Let's consider the following benchmark:

public struct HugeStruct
{
    public long L00;
    public long L01;
    public long L02;
    public long L03;
    public long L04;
    public long L05;
    public long L06;
    public long L07;
    public long L08;
    public long L09;
}


public class Benchmarks
{
    private class Config : ManualConfig
    {
        public Config()
        {
            var job = Job.Default
                    .WithRuntime(MonoRuntime.Default)
                    .WithIterationCount(100)
                    .WithInvocationCount(16_000_000)
                ;
            AddJob(job);
        }
    }
    private HugeStruct s = new HugeStruct();
    [Benchmark] public HugeStruct Foo() => s;
}

Using the same approach as before, I got the following picture (I took the latest state of this PR):
[density plot: bdn]
As you can see, the overhead takes noticeably more time than workload.

@timcassell
Collaborator Author

timcassell commented Jun 9, 2023

Hmm... the result of that measurement seems to depend on the runtime. In .Net 7 and Framework, I observe that default is always fastest. But in Mono, reading from a field is fastest.

// Mono
|             Method |     Mean |   Error |  StdDev |
|------------------- |---------:|--------:|--------:|
|  DefaultStruct1024 | 261.5 ns | 1.06 ns | 0.88 ns |
| SkipInitStruct1024 | 294.7 ns | 1.61 ns | 1.50 ns |
|    FieldStruct1024 | 229.6 ns | 0.45 ns | 0.40 ns |
| UnsafeAsStruct1024 | 238.2 ns | 0.50 ns | 0.44 ns |

// .Net 7
|             Method |     Mean |    Error |   StdDev |
|------------------- |---------:|---------:|---------:|
|  DefaultStruct1024 | 20.42 ns | 0.097 ns | 0.081 ns |
| SkipInitStruct1024 | 45.71 ns | 0.143 ns | 0.119 ns |
|    FieldStruct1024 | 32.54 ns | 0.073 ns | 0.065 ns |
| UnsafeAsStruct1024 | 33.00 ns | 0.099 ns | 0.083 ns |

// .Net Framework 4.7.2
|             Method |     Mean |    Error |   StdDev |
|------------------- |---------:|---------:|---------:|
|  DefaultStruct1024 | 16.64 ns | 0.196 ns | 0.174 ns |
| SkipInitStruct1024 | 73.49 ns | 0.217 ns | 0.203 ns |
|    FieldStruct1024 | 30.84 ns | 0.168 ns | 0.157 ns |
| UnsafeAsStruct1024 | 42.91 ns | 0.835 ns | 0.893 ns |

I also experimented with another unsafe approach, which got close to the field read in Mono, but is still slower in other runtimes.

[Benchmark]
public Struct1024 UnsafeAsStruct1024()
{
    byte b = 0;
    return Unsafe.As<byte, Struct1024>(ref b);
}

Can we branch the code generated depending on the runtime? We could use field read in Mono, and default in other runtimes.

[Edit] Mono with .Net 7 results (wow, it's slooooow!): default is still fastest there, so it seems only classic Mono has the field-vs-default issue.

|             Method |     Mean |     Error |    StdDev |
|------------------- |---------:|----------:|----------:|
|  DefaultStruct1024 | 1.021 us | 0.0096 us | 0.0090 us |
| SkipInitStruct1024 | 1.295 us | 0.0055 us | 0.0052 us |
|    FieldStruct1024 | 1.209 us | 0.0029 us | 0.0027 us |
| UnsafeAsStruct1024 | 1.002 us | 0.0043 us | 0.0041 us |

@timcassell force-pushed the overhead-match-workload branch 2 times, most recently from 44f74c4 to cc207fa (June 10, 2023 08:06)
@timcassell
Collaborator Author

timcassell commented Jun 12, 2023

@AndreyAkinshin Latest uses field read if the struct is > 64 bytes and the runtime is old Mono, default for all other cases. Is that ok? (Tbh I'm not sure how 32-bit works, if we can even mix 32/64-bit host/benchmark process. I only tested 64-bit.)
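In other words, the selection rule in the latest revision is roughly this (a hypothetical summary of the decision, not the PR's actual code; the names are illustrative):

```csharp
// Hypothetical summary of the overhead-body selection described above.
enum OverheadBody { ReturnDefault, ReadField }

static OverheadBody Choose(int returnTypeSize, bool isClassicMono)
    => isClassicMono && returnTypeSize > 64
        ? OverheadBody.ReadField      // read a pre-initialized field of the return type
        : OverheadBody.ReturnDefault; // return default
```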

@AndreyAkinshin
Member

Latest uses field read if the struct is > 64 bytes and the runtime is old Mono, default for all other cases.

I don't like this approach. It feels too hacky and fragile. I suspect that it would be quite challenging to maintain it and keep it relevant to all the runtimes we support. Another severe issue is the fairness of comparison between different benchmarks. Could we compare absolute nanobenchmark measurements between Mono and non-Mono? Or a benchmark with struct size of 64 and 65 bytes?

The original idea behind the overhead evaluation was to measure the overhead of the general BenchmarkDotNet infrastructure, like method/delegate calls and the main loop that performs the invocations. The delegate call evaluation includes the instruction pointer jumps. In order to make the overhead evaluation as "honest" as possible, we try to match the signature of the overhead method to the signature of the target workload method (where such an approach is applicable). We use this signature mimicry to ensure that an empty method (e.g., int Foo() => 0) has a measured duration of zero. Therefore, BenchmarkDotNet always tries to report the actual duration of the method body, excluding the calling overhead.

In the case of an empty int method (like int Foo() => 0), the "zero" result seems reasonable. However, in the case of a returned struct, we don't have such an obvious baseline. As we can see, both creating a new struct instance and reading a struct instance from memory take some time. Non-zero results in this case may be considered expected.

Thus, I don't think we should use the actual struct type of the target workload method for the overhead method signature. This gives us two options for the overhead return type:

  1. Use a primitive type. In this case, the workload method calling infrastructure shouldn't pass .Field to the Consumer to avoid additional overhead.
  2. Use a fake struct type with one or two fields and pass .Field to Consumer for both overhead and workload methods.

@timcassell what do you think?

@timcassell
Collaborator Author

timcassell commented Jun 15, 2023

So, if we were to measure the cost of creating 2 different structs

[Benchmark] public SmallStruct Small() => default;
[Benchmark] public LargeStruct Large() => default;

In the current version, they would both report 0. But we may want them to report different time values (since creating the struct does actually take some time).

Use a primitive type. In this case, the workload method calling infrastructure shouldn't pass .Field to the Consumer to avoid additional overhead.
Use a fake struct type with one or two fields and pass .Field to Consumer for both overhead and workload methods.

I don't like either of these since they don't solve the issue. Using a primitive type instead of the actual type is what we're already doing. And we can't just remove .field from ref structs.

The original idea behind the overhead evaluation was to evaluate the overhead of the general BenchmarkDotNet infrastructure, like method/delegates calls and the main loop that performs invocation. The delegate call evaluation includes the instruction pointer jumps.

Would it be reasonable for the overhead delegate to always return void, and we stub in the value for the Consumer call in the OverheadAction?

private void OverheadActionNoUnroll(System.Int64 invokeCount)
{
    LargeStruct consumeMe = default;
    for (System.Int64 i = 0; i < invokeCount; i++)
    {
        overheadDelegate();
        consumer.Consume(consumeMe);
    }
}

[Edit] As for the time spent in the OverheadActionNoUnroll initializing the default(LargeStruct), in #2111 I refactored it so that the BenchmarkType starts the clock instead of the Engine, so it would be trivial to initialize that before starting the clock if that were to be merged.

@AndreyAkinshin
Member

In the current version, they would both report 0. But we may want them to report different time values (since it does actually take some time).

I just checked how it works with the latest version of BenchmarkDotNet from the master branch:

public struct SmallStruct
{
    public long Field00 { get; }
}

public struct LargeStruct
{
    public long Field00 { get; }
    public long Field01 { get; }
    public long Field02 { get; }
    public long Field03 { get; }
    public long Field04 { get; }
    public long Field05 { get; }
    public long Field06 { get; }
    public long Field07 { get; }
    public long Field08 { get; }
    public long Field09 { get; }
    public long Field10 { get; }
    public long Field11 { get; }
    public long Field12 { get; }
    public long Field13 { get; }
    public long Field14 { get; }
    public long Field15 { get; }
    public long Field16 { get; }
    public long Field17 { get; }
    public long Field18 { get; }
    public long Field19 { get; }
    public long Field20 { get; }
    public long Field21 { get; }
    public long Field22 { get; }
    public long Field23 { get; }
    public long Field24 { get; }
    public long Field25 { get; }
    public long Field26 { get; }
    public long Field27 { get; }
    public long Field28 { get; }
    public long Field29 { get; }
    public long Field30 { get; }
    public long Field31 { get; }
    public long Field32 { get; }
    public long Field33 { get; }
    public long Field34 { get; }
    public long Field35 { get; }
    public long Field36 { get; }
    public long Field37 { get; }
    public long Field38 { get; }
    public long Field39 { get; }
    public long Field40 { get; }
    public long Field41 { get; }
    public long Field42 { get; }
    public long Field43 { get; }
    public long Field44 { get; }
    public long Field45 { get; }
    public long Field46 { get; }
    public long Field47 { get; }
    public long Field48 { get; }
    public long Field49 { get; }
}

public class Benchmarks
{
    [Benchmark] public SmallStruct Small() => default;
    [Benchmark] public LargeStruct Large() => default;
}

Here are the results:

BenchmarkDotNet=v0.13.5.20230616-develop, OS=ubuntu 22.04
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.304
  [Host]     : .NET 7.0.7 (7.0.723.27404), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.7 (7.0.723.27404), X64 RyuJIT AVX2


| Method |      Mean |     Error |    StdDev |
|------- |----------:|----------:|----------:|
|  Small | 0.0000 ns | 0.0000 ns | 0.0000 ns |
|  Large | 2.8153 ns | 0.0254 ns | 0.0237 ns |

// * Warnings *
ZeroMeasurement
  Benchmarks.Small: Default -> The method duration is indistinguishable from the empty method duration

This looks like a reasonable result. I would like to keep this behavior.

Would it be reasonable for the overhead delegate to always return void, and we stub in the value for the Consumer call in the OverheadAction?

I have some doubts about this idea. Let's consider the following benchmarks:

[Benchmark] public object EmptyObject() => null;
[Benchmark] public int EmptyInt() => 0;
[Benchmark] public double EmptyDouble() => 0.0;

With the current version of BenchmarkDotNet, all of these benchmarks have "zero" duration. I would like to keep this behavior. If void Overhead allows us to achieve this on all the runtimes, we can consider such an approach (however, I would expect some non-zero results).

in #2111 I refactored it so that the BenchmarkType starts the clock instead of the Engine, so it would be trivial to initialize that before starting the clock if that were to be merged.

Sorry for keeping #2111 without a review for such a long time. As you can see, it's quite challenging to verify engine/toolchain-related changes: any modifications in these subsystems may lead to surprisingly unpleasant side effects. I will try to make another attempt to review it and share my thoughts.

@timcassell
Collaborator Author

timcassell commented Jun 16, 2023

I just checked how it works with the latest version of BenchmarkDotNet from the master branch:

Yeah, I mean the current version of this PR, sorry.

I have some doubts about this idea. Let's consider the following benchmarks:

[Benchmark] public object EmptyObject() => null;
[Benchmark] public int EmptyInt() => 0;
[Benchmark] public double EmptyDouble() => 0.0;

With the current version of BenchmarkDotNet, all of these benchmarks have "zero" duration. I would like to keep this behavior. If void Overhead allows us to achieve this on all the runtimes, we can consider such an approach (however, I would expect some non-zero results).

Would that be so bad, though? After all, I would expect this

IL_0000: ldc.i4.0
IL_0001: ret

To take some non-zero time longer than this

IL_0000: ret

I mean yes, people may see "regressions", but it would be more accurate, wouldn't it?

Sorry for keeping #2111 without a review for such a long time. As you can see, it's quite challenging to verify engine/toolchain-related changes: any modifications in these subsystems may lead to surprisingly unpleased side effects. I will try to make another attempt to review it and share my thoughts.

Excellent! I'll go fix the merge issues, then! You may want to review #2108 first, though.

@timcassell
Collaborator Author

timcassell commented Jun 18, 2023

I experimented with that idea, and I didn't like the results I was getting. The difference between loading a local to pass to Consume and just passing the value that's already on the stack from the workload call is too great.

So I studied the WorkloadAction implementation to try to find another method, and I thought, why is the Consume call even necessary at all? So I stripped it out and replaced it with Thread.MemoryBarrier(). With that and the overhead method returning void, these are the results I got:

|      Method |  Runtime |       Mean |     Error |    StdDev |
|------------ |--------- |-----------:|----------:|----------:|
|        Void | .NET 7.0 |  0.0038 ns | 0.0089 ns | 0.0075 ns |
|         Int | .NET 7.0 |  0.0000 ns | 0.0000 ns | 0.0000 ns |
|        Long | .NET 7.0 |  0.0000 ns | 0.0000 ns | 0.0000 ns |
|      Object | .NET 7.0 |  0.0000 ns | 0.0000 ns | 0.0000 ns |
|     Pointer | .NET 7.0 |  0.0023 ns | 0.0103 ns | 0.0080 ns |
|    Struct16 | .NET 7.0 |  0.1410 ns | 0.0431 ns | 0.0819 ns |
| EmptyStruct | .NET 7.0 |  0.0003 ns | 0.0009 ns | 0.0008 ns |
|        Void |     Mono |  0.0426 ns | 0.0025 ns | 0.0019 ns |
|         Int |     Mono |  0.0397 ns | 0.0037 ns | 0.0033 ns |
|        Long |     Mono |  0.0000 ns | 0.0000 ns | 0.0000 ns |
|      Object |     Mono |  0.0387 ns | 0.0039 ns | 0.0035 ns |
|     Pointer |     Mono |  0.0000 ns | 0.0000 ns | 0.0000 ns |
|    Struct16 |     Mono | 13.8994 ns | 0.3294 ns | 0.4829 ns |
| EmptyStruct |     Mono |  5.8519 ns | 0.0284 ns | 0.0222 ns |

The default primitives and null are at or close to 0 (with a bit of noise in the system), which matches the current results.

Can we refactor to remove Consumer from the BenchmarkType altogether? I think if we call the workload method with NoInlining (which to my understanding was the purpose of wrapping it in a delegate) along with Thread.MemoryBarrier, it should prevent any dead code elimination, or out-of-order execution (maybe the NoInlining is enough by itself?).

for (System.Int64 i = 0; i < invokeCount; i++)
{
    workloadDelegate($PassArguments$); // We don't do anything with the return value, it just gets popped off the stack.
    System.Threading.Thread.MemoryBarrier();
}
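Presumably the matching overhead loop would then have the same shape, with the void-returning overhead delegate in place of the workload call (a sketch under the same assumptions; overheadDelegate is the assumed counterpart, not code from this PR):

```csharp
for (System.Int64 i = 0; i < invokeCount; i++)
{
    overheadDelegate(); // void-returning overhead stub
    System.Threading.Thread.MemoryBarrier(); // same barrier as the workload loop
}
```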

@timcassell
Collaborator Author

@AndreyAkinshin This PR turned out to be the wrong direction, so I'm closing it in favor of #2336. I also put up a separate PR #2337 for just the Consumer optimization.

@timcassell closed this Jun 21, 2023
Successfully merging this pull request may close these issues.

overhead and workload invocation sequences diverge