Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

overhead and workload invocation sequences diverge #2305

Open
AndyAyersMS opened this issue May 13, 2023 · 11 comments · May be fixed by #2336
Open

overhead and workload invocation sequences diverge #2305

AndyAyersMS opened this issue May 13, 2023 · 11 comments · May be fixed by #2336
Assignees

Comments

@AndyAyersMS
Copy link
Member

If these don't match then benchmark times will be either under or over estimated.

Context dotnet/runtime#86033 (comment)

This is the "good" case where the jit isn't messing up the workload invoke codegen (which just makes the problem worse).

        public delegate System.Single OverheadDelegate();

        private void OverheadActionUnroll(System.Int64 invokeCount)
        {
            
            for (System.Int64 i = 0; i < invokeCount; i++)
            {
                consumer.Consume(overheadDelegate());


        public delegate  System.Numerics.Vector3 WorkloadDelegate();

        private void WorkloadActionUnroll(System.Int64 invokeCount)
        {
            
            for (System.Int64  i = 0; i < invokeCount; i++)
            {
                consumer.Consume(workloadDelegate().X);
;; overhead

       mov      rbp, gword ptr [rsi+38H]
       mov      rax, gword ptr [rsi+28H]
       mov      rcx, gword ptr [rax+08H]
       call     [rax+18H]BenchmarkDotNet.Autogenerated.Runnable_0+OverheadDelegate:Invoke():float:this
       vmovss   dword ptr [rbp+48H], xmm0

;; workload

       mov      rbp, gword ptr [rsi+38H]
       mov      rax, gword ptr [rsi+30H]
       mov      rcx, gword ptr [rax+08H]
       lea      rdx, [rsp+110H]
       call     [rax+18H]BenchmarkDotNet.Autogenerated.Runnable_0+WorkloadDelegate:Invoke():System.Numerics.Vector3:this
       vmovss   xmm0, dword ptr [rsp+110H]
       vmovss   dword ptr [rbp+48H], xmm0

May be restricted to certain return types like Vector; I haven't seen this elsewhere.

@AndreyAkinshin
Copy link
Member

@AndyAyersMS Thanks for the bug report! I can confirm that it is a severe issue that affects all the benchmarks that return structs (except IntPtr, UIntPtr). The minimal repro:

public struct MyStruct
{
    public int Value;
}

public class Benchmarks
{
    [Benchmark]
    public MyStruct Foo() => new MyStruct();
}

However, it's not clear to me how to properly fix this problem. Below are some of my thoughts on this.

Firstly, we have two primary ways to consume the return value of the workload method:

consumer.Consume(workloadDelegate().Value); // Workload:1
consumer.Consume(workloadDelegate()); // Workload:2

The first one (Workload:1) is the current default. We don't use the second one (Workload:2) right now because it has object wrapping as a side-effect (Consume can natively consume only primitive types, IntPtr, UIntPtr, object).

Secondly, we have several ways to define the overhead delegate:

// Overhead:1
private System.Int32 __Overhead()
{
    return default(System.Int32);
}

// Overhead:2
private MyStruct __Overhead()
{
    return new MyStruct();
}

// Overhead:3
private MyStruct value;

private System.Int32 __Overhead()
{
    return value;
}

Here are some comments about all of these options:

  • Overhead:1 It is the current default: it always prefers a type that can be natively consumed by Consumer. Unfortunately, it doesn't match either Workload:1 (which has an additional .Value accessor) or Workload:2.
  • Overhead:2 It matches Workload:2, but it has a call of the MyStruct constructor as a side effect. If Workload doesn't create a new struct (e.g., it reads a value from a field), such an Overhead implementation is incorrect.
  • Overhead:3 It matches Workload:2, but it has a field reading as a side effect. If Workload doesn't read a field (e.g., it creates a new MyStruct instance), such an Overhead implementation is incorrect.

Since we don't know in advance how the given benchmark obtains the struct value, it's quite hard to provide a proper Overhead implementation that matches the Workload implementation. Therefore, BenchmarkDotNet uses Overhead:1 to reduce the risk of mismatched implementation. Unfortunately, it leads to other issues that are presented in dotnet/runtime#86033

At the moment, I have only one idea of how to provide a proper baseline for benchmarks that return structs:

  1. We introduce our own single-field struct in the BenchmarkDotNet template:
public struct OverheadStruct
{
    public int Value;
}
  1. The overhead method returns a new instance of this struct:
private OverheadStruct __Overhead()
{
    return new OverheadStruct();
}

Since OverheadStruct contains a single field, it should provide a nice baseline for struct creation (at least for structs that contain at least one field).

  1. For both overheadDelegate and overheadDelegate we pass a field value to a consumer (whenever field consuming is applicable):
        public delegate OverheadStruct OverheadDelegate();

        private void OverheadActionUnroll(System.Int64 invokeCount)
        {

            for (System.Int64 i = 0; i < invokeCount; i++)
            {
                consumer.Consume(overheadDelegate().Value);


        public delegate System.Numerics.Vector3 WorkloadDelegate();

        private void WorkloadActionUnroll(System.Int64 invokeCount)
        {
            for (System.Int64  i = 0; i < invokeCount; i++)
            {
                consumer.Consume(overheadDelegate().X);

I don't feel like it is a perfect solution, but it should (seemingly) provide a more accurate baseline for benchmarks that return structs.

@AndyAyersMS @adamsitnik What do you think?

@timcassell
Copy link
Collaborator

timcassell commented May 14, 2023

@AndreyAkinshin What about adding a Consumer<T> type to consume the struct as a whole? (Obviously would not work with ref struct, but neither does the current consumer.)

@timcassell
Copy link
Collaborator

it has object wrapping as a side-effect

Also, if you use Consume<T>(in T value), it does not box the value (nullables fixed in #2191).

@AndreyAkinshin
Copy link
Member

@timcassell With such an approach, Consume will spend some time copying the fields of the given value to the holder struct instance. It would be another unpleasant side effect that has to be reproduced in Overhead.

@timcassell
Copy link
Collaborator

timcassell commented May 14, 2023

@timcassell With such an approach, Consume will spend some time copying the fields of the given value to the holder struct instance. It would be another unpleasant side effect that has to be reproduced in Overhead.

I would assume that any consume code would be replicated in overhead regardless, no?

Anyway, I was thinking of simplifying it to this.

public unsafe void Consume<T>(in T value)
    => ptrHolder = (IntPtr) Unsafe.AsPointer(ref Unsafe.AsRef(in value));

Then it won't matter how large the struct is, it's only grabbing its reference. What do you think? (current implementation passes it to DeadCodeEliminationHelper.KeepAliveWithoutBoxingReadonly)

@timcassell
Copy link
Collaborator

And I think the overhead method can be changed to this:

private MyStruct __Overhead()
{
    Unsafe.SkipInit(out MyStruct value);
    return value;
}

That matches the workload type, and avoids the cost of field reading and constructor call and zero-initializing.

@AndreyAkinshin
Copy link
Member

Then it won't matter how large the struct is, it's only grabbing its reference.

Sounds good.

That matches the workload type, and avoids the cost of field reading and constructor call and zero-initializing.

It's a great idea, I like it!

Do you want to send a PR?

@timcassell
Copy link
Collaborator

Do you want to send a PR?

Sure thing.

@MichalPetryka
Copy link
Contributor

Why is a consumer even needed here? Isn't it enough to just do a NoInline method call?

@AndyAyersMS
Copy link
Member Author

Another thought is to not optimize these methods, that way the return value can be unconsumed but presumably would always still be produced. But it would make overhead higher, which is probably less desirable.

@timcassell
Copy link
Collaborator

Why is a consumer even needed here? Isn't it enough to just do a NoInline method call?

I thought the same thing. There was a short discussion about that in #2173.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants