Call stack performance investigation #8178

rahulchaphalkar · 2024-03-18T23:49:14Z

I am running the ackermann benchmark with wasmtime, and I noticed a performance delta of approximately 30% compared with native. Profiling with VTune, I see that the wasmtime disassembly contains a lot of call-stack setup/teardown instructions at the beginning and end of the function, while native (clang, -O3) does not.
I also used wasmtime explore to correlate the wat with the disassembly. Here are the relevant snippets of disassembly -

Wasm Setup of the stack -

Address	Source Line	Assembly
0x7f2b691d8040	0	push rbp
0x7f2b691d8041	0	mov rbp, rsp							
0x7f2b691d8044	0	mov r10, qword ptr [rdi+0x8]
0x7f2b691d8048	0	mov r10, qword ptr [r10]
0x7f2b691d804b	0	cmp r10, rsp
0x7f2b691d804e	0	jnbe 0x7f2b691d80b7 <Block 9>							
0x7f2b691d8054	0	Block 2:							
0x7f2b691d8054	0	sub rsp, 0x10							
0x7f2b691d8058	0	mov qword ptr [rsp], r12
0x7f2b691d805c	0	mov qword ptr [rsp+0x8], r15
0x7f2b691d8061	0	mov r15, rdi
0x7f2b691d8064	0	test edx, edx							
0x7f2b691d8066	0	mov r12, rdx
0x7f2b691d8069	0	jz 0x7f2b691d80a2 <Block 8>

I have pasted the wat file of this function below as well for reference.

Wasm Teardown -

Address	Source Line	Assembly
0x7f2b691d80a2	0	lea eax, ptr [rcx+0x1]
0x7f2b691d80a5	0	mov r12, qword ptr [rsp]	
0x7f2b691d80a9	0	mov r15, qword ptr [rsp+0x8]
0x7f2b691d80ae	0	add rsp, 0x10							
0x7f2b691d80b2	0	mov rsp, rbp
0x7f2b691d80b5	0	pop rbp
0x7f2b691d80b6	0	ret

wat of relevant function -

(func (;3;) (type 5) (param i32 i32) (result i32)
    local.get 0
    if ;; label = @1
      loop ;; label = @2
        local.get 1
        if (result i32) ;; label = @3
          local.get 0
          local.get 1
          i32.const 1
          i32.sub
          call 3
        else
          i32.const 1
        end
        local.set 1
        local.get 0
        i32.const 1
        i32.sub
        local.tee 0
        br_if 0 (;@2;)
      end
    end
    local.get 1
    i32.const 1
    i32.add
  )

The native disassembly is pretty short; the entire function is shown below (this is in AT&T syntax, unlike the Intel syntax of the snippets above) -

Address	Source Line	Assembly
0x1170	0	Block 1:							
0x1170	0	pushq  %rbx							
0x1171	0	mov %esi, %eax							
0x1173	0	test %edi, %edi							
0x1175	0	jz 0x119f <Block 8>							
0x1177	0	Block 2:							
0x1177	0	mov %edi, %ebx
0x1179	0	jmp 0x118a <Block 5>
0x117b	0	Block 3:							
0x117b	0	nopl  %eax, (%rax,%rax,1)							
0x1180	0	Block 4:							
0x1180	0	mov $0x1, %eax							
0x1185	0	add $0xffffffff, %ebx							
0x1188	0	jz 0x119f <Block 8>							
0x118a	0	Block 5:							
0x118a	0	test %eax, %eax							
0x118c	0	jz 0x1180 <Block 4>							
0x118e	0	Block 6:							
0x118e	0	add $0xffffffff, %eax							
0x1191	0	mov %ebx, %edi
0x1193	0	mov %eax, %esi
0x1195	0	callq  0x1170 <Block 1>
0x119a	0	Block 7:							
0x119a	0	add $0xffffffff, %ebx
0x119d	0	jnz 0x118a <Block 5>							
0x119f	0	Block 8:							
0x119f	0	add $0x1, %eax
0x11a2	0	popq  %rbx
0x11a3	0	retq

The C source function used to generate both the wasm and the native code is -

int ackermann(int M, int N)
{
    if (M == 0)
    {
        return N + 1;
    }
    if (N == 0)
    {
        return ackermann(M - 1, 1);
    }
    return ackermann(M - 1, ackermann(M, (N - 1)));
}
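
For readers comparing the listings: both the wat and the native disassembly appear to correspond to a form in which the two calls in tail position have been turned into a loop, leaving only the inner `ackermann(M, N - 1)` as a real call. The following is a minimal sketch of that shape, written by hand from the wat above; the name `ackermann_loop` is made up for illustration and is not something the compiler emits.

```c
/* Hand-written sketch of the loop form seen in the wat; the function name
   is hypothetical. Only the inner call remains recursive. */
int ackermann_loop(int m, int n)
{
    while (m != 0)                      /* wat: if / loop ... br_if 0 */
    {
        if (n != 0)
        {
            n = ackermann_loop(m, n - 1);   /* wat: call 3 */
        }
        else
        {
            n = 1;                          /* wat: i32.const 1 */
        }
        m -= 1;                         /* wat: local.tee 0 after i32.sub */
    }
    return n + 1;                       /* wat: local.get 1, i32.add */
}
```

Since M is still needed after the inner call returns, at least one value has to be kept live across the call in this form.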

I also tried the --wasm-features tail-call CLI flag; however, that actually made the performance slightly worse.
Any pointers on the difference in disassembly between native and wasm?


cfallin · 2024-03-19T02:26:35Z

Hi @rahulchaphalkar -- it looks like the difference is down to two fundamental factors:

1. We have explicit stack checks rather than implicit stack probes and reliance on guard pages. We've actually just been discussing this in #8135 ("Investigate use of stack probes and removal of explicit stack-limit checks"). That's the business with r10 before decrementing rsp.
2. We have two clobber-saves (r12 and r15), whereas the native code gets away with one (rbx). It would be a good exercise to trace through the assembly and see what the registers are used for; perhaps the native compiler's register allocator is able to be a bit smarter about reuse. It is fundamentally necessary to have some state on the stack, I think, since there is a recursive call (the one in non-tail position on the second-to-last line of the C function) and there is at least one word of state (M) necessary after it returns.
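
To make the first point concrete, here is a rough C model of what the four r10 instructions in the wasm prologue seem to do: load a pointer from offset 0x8 of the context in rdi, load a limit through it, and branch to a trap block if the limit is above rsp. The struct names, field names, and layout are assumptions made for this sketch (they are not Wasmtime's actual definitions), and the stack pointer is passed in as a plain integer because portable C cannot read rsp.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the runtime structures; names and layout are
   assumptions chosen only to mirror the two loads in the prologue. */
struct runtime_limits
{
    uintptr_t stack_limit;          /* lowest address the stack may reach */
};

struct vm_context
{
    uint64_t reserved;              /* offset 0x0: unused in this sketch */
    struct runtime_limits *limits;  /* offset 0x8 on a typical 64-bit target */
};

/* Models: mov r10, [rdi+0x8]; mov r10, [r10]; cmp r10, rsp; jnbe <trap>.
   `sp` is a simulated stack pointer value. Returns true when the branch
   to the trap block would be taken. */
static bool stack_check_fails(const struct vm_context *ctx, uintptr_t sp)
{
    uintptr_t limit = ctx->limits->stack_limit;  /* the two dependent loads */
    return limit > sp;                           /* unsigned "above" (jnbe) */
}
```

The native code has no equivalent of this sequence; it relies on guard pages instead, so it pays nothing per call for the check.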

fitzgen · 2024-03-19T15:58:07Z

And FWIW, it is known that the tail calling convention can currently lead to some slowdowns, which is why Wasm tail calls aren't enabled by default yet: #6759
