core/vm: reverse bit order in bytes of code bitmap #24120

chfast · 2021-12-16T09:21:34Z

This bit order is more natural for bit manipulation operations and we
can eliminate some small number of CPU instructions.

chfast · 2021-12-16T09:46:50Z

The benchmarks look OK-ish when measuring the whole change:

Haswell 4.4 GHz

name                           old speed      new speed      delta
JumpdestOpAnalysis/PUSH1-8     1.25GB/s ± 0%  1.35GB/s ± 0%   +7.67%  (p=0.000 n=16+16)
JumpdestOpAnalysis/PUSH2-8     1.30GB/s ± 0%  1.31GB/s ± 0%   +1.26%  (p=0.000 n=19+17)
JumpdestOpAnalysis/PUSH3-8     1.94GB/s ± 0%  1.94GB/s ± 0%   +0.03%  (p=0.032 n=17+17)
JumpdestOpAnalysis/PUSH4-8     2.33GB/s ± 0%  2.42GB/s ± 0%   +4.03%  (p=0.000 n=17+17)
JumpdestOpAnalysis/PUSH5-8     2.75GB/s ± 0%  2.75GB/s ± 0%   +0.02%  (p=0.024 n=17+18)
JumpdestOpAnalysis/PUSH6-8     2.87GB/s ± 0%  2.87GB/s ± 0%     ~     (p=0.832 n=18+17)
JumpdestOpAnalysis/PUSH7-8     3.48GB/s ± 0%  3.86GB/s ± 0%  +10.91%  (p=0.000 n=17+19)
JumpdestOpAnalysis/PUSH8-8     3.26GB/s ± 0%  3.56GB/s ± 0%   +8.90%  (p=0.000 n=16+20)
JumpdestOpAnalysis/PUSH9-8     3.35GB/s ± 0%  3.59GB/s ± 1%   +7.27%  (p=0.000 n=19+19)
JumpdestOpAnalysis/PUSH10-8    2.80GB/s ± 0%  2.58GB/s ± 0%   -7.92%  (p=0.000 n=19+20)
JumpdestOpAnalysis/PUSH11-8    2.90GB/s ± 0%  2.90GB/s ± 0%   -0.11%  (p=0.033 n=18+18)
JumpdestOpAnalysis/PUSH12-8    3.26GB/s ± 0%  3.33GB/s ± 0%   +2.22%  (p=0.000 n=19+18)
JumpdestOpAnalysis/PUSH13-8    3.48GB/s ± 0%  3.48GB/s ± 0%     ~     (p=0.060 n=20+17)
JumpdestOpAnalysis/PUSH14-8    3.48GB/s ± 0%  3.50GB/s ± 0%   +0.73%  (p=0.000 n=20+17)
JumpdestOpAnalysis/PUSH15-8    3.51GB/s ± 2%  3.66GB/s ± 0%   +4.26%  (p=0.000 n=18+20)
JumpdestOpAnalysis/PUSH16-8    4.90GB/s ± 0%  4.91GB/s ± 0%   +0.09%  (p=0.001 n=19+16)
JumpdestOpAnalysis/PUSH17-8    5.96GB/s ± 1%  4.87GB/s ± 0%  -18.20%  (p=0.000 n=20+16)
JumpdestOpAnalysis/PUSH18-8    4.53GB/s ± 1%  4.34GB/s ± 0%   -4.19%  (p=0.000 n=20+18)
JumpdestOpAnalysis/PUSH19-8    4.56GB/s ± 0%  4.34GB/s ± 0%   -4.88%  (p=0.000 n=17+17)
JumpdestOpAnalysis/PUSH20-8    4.94GB/s ± 0%  4.78GB/s ± 1%   -3.18%  (p=0.000 n=19+19)
JumpdestOpAnalysis/PUSH21-8    5.15GB/s ± 0%  4.87GB/s ± 1%   -5.32%  (p=0.000 n=17+18)
JumpdestOpAnalysis/PUSH22-8    5.07GB/s ± 0%  4.82GB/s ± 1%   -4.99%  (p=0.000 n=18+20)
JumpdestOpAnalysis/PUSH23-8    5.19GB/s ± 0%  4.93GB/s ± 1%   -4.92%  (p=0.000 n=19+20)
JumpdestOpAnalysis/PUSH24-8    5.40GB/s ± 0%  5.41GB/s ± 0%     ~     (p=0.055 n=18+16)
JumpdestOpAnalysis/PUSH25-8    5.61GB/s ± 0%  5.35GB/s ± 0%   -4.60%  (p=0.000 n=18+17)
JumpdestOpAnalysis/PUSH26-8    5.05GB/s ± 0%  4.87GB/s ± 0%   -3.57%  (p=0.000 n=18+17)
JumpdestOpAnalysis/PUSH27-8    5.05GB/s ± 0%  4.85GB/s ± 0%   -3.87%  (p=0.000 n=19+17)
JumpdestOpAnalysis/PUSH28-8    5.36GB/s ± 0%  5.23GB/s ± 0%   -2.49%  (p=0.000 n=19+17)
JumpdestOpAnalysis/PUSH29-8    5.52GB/s ± 0%  5.30GB/s ± 0%   -3.96%  (p=0.000 n=17+18)
JumpdestOpAnalysis/PUSH30-8    5.44GB/s ± 0%  5.23GB/s ± 0%   -3.83%  (p=0.000 n=18+18)
JumpdestOpAnalysis/PUSH31-8    5.53GB/s ± 0%  5.33GB/s ± 0%   -3.71%  (p=0.000 n=17+16)
JumpdestOpAnalysis/PUSH32-8    6.70GB/s ± 1%  6.47GB/s ± 0%   -3.49%  (p=0.000 n=20+16)
JumpdestOpAnalysis/JUMPDEST-8  1.87GB/s ± 0%  2.49GB/s ± 0%  +33.03%  (p=0.000 n=17+18)
JumpdestOpAnalysis/STOP-8      1.87GB/s ± 0%  2.49GB/s ± 0%  +33.02%  (p=0.000 n=17+17)
[Geo mean]                     3.67GB/s       3.69GB/s        +0.48%
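
A note on reading the `[Geo mean]` row: assuming it is the plain geometric mean of each speed column (which is how benchstat normally summarizes a table), large gains on some benchmarks and moderate losses on others can cancel out almost completely. The sketch below uses made-up speeds, not values from the tables here, and `geoMean` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math"
)

// geoMean returns the geometric mean of xs, computed via logarithms
// so that long lists do not overflow. Hypothetical helper, not from the PR.
func geoMean(xs []float64) float64 {
	sumLog := 0.0
	for _, x := range xs {
		sumLog += math.Log(x)
	}
	return math.Exp(sumLog / float64(len(xs)))
}

func main() {
	// Illustrative speeds in GB/s (invented for this example):
	// the first benchmark doubles, the second halves.
	before := []float64{1.0, 4.0}
	after := []float64{2.0, 2.0}

	gb, ga := geoMean(before), geoMean(after)
	fmt.Printf("geo mean before: %.2f GB/s\n", gb)
	fmt.Printf("geo mean after:  %.2f GB/s\n", ga)
	fmt.Printf("delta: %+.2f%%\n", (ga/gb-1)*100)
}
```

A +100% and a -50% change leave the geometric mean unchanged, which is why a near-zero geo mean delta can hide sizable per-benchmark movements in both directions.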

But if you inspect only the second commit, which only removes the lookup table for set1, you can see unexpected changes:

name                           old speed      new speed      delta
JumpdestOpAnalysis/PUSH1-8     1.25GB/s ± 0%  1.35GB/s ± 0%   +8.12%  (p=0.000 n=10+16)
JumpdestOpAnalysis/PUSH2-8     1.44GB/s ± 0%  1.31GB/s ± 0%   -8.64%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH3-8     1.94GB/s ± 0%  1.94GB/s ± 0%   -0.16%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH4-8     2.39GB/s ± 0%  2.42GB/s ± 0%   +1.36%  (p=0.000 n=10+17)
JumpdestOpAnalysis/PUSH5-8     2.66GB/s ± 1%  2.75GB/s ± 0%   +3.58%  (p=0.000 n=9+18)
JumpdestOpAnalysis/PUSH6-8     3.17GB/s ± 0%  2.87GB/s ± 0%   -9.35%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH7-8     3.86GB/s ± 0%  3.86GB/s ± 0%     ~     (p=0.160 n=8+19)
JumpdestOpAnalysis/PUSH8-8     2.71GB/s ± 3%  3.56GB/s ± 0%  +31.34%  (p=0.000 n=10+20)
JumpdestOpAnalysis/PUSH9-8     3.35GB/s ± 0%  3.59GB/s ± 1%   +7.10%  (p=0.000 n=9+19)
JumpdestOpAnalysis/PUSH10-8    2.97GB/s ± 0%  2.58GB/s ± 0%  -13.33%  (p=0.000 n=9+20)
JumpdestOpAnalysis/PUSH11-8    3.07GB/s ± 0%  2.90GB/s ± 0%   -5.64%  (p=0.000 n=8+18)
JumpdestOpAnalysis/PUSH12-8    3.29GB/s ± 0%  3.33GB/s ± 0%   +1.23%  (p=0.000 n=10+18)
JumpdestOpAnalysis/PUSH13-8    3.48GB/s ± 0%  3.48GB/s ± 0%     ~     (p=0.243 n=8+17)
JumpdestOpAnalysis/PUSH14-8    3.70GB/s ± 0%  3.50GB/s ± 0%   -5.32%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH15-8    3.86GB/s ± 0%  3.66GB/s ± 0%   -5.26%  (p=0.000 n=9+20)
JumpdestOpAnalysis/PUSH16-8    4.67GB/s ± 1%  4.91GB/s ± 0%   +5.18%  (p=0.000 n=9+16)
JumpdestOpAnalysis/PUSH17-8    5.97GB/s ± 0%  4.87GB/s ± 0%  -18.43%  (p=0.000 n=8+16)
JumpdestOpAnalysis/PUSH18-8    4.81GB/s ± 0%  4.34GB/s ± 0%   -9.73%  (p=0.000 n=9+18)
JumpdestOpAnalysis/PUSH19-8    4.81GB/s ± 0%  4.34GB/s ± 0%   -9.86%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH20-8    5.00GB/s ± 0%  4.78GB/s ± 1%   -4.44%  (p=0.000 n=10+19)
JumpdestOpAnalysis/PUSH21-8    5.08GB/s ± 0%  4.87GB/s ± 1%   -4.01%  (p=0.000 n=8+18)
JumpdestOpAnalysis/PUSH22-8    5.34GB/s ± 0%  4.82GB/s ± 1%   -9.75%  (p=0.000 n=8+20)
JumpdestOpAnalysis/PUSH23-8    5.46GB/s ± 0%  4.93GB/s ± 1%   -9.65%  (p=0.000 n=9+20)
JumpdestOpAnalysis/PUSH24-8    5.40GB/s ± 0%  5.41GB/s ± 0%     ~     (p=0.336 n=9+16)
JumpdestOpAnalysis/PUSH25-8    5.62GB/s ± 0%  5.35GB/s ± 0%   -4.68%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH26-8    5.28GB/s ± 0%  4.87GB/s ± 0%   -7.68%  (p=0.000 n=8+17)
JumpdestOpAnalysis/PUSH27-8    5.27GB/s ± 0%  4.85GB/s ± 0%   -7.86%  (p=0.000 n=8+17)
JumpdestOpAnalysis/PUSH28-8    5.41GB/s ± 0%  5.23GB/s ± 0%   -3.45%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH29-8    5.51GB/s ± 1%  5.30GB/s ± 0%   -3.79%  (p=0.000 n=10+18)
JumpdestOpAnalysis/PUSH30-8    5.67GB/s ± 0%  5.23GB/s ± 0%   -7.67%  (p=0.000 n=9+18)
JumpdestOpAnalysis/PUSH31-8    5.76GB/s ± 0%  5.33GB/s ± 0%   -7.54%  (p=0.000 n=9+16)
JumpdestOpAnalysis/PUSH32-8    6.77GB/s ± 0%  6.47GB/s ± 0%   -4.44%  (p=0.000 n=8+16)
JumpdestOpAnalysis/JUMPDEST-8  1.87GB/s ± 0%  2.49GB/s ± 0%  +32.99%  (p=0.000 n=8+18)
JumpdestOpAnalysis/STOP-8      1.87GB/s ± 0%  2.49GB/s ± 0%  +33.01%  (p=0.000 n=9+17)
[Geo mean]                     3.75GB/s       3.69GB/s        -1.71%

chfast · 2021-12-16T09:50:12Z

On a Zen3 under external load we got an even bigger boost for the non-PUSH benchmarks, but also a regression for PUSH1.

name                            old speed      new speed      delta
JumpdestOpAnalysis/PUSH1-12      854MB/s ± 3%   798MB/s ± 3%   -6.57%  (p=0.000 n=19+19)
JumpdestOpAnalysis/PUSH2-12     1.00GB/s ± 2%  1.07GB/s ± 3%   +7.27%  (p=0.000 n=19+20)
JumpdestOpAnalysis/PUSH3-12     1.23GB/s ± 1%  1.22GB/s ± 3%   -1.19%  (p=0.001 n=18+18)
JumpdestOpAnalysis/PUSH4-12     1.57GB/s ± 3%  1.64GB/s ± 2%   +4.55%  (p=0.000 n=19+19)
JumpdestOpAnalysis/PUSH5-12     1.96GB/s ± 2%  1.83GB/s ± 2%   -6.52%  (p=0.000 n=20+19)
JumpdestOpAnalysis/PUSH6-12     2.10GB/s ± 2%  2.14GB/s ± 2%   +2.05%  (p=0.000 n=20+18)
JumpdestOpAnalysis/PUSH7-12     2.14GB/s ± 2%  2.13GB/s ± 2%     ~     (p=0.325 n=19+19)
JumpdestOpAnalysis/PUSH8-12     1.93GB/s ± 2%  2.12GB/s ± 2%   +9.87%  (p=0.000 n=19+19)
JumpdestOpAnalysis/PUSH9-12     2.38GB/s ± 2%  2.34GB/s ± 2%   -1.68%  (p=0.000 n=18+19)
JumpdestOpAnalysis/PUSH10-12    2.11GB/s ± 3%  2.32GB/s ± 1%   +9.83%  (p=0.000 n=20+19)
JumpdestOpAnalysis/PUSH11-12    2.14GB/s ± 3%  2.12GB/s ± 4%     ~     (p=0.070 n=19+19)
JumpdestOpAnalysis/PUSH12-12    2.42GB/s ± 2%  2.52GB/s ± 2%   +3.99%  (p=0.000 n=19+18)
JumpdestOpAnalysis/PUSH13-12    2.56GB/s ± 2%  2.59GB/s ± 2%   +1.26%  (p=0.000 n=19+20)
JumpdestOpAnalysis/PUSH14-12    2.67GB/s ± 2%  2.66GB/s ± 3%     ~     (p=0.588 n=20+19)
JumpdestOpAnalysis/PUSH15-12    2.62GB/s ± 2%  2.61GB/s ± 2%     ~     (p=0.418 n=19+19)
JumpdestOpAnalysis/PUSH16-12    3.02GB/s ± 2%  3.63GB/s ± 2%  +20.21%  (p=0.000 n=19+19)
JumpdestOpAnalysis/PUSH17-12    3.45GB/s ± 3%  3.46GB/s ± 5%     ~     (p=0.607 n=19+20)
JumpdestOpAnalysis/PUSH18-12    3.06GB/s ± 3%  3.59GB/s ± 2%  +17.22%  (p=0.000 n=20+18)
JumpdestOpAnalysis/PUSH19-12    3.06GB/s ± 2%  3.26GB/s ± 3%   +6.75%  (p=0.000 n=19+18)
JumpdestOpAnalysis/PUSH20-12    3.36GB/s ± 2%  3.60GB/s ± 2%   +6.91%  (p=0.000 n=17+20)
JumpdestOpAnalysis/PUSH21-12    3.55GB/s ± 1%  3.57GB/s ± 2%     ~     (p=0.354 n=19+19)
JumpdestOpAnalysis/PUSH22-12    3.47GB/s ± 2%  3.70GB/s ± 3%   +6.85%  (p=0.000 n=18+19)
JumpdestOpAnalysis/PUSH23-12    3.41GB/s ± 3%  3.63GB/s ± 2%   +6.49%  (p=0.000 n=20+19)
JumpdestOpAnalysis/PUSH24-12    3.81GB/s ± 1%  4.09GB/s ± 3%   +7.34%  (p=0.000 n=20+19)
JumpdestOpAnalysis/PUSH25-12    4.24GB/s ± 3%  3.95GB/s ± 2%   -7.00%  (p=0.000 n=19+18)
JumpdestOpAnalysis/PUSH26-12    3.77GB/s ± 3%  3.80GB/s ± 2%     ~     (p=0.116 n=19+19)
JumpdestOpAnalysis/PUSH27-12    3.70GB/s ± 5%  3.71GB/s ± 3%     ~     (p=0.665 n=19+19)
JumpdestOpAnalysis/PUSH28-12    3.97GB/s ± 3%  3.99GB/s ± 3%     ~     (p=0.224 n=19+20)
JumpdestOpAnalysis/PUSH29-12    4.02GB/s ± 2%  3.98GB/s ± 3%   -1.18%  (p=0.029 n=19+18)
JumpdestOpAnalysis/PUSH30-12    4.14GB/s ± 2%  4.22GB/s ± 1%   +2.15%  (p=0.000 n=18+19)
JumpdestOpAnalysis/PUSH31-12    3.99GB/s ± 3%  4.26GB/s ± 2%   +6.59%  (p=0.000 n=20+20)
JumpdestOpAnalysis/PUSH32-12    4.64GB/s ± 3%  4.65GB/s ± 2%     ~     (p=0.583 n=19+19)
JumpdestOpAnalysis/JUMPDEST-12  1.07GB/s ± 2%  1.54GB/s ± 3%  +44.53%  (p=0.000 n=19+19)
JumpdestOpAnalysis/STOP-12      1.06GB/s ± 3%  1.54GB/s ± 2%  +45.23%  (p=0.000 n=19+18)
[Geo mean]                      2.54GB/s       2.66GB/s        +4.92%

chfast · 2021-12-16T10:17:10Z

Assembly diff for lookup table removal.

diff --git a/rev1.asm b/rev2.asm
index 4f7d7310d..c92d58357 100644
--- a/rev1.asm
+++ b/rev2.asm
@@ -12,7 +12,7 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVQ R9, CX
                     NOPL
                     CMPQ CX, BX
-                    JBE 0x62ad66
+                    JBE 0x62ad3c
 		op := OpCode(code[pc])
                     MOVZX 0(AX)(CX*1), DX
 		pc++
@@ -20,49 +20,46 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
 		if op < PUSH1 || op > PUSH32 {
                     LEAL -0x60(DX), R10
                     CMPL $0x1f, R10
-                    JBE 0x62aac4
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
-			continue
-                    JMP 0x62aa9c
+                    JA 0x62aa9c
 		numbits := op - PUSH1 + 1
                     ADDL $-0x5f, DX
+                    NOPW
 		if numbits >= 8 {
                     CMPL $0x8, DL
-                    JAE 0x62ae18
+                    JAE 0x62aded
 		switch numbits {
                     CMPL $0x3, DL
 		case 3:
-                    JA 0x62abd4
+                    JA 0x62abc5
 		case 1:
                     CMPL $0x1, DL
-                    JNE 0x62ab14
+                    JNE 0x62ab03
 			bits.set1(pc)
                     NOPL
-	bits[pos/8] |= lookup[pos%8]
+	bits[pos/8] |= 1 << (pos % 8)
                     MOVQ R9, DX
                     SHRQ $0x3, R9
+                    NOPL
                     CMPQ R9, SI
-                    JBE 0x62adfe
+                    JBE 0x62add3
                     MOVZX 0(DI)(R9*1), R10
                     MOVQ DX, R11
                     ANDQ $0x7, DX
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
-                    MOVZX 0(R12)(DX*1), DX
-                    ORL R10, DX
-                    MOVB DL, 0(DI)(R9*1)
+                    BTSL DX, R10
+                    MOVB R10, 0(DI)(R9*1)
 			pc += 1
                     LEAQ 0x1(R11), R9
                     JMP 0x62aa9c
 		case 2:
                     CMPL $0x2, DL
-                    JNE 0x62ab72
+                    JNE 0x62ab60
 			bits.setN(set2BitsMask, pc)
                     NOPL
 	bits[pos/8] |= l
                     MOVQ R9, CX
                     SHRQ $0x3, R9
                     CMPQ R9, SI
-                    JBE 0x62adf3
+                    JBE 0x62adc8
 	a := flag << (pos % 8)
                     MOVQ CX, DX
                     ANDQ $0x7, CX
@@ -76,27 +73,27 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     SHRW $0x8, R10
 	if h != 0 {
                     TESTL R10, R10
-                    JE 0x62ab62
+                    JE 0x62ab51
 		bits[pos/8+1] = h
                     LEAQ 0x1(R9), R11
                     CMPQ R11, SI
-                    JBE 0x62ade8
+                    JBE 0x62adbd
                     MOVB R10, 0x1(R9)(DI*1)
 			pc += 2
                     LEAQ 0x2(DX), R9
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
                     JMP 0x62aa9c
+                    NOPW 0(AX)(AX*1)
 		switch numbits {
                     CMPL $0x3, DL
 		case 3:
-                    JNE 0x62ad5a
+                    JNE 0x62aa9c
 			bits.setN(set3BitsMask, pc)
                     NOPL
 	bits[pos/8] |= l
                     MOVQ R9, CX
                     SHRQ $0x3, R9
                     CMPQ R9, SI
-                    JBE 0x62addd
+                    JBE 0x62adb2
 	a := flag << (pos % 8)
                     MOVQ CX, DX
                     ANDQ $0x7, CX
@@ -108,33 +105,34 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVB R11, 0(DI)(R9*1)
 	h := byte(a >> 8)
                     SHRW $0x8, R10
+                    NOPL 0(AX)(AX*1)
 	if h != 0 {
                     TESTL R10, R10
-                    JE 0x62abc4
+                    JE 0x62abb7
 		bits[pos/8+1] = h
                     LEAQ 0x1(R9), R11
                     CMPQ R11, SI
-                    JBE 0x62add2
+                    JBE 0x62ada7
                     MOVB R10, 0x1(R9)(DI*1)
 			pc += 3
                     LEAQ 0x3(DX), R9
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
+                    NOPL 0(AX)(AX*1)
                     JMP 0x62aa9c
 		switch numbits {
                     CMPL $0x5, DL
 		case 5:
-                    JA 0x62aca0
-                    NOPL 0(AX)
+                    JA 0x62ac85
 		case 4:
                     CMPL $0x4, DL
-                    JNE 0x62ac3e
+                    JNE 0x62ac2a
 			bits.setN(set4BitsMask, pc)
                     NOPL
 	bits[pos/8] |= l
                     MOVQ R9, CX
                     SHRQ $0x3, R9
+                    NOPL 0(AX)(AX*1)
                     CMPQ R9, SI
-                    JBE 0x62adc7
+                    JBE 0x62ad9c
 	a := flag << (pos % 8)
                     MOVQ CX, DX
                     ANDQ $0x7, CX
@@ -148,15 +146,14 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     SHRW $0x8, R10
 	if h != 0 {
                     TESTL R10, R10
-                    JE 0x62ac2e
+                    JE 0x62ac21
 		bits[pos/8+1] = h
                     LEAQ 0x1(R9), R11
                     CMPQ R11, SI
-                    JBE 0x62adbc
+                    JBE 0x62ad91
                     MOVB R10, 0x1(R9)(DI*1)
 			pc += 4
                     LEAQ 0x4(DX), R9
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
                     JMP 0x62aa9c
 			bits.setN(set5BitsMask, pc)
                     NOPL
@@ -164,7 +161,7 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVQ R9, CX
                     SHRQ $0x3, R9
                     CMPQ R9, SI
-                    JBE 0x62adb1
+                    JBE 0x62ad86
 	a := flag << (pos % 8)
                     MOVQ CX, DX
                     ANDQ $0x7, CX
@@ -176,30 +173,29 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVB R11, 0(DI)(R9*1)
 	h := byte(a >> 8)
                     SHRW $0x8, R10
+                    NOPL 0(AX)
 	if h != 0 {
                     TESTL R10, R10
-                    JE 0x62ac8e
+                    JE 0x62ac77
 		bits[pos/8+1] = h
                     LEAQ 0x1(R9), R11
-                    NOPL 0(AX)
                     CMPQ R11, SI
-                    JBE 0x62ada6
+                    JBE 0x62ad7b
                     MOVB R10, 0x1(R9)(DI*1)
 			pc += 5
                     LEAQ 0x5(DX), R9
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
+                    NOPL 0(AX)(AX*1)
                     JMP 0x62aa9c
-                    NOPW
 		case 6:
                     CMPL $0x6, DL
-                    JNE 0x62ad00
+                    JNE 0x62ace5
 			bits.setN(set6BitsMask, pc)
                     NOPL
 	bits[pos/8] |= l
                     MOVQ R9, CX
                     SHRQ $0x3, R9
                     CMPQ R9, SI
-                    JBE 0x62ad9b
+                    JBE 0x62ad70
 	a := flag << (pos % 8)
                     MOVQ CX, DX
                     ANDQ $0x7, CX
@@ -211,29 +207,29 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVB R11, 0(DI)(R9*1)
 	h := byte(a >> 8)
                     SHRW $0x8, R10
+                    NOPL 0(AX)
 	if h != 0 {
                     TESTL R10, R10
-                    JE 0x62acee
+                    JE 0x62acd7
 		bits[pos/8+1] = h
                     LEAQ 0x1(R9), R11
                     CMPQ R11, SI
-                    JBE 0x62ad90
+                    JBE 0x62ad65
                     MOVB R10, 0x1(R9)(DI*1)
 			pc += 6
                     LEAQ 0x6(DX), R9
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
+                    NOPL 0(AX)(AX*1)
                     JMP 0x62aa9c
-                    NOPW
 		case 7:
                     CMPL $0x7, DL
-                    JNE 0x62ad5a
+                    JNE 0x62aa9c
 			bits.setN(set7BitsMask, pc)
                     NOPL
 	bits[pos/8] |= l
                     MOVQ R9, CX
                     SHRQ $0x3, R9
                     CMPQ R9, SI
-                    JBE 0x62ad85
+                    JBE 0x62ad5a
 	a := flag << (pos % 8)
                     MOVQ CX, DX
                     ANDQ $0x7, CX
@@ -245,21 +241,17 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVB R11, 0(DI)(R9*1)
 	h := byte(a >> 8)
                     SHRW $0x8, R10
+                    NOPL 0(AX)
 	if h != 0 {
                     TESTL R10, R10
-                    JE 0x62ad4a
+                    JE 0x62ad33
 		bits[pos/8+1] = h
                     LEAQ 0x1(R9), R11
-                    NOPL 0(AX)
                     CMPQ R11, SI
-                    JBE 0x62ad79
+                    JBE 0x62ad4f
                     MOVB R10, 0x1(R9)(DI*1)
 			pc += 7
                     LEAQ 0x7(DX), R9
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
-                    JMP 0x62aa9c
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
-		switch numbits {
                     JMP 0x62aa9c
 	return bits
                     MOVQ DI, AX
@@ -271,7 +263,6 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
 		bits[pos/8+1] = h
                     MOVQ R11, AX
                     MOVQ SI, CX
-                    NOPL
                     CALL runtime.panicIndexU(SB)
 	bits[pos/8] |= l
                     MOVQ R9, AX
@@ -317,7 +308,7 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVQ R9, AX
                     MOVQ SI, CX
                     CALL runtime.panicIndexU(SB)
-	bits[pos/8] |= lookup[pos%8]
+	bits[pos/8] |= 1 << (pos % 8)
                     MOVQ R9, AX
                     MOVQ SI, CX
                     CALL runtime.panicIndexU(SB)
@@ -330,14 +321,15 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     LEAQ 0x10(R10), R9
 			for ; numbits >= 16; numbits -= 16 {
                     CMPL $0x10, DL
-                    JB 0x62ae76
+                    JB 0x62ae56
 				bits.set16(pc)
                     NOPL
 	bits[pos/8] |= a
                     MOVQ R9, CX
                     SHRQ $0x3, R9
+                    NOPW 0(AX)(AX*1)
                     CMPQ R9, SI
-                    JBE 0x62aedd
+                    JBE 0x62aebd
 	a := byte(0xFF << (pos % 8))
                     MOVQ CX, R10
                     ANDQ $0x7, CX
@@ -350,14 +342,14 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
 	bits[pos/8+1] = 0xFF
                     LEAQ 0x1(R9), R12
                     CMPQ R12, SI
-                    JBE 0x62aed2
+                    JBE 0x62aeb2
                     MOVB $0xff, 0x1(R9)(DI*1)
 	bits[pos/8+2] = ^a
                     LEAQ 0x2(R9), R12
                     NOPL 0(AX)
                     CMPQ R12, SI
-                    JA 0x62ae09
-                    JMP 0x62aec7
+                    JA 0x62adde
+                    JMP 0x62aea7
 	bits[pos/8+1] = ^a
                     NOTL R11
                     MOVB R11, 0x1(R9)(DI*1)
@@ -367,14 +359,14 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     LEAQ 0x8(R10), R9
 			for ; numbits >= 8; numbits -= 8 {
                     CMPL $0x8, DL
-                    JB 0x62aad0
+                    JB 0x62aac9
 				bits.set8(pc)
                     NOPL
 	bits[pos/8] |= a
                     MOVQ R9, CX
                     SHRQ $0x3, R9
                     CMPQ R9, SI
-                    JBE 0x62aebc
+                    JBE 0x62ae9c
 	a := byte(0xFF << (pos % 8))
                     MOVQ CX, R10
                     ANDQ $0x7, CX
@@ -387,7 +379,7 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
 	bits[pos/8+1] = ^a
                     LEAQ 0x1(R9), R12
                     CMPQ R12, SI
-                    JA 0x62ae67
+                    JA 0x62ae47
                     MOVQ R12, AX
                     MOVQ SI, CX
                     CALL runtime.panicIndexU(SB)
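
For readers following the diff, here is a minimal Go sketch of the source-level change it corresponds to. The contents of `lookup` are an assumption (the thread never shows the table); for the shift to be a drop-in replacement it has to hold `1 << i` at index `i`. The `setN` body is pieced together from the source lines interleaved in the diff, with `byte(a)` for the low byte being an assumption, and the method names `set1Lookup` and `set1Shift` are invented for the comparison.

```go
package main

import "fmt"

type bitvec []byte

// lookup is an ASSUMED table: index i holds the mask with only bit i set.
var lookup = [8]byte{0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80}

// set1Lookup is the variant before the second commit (table load).
func (bits bitvec) set1Lookup(pos uint64) {
	bits[pos/8] |= lookup[pos%8]
}

// set1Shift is the variant after the second commit (plain shift,
// which the compiler can turn into a single BTS on amd64).
func (bits bitvec) set1Shift(pos uint64) {
	bits[pos/8] |= 1 << (pos % 8)
}

// setN sets a run of bits described by flag, starting at pos.
// The run may spill into the next byte, hence the 16-bit intermediate.
func (bits bitvec) setN(flag uint16, pos uint64) {
	a := flag << (pos % 8)
	bits[pos/8] |= byte(a) // low byte (assumed to be the "l" in the diff)
	if h := byte(a >> 8); h != 0 {
		bits[pos/8+1] = h // high byte, only written when non-zero
	}
}

func main() {
	// Both set1 variants must produce identical bitmaps for every position.
	same := true
	for pos := uint64(0); pos < 64; pos++ {
		x, y := make(bitvec, 8), make(bitvec, 8)
		x.set1Lookup(pos)
		y.set1Shift(pos)
		for i := range x {
			if x[i] != y[i] {
				same = false
			}
		}
	}
	fmt.Println("lookup and shift agree:", same) // prints: lookup and shift agree: true

	// Three bits starting at position 6 cross a byte boundary.
	v := make(bitvec, 2)
	v.setN(0b111, 6) // 0b111 stands in for a 3-bit mask; the constant's value is not shown in the thread
	fmt.Printf("%08b %08b\n", v[0], v[1]) // prints: 11000000 00000001
}
```

Because the two `set1` variants are equivalent under that assumption, any speed difference between them comes from code generation and layout rather than from a change in behavior.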

holiman · 2021-12-16T10:48:39Z

Why should reversing the bits be faster? (and with that I'm not being snarky and saying it isn't, I'm wondering what the theory-behind-the-scenes is)

This bit order is more natural for bit manipulation operations and we can eliminate some small number of CPU instructions.

chfast · 2021-12-16T11:17:25Z

Why should reversing the bits be faster? (and with that I'm not being snarky and saying it isn't, I'm wondering what the theory-behind-the-scenes is)

It is a bit more "natural" for some bit-manip CPU instructions.

E.g. in

func (bits bitvec) set1_x(pos uint64) {
	bits[pos/8] |= 0x80 >> (pos%8)
}

vs

func (bits bitvec) set1(pos uint64) {
	bits[pos/8] |= 1 << (pos%8)
}

the core bit manipulation part

        MOVL    $-128, BX
        SHRB    CL, BL
        ORL     BX, DX

is replaced with a BTS instruction

        BTSL    DX, CX
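
To make the two layouts concrete, the sketch below sets the same positions under both bit orders and prints the resulting bytes. Only `bitvec` and the two shift expressions come from the snippets above; the method names `set1MSB`, `set1LSB`, `isSetMSB` and `isSetLSB` are made up for this illustration. Note that whichever order is used for writing must also be used for reading.

```go
package main

import "fmt"

type bitvec []byte

// set1MSB marks pos using the old order: position 0 is the top bit (0x80).
func (bits bitvec) set1MSB(pos uint64) {
	bits[pos/8] |= 0x80 >> (pos % 8)
}

// set1LSB marks pos using the new order: position 0 is the low bit (0x01).
func (bits bitvec) set1LSB(pos uint64) {
	bits[pos/8] |= 1 << (pos % 8)
}

// isSetMSB reads a bit written by set1MSB.
func (bits bitvec) isSetMSB(pos uint64) bool {
	return bits[pos/8]&(0x80>>(pos%8)) != 0
}

// isSetLSB reads a bit written by set1LSB.
func (bits bitvec) isSetLSB(pos uint64) bool {
	return bits[pos/8]&(1<<(pos%8)) != 0
}

func main() {
	msb, lsb := make(bitvec, 2), make(bitvec, 2)
	for _, pos := range []uint64{0, 3, 9} {
		msb.set1MSB(pos)
		lsb.set1LSB(pos)
	}
	// Each byte in one layout is the bit-reversal of the same byte in the other.
	fmt.Printf("MSB-first: %08b %08b\n", msb[0], msb[1]) // prints: MSB-first: 10010000 01000000
	fmt.Printf("LSB-first: %08b %08b\n", lsb[0], lsb[1]) // prints: LSB-first: 00001001 00000010

	// Reads agree as long as the reader matches the writer's order.
	fmt.Println(msb.isSetMSB(9), lsb.isSetLSB(9)) // prints: true true
}
```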

chfast · 2021-12-16T11:39:57Z

One more set of benchmark results, from a Skylake laptop.

name                           old speed      new speed      delta
JumpdestOpAnalysis/PUSH1-8     1.29GB/s ± 3%  1.31GB/s ± 1%   +1.41%  (p=0.001 n=30+17)
JumpdestOpAnalysis/PUSH2-8     1.41GB/s ± 5%  1.44GB/s ± 3%   +2.40%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH3-8     1.52GB/s ± 5%  1.73GB/s ± 5%  +14.10%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH4-8     2.13GB/s ± 5%  2.28GB/s ± 5%   +6.87%  (p=0.000 n=30+19)
JumpdestOpAnalysis/PUSH5-8     2.66GB/s ± 5%  2.69GB/s ± 4%     ~     (p=0.211 n=30+20)
JumpdestOpAnalysis/PUSH6-8     2.96GB/s ± 4%  3.09GB/s ± 2%   +4.35%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH7-8     2.71GB/s ± 4%  3.40GB/s ± 2%  +25.53%  (p=0.000 n=29+20)
JumpdestOpAnalysis/PUSH8-8     2.66GB/s ± 3%  3.02GB/s ± 7%  +13.60%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH9-8     2.93GB/s ± 5%  3.16GB/s ± 4%   +7.74%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH10-8    2.96GB/s ± 5%  2.99GB/s ± 3%     ~     (p=0.055 n=30+20)
JumpdestOpAnalysis/PUSH11-8    2.81GB/s ± 4%  3.06GB/s ± 1%   +8.62%  (p=0.000 n=30+19)
JumpdestOpAnalysis/PUSH12-8    3.21GB/s ± 5%  3.39GB/s ± 4%   +5.42%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH13-8    3.43GB/s ± 4%  3.43GB/s ± 3%     ~     (p=0.659 n=30+20)
JumpdestOpAnalysis/PUSH14-8    3.57GB/s ± 5%  3.68GB/s ± 2%   +3.22%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH15-8    3.56GB/s ± 4%  3.84GB/s ± 3%   +7.91%  (p=0.000 n=30+19)
JumpdestOpAnalysis/PUSH16-8    4.98GB/s ± 3%  4.81GB/s ± 6%   -3.30%  (p=0.046 n=30+20)
JumpdestOpAnalysis/PUSH17-8    5.14GB/s ± 5%  5.15GB/s ± 2%     ~     (p=0.205 n=30+20)
JumpdestOpAnalysis/PUSH18-8    4.72GB/s ± 4%  4.54GB/s ± 3%   -3.70%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH19-8    4.40GB/s ± 4%  4.32GB/s ± 6%   -1.83%  (p=0.048 n=30+20)
JumpdestOpAnalysis/PUSH20-8    4.84GB/s ± 5%  4.89GB/s ± 6%     ~     (p=0.083 n=30+20)
JumpdestOpAnalysis/PUSH21-8    4.99GB/s ± 5%  4.63GB/s ± 8%   -7.10%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH22-8    5.12GB/s ± 3%  4.88GB/s ± 7%   -4.73%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH23-8    4.98GB/s ± 4%  5.04GB/s ± 2%   +1.22%  (p=0.008 n=30+17)
JumpdestOpAnalysis/PUSH24-8    5.50GB/s ± 4%  5.39GB/s ± 5%   -1.90%  (p=0.040 n=30+20)
JumpdestOpAnalysis/PUSH25-8    5.36GB/s ± 5%  5.33GB/s ± 3%     ~     (p=0.806 n=30+20)
JumpdestOpAnalysis/PUSH26-8    5.11GB/s ± 4%  5.10GB/s ± 3%     ~     (p=0.837 n=30+20)
JumpdestOpAnalysis/PUSH27-8    4.95GB/s ± 5%  5.07GB/s ± 5%   +2.43%  (p=0.003 n=30+20)
JumpdestOpAnalysis/PUSH28-8    5.30GB/s ± 5%  5.32GB/s ± 2%     ~     (p=0.350 n=30+20)
JumpdestOpAnalysis/PUSH29-8    5.43GB/s ± 5%  5.37GB/s ± 3%     ~     (p=0.073 n=30+20)
JumpdestOpAnalysis/PUSH30-8    5.49GB/s ± 7%  5.47GB/s ± 4%     ~     (p=0.603 n=30+20)
JumpdestOpAnalysis/PUSH31-8    5.44GB/s ± 4%  5.50GB/s ± 4%     ~     (p=0.204 n=30+20)
JumpdestOpAnalysis/PUSH32-8    6.50GB/s ± 5%  6.27GB/s ± 6%   -3.44%  (p=0.016 n=30+20)
JumpdestOpAnalysis/JUMPDEST-8  1.65GB/s ± 4%  1.90GB/s ±15%  +14.63%  (p=0.000 n=30+20)
JumpdestOpAnalysis/STOP-8      1.66GB/s ± 4%  1.90GB/s ±15%  +14.53%  (p=0.011 n=30+20)
[Geo mean]                     3.53GB/s       3.64GB/s        +3.05%

holiman · 2021-12-17T08:50:30Z

On my laptop (master vs. this PR, both commits):

name                           old speed      new speed      delta
JumpdestOpAnalysis/PUSH1-6     1.03GB/s ± 4%  1.07GB/s ±10%     ~     (p=0.063 n=10+10)
JumpdestOpAnalysis/PUSH2-6     1.15GB/s ± 5%  1.20GB/s ± 4%   +4.11%  (p=0.028 n=10+9)
JumpdestOpAnalysis/PUSH3-6     1.40GB/s ± 6%  1.32GB/s ± 7%   -5.78%  (p=0.004 n=10+10)
JumpdestOpAnalysis/PUSH4-6     1.77GB/s ± 5%  1.86GB/s ± 2%   +5.24%  (p=0.000 n=10+8)
JumpdestOpAnalysis/PUSH5-6     2.16GB/s ± 7%  2.23GB/s ± 4%   +3.54%  (p=0.019 n=10+10)
JumpdestOpAnalysis/PUSH6-6     2.36GB/s ± 6%  2.52GB/s ± 5%   +6.82%  (p=0.000 n=10+9)
JumpdestOpAnalysis/PUSH7-6     2.54GB/s ± 7%  2.67GB/s ± 4%   +5.28%  (p=0.001 n=10+10)
JumpdestOpAnalysis/PUSH8-6     1.99GB/s ± 2%  2.57GB/s ± 5%  +29.06%  (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH9-6     2.36GB/s ± 7%  2.59GB/s ± 6%   +9.69%  (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH10-6    2.41GB/s ± 6%  2.41GB/s ± 1%     ~     (p=0.460 n=10+8)
JumpdestOpAnalysis/PUSH11-6    2.22GB/s ± 6%  2.49GB/s ± 9%  +12.04%  (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH12-6    2.60GB/s ± 5%  2.81GB/s ± 3%   +8.15%  (p=0.000 n=10+9)
JumpdestOpAnalysis/PUSH13-6    2.83GB/s ± 4%  2.82GB/s ± 5%     ~     (p=1.000 n=9+10)
JumpdestOpAnalysis/PUSH14-6    2.86GB/s ± 8%  3.08GB/s ± 1%   +7.54%  (p=0.000 n=10+8)
JumpdestOpAnalysis/PUSH15-6    2.92GB/s ± 5%  3.15GB/s ± 4%   +7.73%  (p=0.000 n=9+9)
JumpdestOpAnalysis/PUSH16-6    4.05GB/s ± 5%  4.03GB/s ± 6%     ~     (p=0.529 n=10+10)
JumpdestOpAnalysis/PUSH17-6    4.14GB/s ± 6%  4.11GB/s ± 1%     ~     (p=0.897 n=10+8)
JumpdestOpAnalysis/PUSH18-6    3.81GB/s ± 5%  3.74GB/s ± 6%     ~     (p=0.661 n=10+9)
JumpdestOpAnalysis/PUSH19-6    3.64GB/s ± 7%  3.26GB/s ±32%     ~     (p=0.105 n=10+10)
JumpdestOpAnalysis/PUSH20-6    3.90GB/s ± 5%  3.93GB/s ± 9%     ~     (p=0.400 n=9+10)
JumpdestOpAnalysis/PUSH21-6    4.03GB/s ± 8%  3.88GB/s ± 7%   -3.53%  (p=0.043 n=10+9)
JumpdestOpAnalysis/PUSH22-6    4.25GB/s ± 2%  4.14GB/s ±10%     ~     (p=0.546 n=9+9)
JumpdestOpAnalysis/PUSH23-6    4.25GB/s ± 3%  4.10GB/s ± 7%   -3.53%  (p=0.006 n=8+10)
JumpdestOpAnalysis/PUSH24-6    4.52GB/s ± 4%  4.46GB/s ± 8%     ~     (p=0.604 n=9+10)
JumpdestOpAnalysis/PUSH25-6    4.29GB/s ± 7%  4.31GB/s ± 9%     ~     (p=0.684 n=10+10)
JumpdestOpAnalysis/PUSH26-6    4.11GB/s ± 5%  4.07GB/s ±11%     ~     (p=0.853 n=10+10)
JumpdestOpAnalysis/PUSH27-6    4.04GB/s ± 6%  4.06GB/s ± 8%     ~     (p=0.796 n=10+10)
JumpdestOpAnalysis/PUSH28-6    4.27GB/s ± 9%  4.21GB/s ± 3%     ~     (p=0.143 n=10+10)
JumpdestOpAnalysis/PUSH29-6    4.37GB/s ± 4%  4.35GB/s ± 9%     ~     (p=1.000 n=10+10)
JumpdestOpAnalysis/PUSH30-6    4.38GB/s ± 5%  4.42GB/s ± 5%     ~     (p=0.684 n=10+10)
JumpdestOpAnalysis/PUSH31-6    4.38GB/s ± 3%  4.46GB/s ± 6%     ~     (p=0.165 n=10+10)
JumpdestOpAnalysis/PUSH32-6    5.15GB/s ± 1%  5.15GB/s ± 7%     ~     (p=0.958 n=6+10)
JumpdestOpAnalysis/JUMPDEST-6  1.36GB/s ± 5%  1.55GB/s ± 5%  +14.30%  (p=0.000 n=10+10)
JumpdestOpAnalysis/STOP-6      1.37GB/s ± 4%  1.56GB/s ± 5%  +13.45%  (p=0.000 n=9+10)

So in summary: the worst cases were improved or not changed. LGTM

holiman

LGTM

* core/vm: reverse bit order in bytes of code bitmap

  This bit order is more natural for bit manipulation operations and we can eliminate some small number of CPU instructions.

* core/vm: drop lookup table

fjl changed the title ~~Reverse bit order in bytes of code bitmap~~ core/vm: reverse bit order in bytes of code bitmap Dec 16, 2021

chfast force-pushed the analysis_reversed branch from 2917c6a to a4418ae Compare December 16, 2021 11:02

chfast added 2 commits December 16, 2021 12:04

core/vm: reverse bit order in bytes of code bitmap

a1d7279

This bit order is more natural for bit manipulation operations and we can eliminate some small number of CPU instructions.

core/vm: drop lookup table

a864a7e

chfast force-pushed the analysis_reversed branch from a4418ae to a864a7e Compare December 16, 2021 11:04

chfast marked this pull request as ready for review December 16, 2021 11:58

chfast requested review from holiman, karalabe and rjl493456442 as code owners December 16, 2021 11:58

holiman approved these changes Dec 17, 2021

holiman added this to the 1.10.14 milestone Dec 17, 2021

holiman merged commit 81ec6b1 into ethereum:master Dec 17, 2021

holiman deleted the analysis_reversed branch December 17, 2021 09:32

gzliudan mentioned this pull request Feb 27, 2024

preparation for solidity v0.8.23 upgrade XinFinOrg/XDPoSChain#452

Merged
