core/vm: reverse bit order in bytes of code bitmap #24120

Merged
merged 2 commits into from Dec 17, 2021

chfast
Contributor

@chfast chfast commented Dec 16, 2021

This bit order is more natural for bit manipulation operations and lets
us eliminate a small number of CPU instructions.

@chfast
Contributor Author

chfast commented Dec 16, 2021

The benchmarks look ok-ish when checking the whole change:

Haswell 4.4 GHz

name                           old speed      new speed      delta
JumpdestOpAnalysis/PUSH1-8     1.25GB/s ± 0%  1.35GB/s ± 0%   +7.67%  (p=0.000 n=16+16)
JumpdestOpAnalysis/PUSH2-8     1.30GB/s ± 0%  1.31GB/s ± 0%   +1.26%  (p=0.000 n=19+17)
JumpdestOpAnalysis/PUSH3-8     1.94GB/s ± 0%  1.94GB/s ± 0%   +0.03%  (p=0.032 n=17+17)
JumpdestOpAnalysis/PUSH4-8     2.33GB/s ± 0%  2.42GB/s ± 0%   +4.03%  (p=0.000 n=17+17)
JumpdestOpAnalysis/PUSH5-8     2.75GB/s ± 0%  2.75GB/s ± 0%   +0.02%  (p=0.024 n=17+18)
JumpdestOpAnalysis/PUSH6-8     2.87GB/s ± 0%  2.87GB/s ± 0%     ~     (p=0.832 n=18+17)
JumpdestOpAnalysis/PUSH7-8     3.48GB/s ± 0%  3.86GB/s ± 0%  +10.91%  (p=0.000 n=17+19)
JumpdestOpAnalysis/PUSH8-8     3.26GB/s ± 0%  3.56GB/s ± 0%   +8.90%  (p=0.000 n=16+20)
JumpdestOpAnalysis/PUSH9-8     3.35GB/s ± 0%  3.59GB/s ± 1%   +7.27%  (p=0.000 n=19+19)
JumpdestOpAnalysis/PUSH10-8    2.80GB/s ± 0%  2.58GB/s ± 0%   -7.92%  (p=0.000 n=19+20)
JumpdestOpAnalysis/PUSH11-8    2.90GB/s ± 0%  2.90GB/s ± 0%   -0.11%  (p=0.033 n=18+18)
JumpdestOpAnalysis/PUSH12-8    3.26GB/s ± 0%  3.33GB/s ± 0%   +2.22%  (p=0.000 n=19+18)
JumpdestOpAnalysis/PUSH13-8    3.48GB/s ± 0%  3.48GB/s ± 0%     ~     (p=0.060 n=20+17)
JumpdestOpAnalysis/PUSH14-8    3.48GB/s ± 0%  3.50GB/s ± 0%   +0.73%  (p=0.000 n=20+17)
JumpdestOpAnalysis/PUSH15-8    3.51GB/s ± 2%  3.66GB/s ± 0%   +4.26%  (p=0.000 n=18+20)
JumpdestOpAnalysis/PUSH16-8    4.90GB/s ± 0%  4.91GB/s ± 0%   +0.09%  (p=0.001 n=19+16)
JumpdestOpAnalysis/PUSH17-8    5.96GB/s ± 1%  4.87GB/s ± 0%  -18.20%  (p=0.000 n=20+16)
JumpdestOpAnalysis/PUSH18-8    4.53GB/s ± 1%  4.34GB/s ± 0%   -4.19%  (p=0.000 n=20+18)
JumpdestOpAnalysis/PUSH19-8    4.56GB/s ± 0%  4.34GB/s ± 0%   -4.88%  (p=0.000 n=17+17)
JumpdestOpAnalysis/PUSH20-8    4.94GB/s ± 0%  4.78GB/s ± 1%   -3.18%  (p=0.000 n=19+19)
JumpdestOpAnalysis/PUSH21-8    5.15GB/s ± 0%  4.87GB/s ± 1%   -5.32%  (p=0.000 n=17+18)
JumpdestOpAnalysis/PUSH22-8    5.07GB/s ± 0%  4.82GB/s ± 1%   -4.99%  (p=0.000 n=18+20)
JumpdestOpAnalysis/PUSH23-8    5.19GB/s ± 0%  4.93GB/s ± 1%   -4.92%  (p=0.000 n=19+20)
JumpdestOpAnalysis/PUSH24-8    5.40GB/s ± 0%  5.41GB/s ± 0%     ~     (p=0.055 n=18+16)
JumpdestOpAnalysis/PUSH25-8    5.61GB/s ± 0%  5.35GB/s ± 0%   -4.60%  (p=0.000 n=18+17)
JumpdestOpAnalysis/PUSH26-8    5.05GB/s ± 0%  4.87GB/s ± 0%   -3.57%  (p=0.000 n=18+17)
JumpdestOpAnalysis/PUSH27-8    5.05GB/s ± 0%  4.85GB/s ± 0%   -3.87%  (p=0.000 n=19+17)
JumpdestOpAnalysis/PUSH28-8    5.36GB/s ± 0%  5.23GB/s ± 0%   -2.49%  (p=0.000 n=19+17)
JumpdestOpAnalysis/PUSH29-8    5.52GB/s ± 0%  5.30GB/s ± 0%   -3.96%  (p=0.000 n=17+18)
JumpdestOpAnalysis/PUSH30-8    5.44GB/s ± 0%  5.23GB/s ± 0%   -3.83%  (p=0.000 n=18+18)
JumpdestOpAnalysis/PUSH31-8    5.53GB/s ± 0%  5.33GB/s ± 0%   -3.71%  (p=0.000 n=17+16)
JumpdestOpAnalysis/PUSH32-8    6.70GB/s ± 1%  6.47GB/s ± 0%   -3.49%  (p=0.000 n=20+16)
JumpdestOpAnalysis/JUMPDEST-8  1.87GB/s ± 0%  2.49GB/s ± 0%  +33.03%  (p=0.000 n=17+18)
JumpdestOpAnalysis/STOP-8      1.87GB/s ± 0%  2.49GB/s ± 0%  +33.02%  (p=0.000 n=17+17)
[Geo mean]                     3.67GB/s       3.69GB/s        +0.48%

But if you inspect only the second commit, which just removes the lookup table for set1, you can see unexpected changes:

name                           old speed      new speed      delta
JumpdestOpAnalysis/PUSH1-8     1.25GB/s ± 0%  1.35GB/s ± 0%   +8.12%  (p=0.000 n=10+16)
JumpdestOpAnalysis/PUSH2-8     1.44GB/s ± 0%  1.31GB/s ± 0%   -8.64%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH3-8     1.94GB/s ± 0%  1.94GB/s ± 0%   -0.16%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH4-8     2.39GB/s ± 0%  2.42GB/s ± 0%   +1.36%  (p=0.000 n=10+17)
JumpdestOpAnalysis/PUSH5-8     2.66GB/s ± 1%  2.75GB/s ± 0%   +3.58%  (p=0.000 n=9+18)
JumpdestOpAnalysis/PUSH6-8     3.17GB/s ± 0%  2.87GB/s ± 0%   -9.35%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH7-8     3.86GB/s ± 0%  3.86GB/s ± 0%     ~     (p=0.160 n=8+19)
JumpdestOpAnalysis/PUSH8-8     2.71GB/s ± 3%  3.56GB/s ± 0%  +31.34%  (p=0.000 n=10+20)
JumpdestOpAnalysis/PUSH9-8     3.35GB/s ± 0%  3.59GB/s ± 1%   +7.10%  (p=0.000 n=9+19)
JumpdestOpAnalysis/PUSH10-8    2.97GB/s ± 0%  2.58GB/s ± 0%  -13.33%  (p=0.000 n=9+20)
JumpdestOpAnalysis/PUSH11-8    3.07GB/s ± 0%  2.90GB/s ± 0%   -5.64%  (p=0.000 n=8+18)
JumpdestOpAnalysis/PUSH12-8    3.29GB/s ± 0%  3.33GB/s ± 0%   +1.23%  (p=0.000 n=10+18)
JumpdestOpAnalysis/PUSH13-8    3.48GB/s ± 0%  3.48GB/s ± 0%     ~     (p=0.243 n=8+17)
JumpdestOpAnalysis/PUSH14-8    3.70GB/s ± 0%  3.50GB/s ± 0%   -5.32%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH15-8    3.86GB/s ± 0%  3.66GB/s ± 0%   -5.26%  (p=0.000 n=9+20)
JumpdestOpAnalysis/PUSH16-8    4.67GB/s ± 1%  4.91GB/s ± 0%   +5.18%  (p=0.000 n=9+16)
JumpdestOpAnalysis/PUSH17-8    5.97GB/s ± 0%  4.87GB/s ± 0%  -18.43%  (p=0.000 n=8+16)
JumpdestOpAnalysis/PUSH18-8    4.81GB/s ± 0%  4.34GB/s ± 0%   -9.73%  (p=0.000 n=9+18)
JumpdestOpAnalysis/PUSH19-8    4.81GB/s ± 0%  4.34GB/s ± 0%   -9.86%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH20-8    5.00GB/s ± 0%  4.78GB/s ± 1%   -4.44%  (p=0.000 n=10+19)
JumpdestOpAnalysis/PUSH21-8    5.08GB/s ± 0%  4.87GB/s ± 1%   -4.01%  (p=0.000 n=8+18)
JumpdestOpAnalysis/PUSH22-8    5.34GB/s ± 0%  4.82GB/s ± 1%   -9.75%  (p=0.000 n=8+20)
JumpdestOpAnalysis/PUSH23-8    5.46GB/s ± 0%  4.93GB/s ± 1%   -9.65%  (p=0.000 n=9+20)
JumpdestOpAnalysis/PUSH24-8    5.40GB/s ± 0%  5.41GB/s ± 0%     ~     (p=0.336 n=9+16)
JumpdestOpAnalysis/PUSH25-8    5.62GB/s ± 0%  5.35GB/s ± 0%   -4.68%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH26-8    5.28GB/s ± 0%  4.87GB/s ± 0%   -7.68%  (p=0.000 n=8+17)
JumpdestOpAnalysis/PUSH27-8    5.27GB/s ± 0%  4.85GB/s ± 0%   -7.86%  (p=0.000 n=8+17)
JumpdestOpAnalysis/PUSH28-8    5.41GB/s ± 0%  5.23GB/s ± 0%   -3.45%  (p=0.000 n=9+17)
JumpdestOpAnalysis/PUSH29-8    5.51GB/s ± 1%  5.30GB/s ± 0%   -3.79%  (p=0.000 n=10+18)
JumpdestOpAnalysis/PUSH30-8    5.67GB/s ± 0%  5.23GB/s ± 0%   -7.67%  (p=0.000 n=9+18)
JumpdestOpAnalysis/PUSH31-8    5.76GB/s ± 0%  5.33GB/s ± 0%   -7.54%  (p=0.000 n=9+16)
JumpdestOpAnalysis/PUSH32-8    6.77GB/s ± 0%  6.47GB/s ± 0%   -4.44%  (p=0.000 n=8+16)
JumpdestOpAnalysis/JUMPDEST-8  1.87GB/s ± 0%  2.49GB/s ± 0%  +32.99%  (p=0.000 n=8+18)
JumpdestOpAnalysis/STOP-8      1.87GB/s ± 0%  2.49GB/s ± 0%  +33.01%  (p=0.000 n=9+17)
[Geo mean]                     3.75GB/s       3.69GB/s        -1.71%
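
For readers following along: the second commit swaps a table lookup in set1 for a shift. Here is a minimal sketch of why the two are interchangeable once the bitmap is LSB-first. The table contents and the names `lookupLSB`, `set1Table`, `set1Shift` and `sameResult` are assumptions made for illustration (the page does not show the table itself); only the shift expression is taken from the diff.

```go
package main

import "fmt"

// lookupLSB is a hypothetical stand-in for the table the second commit drops.
// Assumption: with the reversed bit order, entry i has only bit i set.
var lookupLSB = [8]byte{0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80}

type bitvec []byte

// set1Table marks position pos through the lookup table.
func (bits bitvec) set1Table(pos uint64) {
	bits[pos/8] |= lookupLSB[pos%8]
}

// set1Shift marks position pos with a plain shift, as in the diff.
func (bits bitvec) set1Shift(pos uint64) {
	bits[pos/8] |= 1 << (pos % 8)
}

// sameResult reports whether both variants yield identical bitmaps
// for every bit position of an n-byte bitmap.
func sameResult(n int) bool {
	for pos := uint64(0); pos < uint64(n*8); pos++ {
		a, b := make(bitvec, n), make(bitvec, n)
		a.set1Table(pos)
		b.set1Shift(pos)
		for i := range a {
			if a[i] != b[i] {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(sameResult(4)) // true
}
```

The results are identical, so any speed difference between the two commits comes from the generated code (a memory load versus a BTS instruction, plus alignment shifts), not from the bitmap contents.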

@chfast
Contributor Author

chfast commented Dec 16, 2021

On a Zen3 under external load we got an even bigger boost for the non-PUSH benchmarks, but also a regression for PUSH1.

name                            old speed      new speed      delta
JumpdestOpAnalysis/PUSH1-12      854MB/s ± 3%   798MB/s ± 3%   -6.57%  (p=0.000 n=19+19)
JumpdestOpAnalysis/PUSH2-12     1.00GB/s ± 2%  1.07GB/s ± 3%   +7.27%  (p=0.000 n=19+20)
JumpdestOpAnalysis/PUSH3-12     1.23GB/s ± 1%  1.22GB/s ± 3%   -1.19%  (p=0.001 n=18+18)
JumpdestOpAnalysis/PUSH4-12     1.57GB/s ± 3%  1.64GB/s ± 2%   +4.55%  (p=0.000 n=19+19)
JumpdestOpAnalysis/PUSH5-12     1.96GB/s ± 2%  1.83GB/s ± 2%   -6.52%  (p=0.000 n=20+19)
JumpdestOpAnalysis/PUSH6-12     2.10GB/s ± 2%  2.14GB/s ± 2%   +2.05%  (p=0.000 n=20+18)
JumpdestOpAnalysis/PUSH7-12     2.14GB/s ± 2%  2.13GB/s ± 2%     ~     (p=0.325 n=19+19)
JumpdestOpAnalysis/PUSH8-12     1.93GB/s ± 2%  2.12GB/s ± 2%   +9.87%  (p=0.000 n=19+19)
JumpdestOpAnalysis/PUSH9-12     2.38GB/s ± 2%  2.34GB/s ± 2%   -1.68%  (p=0.000 n=18+19)
JumpdestOpAnalysis/PUSH10-12    2.11GB/s ± 3%  2.32GB/s ± 1%   +9.83%  (p=0.000 n=20+19)
JumpdestOpAnalysis/PUSH11-12    2.14GB/s ± 3%  2.12GB/s ± 4%     ~     (p=0.070 n=19+19)
JumpdestOpAnalysis/PUSH12-12    2.42GB/s ± 2%  2.52GB/s ± 2%   +3.99%  (p=0.000 n=19+18)
JumpdestOpAnalysis/PUSH13-12    2.56GB/s ± 2%  2.59GB/s ± 2%   +1.26%  (p=0.000 n=19+20)
JumpdestOpAnalysis/PUSH14-12    2.67GB/s ± 2%  2.66GB/s ± 3%     ~     (p=0.588 n=20+19)
JumpdestOpAnalysis/PUSH15-12    2.62GB/s ± 2%  2.61GB/s ± 2%     ~     (p=0.418 n=19+19)
JumpdestOpAnalysis/PUSH16-12    3.02GB/s ± 2%  3.63GB/s ± 2%  +20.21%  (p=0.000 n=19+19)
JumpdestOpAnalysis/PUSH17-12    3.45GB/s ± 3%  3.46GB/s ± 5%     ~     (p=0.607 n=19+20)
JumpdestOpAnalysis/PUSH18-12    3.06GB/s ± 3%  3.59GB/s ± 2%  +17.22%  (p=0.000 n=20+18)
JumpdestOpAnalysis/PUSH19-12    3.06GB/s ± 2%  3.26GB/s ± 3%   +6.75%  (p=0.000 n=19+18)
JumpdestOpAnalysis/PUSH20-12    3.36GB/s ± 2%  3.60GB/s ± 2%   +6.91%  (p=0.000 n=17+20)
JumpdestOpAnalysis/PUSH21-12    3.55GB/s ± 1%  3.57GB/s ± 2%     ~     (p=0.354 n=19+19)
JumpdestOpAnalysis/PUSH22-12    3.47GB/s ± 2%  3.70GB/s ± 3%   +6.85%  (p=0.000 n=18+19)
JumpdestOpAnalysis/PUSH23-12    3.41GB/s ± 3%  3.63GB/s ± 2%   +6.49%  (p=0.000 n=20+19)
JumpdestOpAnalysis/PUSH24-12    3.81GB/s ± 1%  4.09GB/s ± 3%   +7.34%  (p=0.000 n=20+19)
JumpdestOpAnalysis/PUSH25-12    4.24GB/s ± 3%  3.95GB/s ± 2%   -7.00%  (p=0.000 n=19+18)
JumpdestOpAnalysis/PUSH26-12    3.77GB/s ± 3%  3.80GB/s ± 2%     ~     (p=0.116 n=19+19)
JumpdestOpAnalysis/PUSH27-12    3.70GB/s ± 5%  3.71GB/s ± 3%     ~     (p=0.665 n=19+19)
JumpdestOpAnalysis/PUSH28-12    3.97GB/s ± 3%  3.99GB/s ± 3%     ~     (p=0.224 n=19+20)
JumpdestOpAnalysis/PUSH29-12    4.02GB/s ± 2%  3.98GB/s ± 3%   -1.18%  (p=0.029 n=19+18)
JumpdestOpAnalysis/PUSH30-12    4.14GB/s ± 2%  4.22GB/s ± 1%   +2.15%  (p=0.000 n=18+19)
JumpdestOpAnalysis/PUSH31-12    3.99GB/s ± 3%  4.26GB/s ± 2%   +6.59%  (p=0.000 n=20+20)
JumpdestOpAnalysis/PUSH32-12    4.64GB/s ± 3%  4.65GB/s ± 2%     ~     (p=0.583 n=19+19)
JumpdestOpAnalysis/JUMPDEST-12  1.07GB/s ± 2%  1.54GB/s ± 3%  +44.53%  (p=0.000 n=19+19)
JumpdestOpAnalysis/STOP-12      1.06GB/s ± 3%  1.54GB/s ± 2%  +45.23%  (p=0.000 n=19+18)
[Geo mean]                      2.54GB/s       2.66GB/s        +4.92%
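
A note on reading the [Geo mean] rows in these tables: assuming they are the geometric mean of each speed column (which is how benchstat-style summaries are usually computed), large per-benchmark swings in opposite directions can cancel out to a small overall delta. The sketch below uses three rows from the Zen3 table; `geoMean` and `approxEqual` are hypothetical helper names, not part of any tool.

```go
package main

import (
	"fmt"
	"math"
)

// geoMean returns the geometric mean of xs, computed through logarithms.
// It assumes every value is positive and returns NaN for an empty slice.
func geoMean(xs []float64) float64 {
	if len(xs) == 0 {
		return math.NaN()
	}
	sum := 0.0
	for _, x := range xs {
		sum += math.Log(x)
	}
	return math.Exp(sum / float64(len(xs)))
}

// approxEqual compares two floats with an absolute tolerance.
func approxEqual(a, b, tol float64) bool {
	return math.Abs(a-b) <= tol
}

func main() {
	// Speeds in GB/s for PUSH1, PUSH16 and JUMPDEST, copied from the Zen3 table.
	oldSpeeds := []float64{0.854, 3.02, 1.07}
	newSpeeds := []float64{0.798, 3.63, 1.54}

	o, n := geoMean(oldSpeeds), geoMean(newSpeeds)
	fmt.Printf("old %.2f GB/s, new %.2f GB/s, delta %+.2f%%\n", o, n, (n/o-1)*100)
}
```

Because the mean is taken over ratios rather than differences, a +45% on JUMPDEST and a -6.5% on PUSH1 do not simply add up; that is why the per-row deltas matter more than the summary line here.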

@chfast
Contributor Author

chfast commented Dec 16, 2021

Assembly diff for lookup table removal.

diff --git a/rev1.asm b/rev2.asm
index 4f7d7310d..c92d58357 100644
--- a/rev1.asm
+++ b/rev2.asm
@@ -12,7 +12,7 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVQ R9, CX
                     NOPL
                     CMPQ CX, BX
-                    JBE 0x62ad66
+                    JBE 0x62ad3c
 		op := OpCode(code[pc])
                     MOVZX 0(AX)(CX*1), DX
 		pc++
@@ -20,49 +20,46 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
 		if op < PUSH1 || op > PUSH32 {
                     LEAL -0x60(DX), R10
                     CMPL $0x1f, R10
-                    JBE 0x62aac4
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
-			continue
-                    JMP 0x62aa9c
+                    JA 0x62aa9c
 		numbits := op - PUSH1 + 1
                     ADDL $-0x5f, DX
+                    NOPW
 		if numbits >= 8 {
                     CMPL $0x8, DL
-                    JAE 0x62ae18
+                    JAE 0x62aded
 		switch numbits {
                     CMPL $0x3, DL
 		case 3:
-                    JA 0x62abd4
+                    JA 0x62abc5
 		case 1:
                     CMPL $0x1, DL
-                    JNE 0x62ab14
+                    JNE 0x62ab03
 			bits.set1(pc)
                     NOPL
-	bits[pos/8] |= lookup[pos%8]
+	bits[pos/8] |= 1 << (pos % 8)
                     MOVQ R9, DX
                     SHRQ $0x3, R9
+                    NOPL
                     CMPQ R9, SI
-                    JBE 0x62adfe
+                    JBE 0x62add3
                     MOVZX 0(DI)(R9*1), R10
                     MOVQ DX, R11
                     ANDQ $0x7, DX
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
-                    MOVZX 0(R12)(DX*1), DX
-                    ORL R10, DX
-                    MOVB DL, 0(DI)(R9*1)
+                    BTSL DX, R10
+                    MOVB R10, 0(DI)(R9*1)
 			pc += 1
                     LEAQ 0x1(R11), R9
                     JMP 0x62aa9c
 		case 2:
                     CMPL $0x2, DL
-                    JNE 0x62ab72
+                    JNE 0x62ab60
 			bits.setN(set2BitsMask, pc)
                     NOPL
 	bits[pos/8] |= l
                     MOVQ R9, CX
                     SHRQ $0x3, R9
                     CMPQ R9, SI
-                    JBE 0x62adf3
+                    JBE 0x62adc8
 	a := flag << (pos % 8)
                     MOVQ CX, DX
                     ANDQ $0x7, CX
@@ -76,27 +73,27 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     SHRW $0x8, R10
 	if h != 0 {
                     TESTL R10, R10
-                    JE 0x62ab62
+                    JE 0x62ab51
 		bits[pos/8+1] = h
                     LEAQ 0x1(R9), R11
                     CMPQ R11, SI
-                    JBE 0x62ade8
+                    JBE 0x62adbd
                     MOVB R10, 0x1(R9)(DI*1)
 			pc += 2
                     LEAQ 0x2(DX), R9
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
                     JMP 0x62aa9c
+                    NOPW 0(AX)(AX*1)
 		switch numbits {
                     CMPL $0x3, DL
 		case 3:
-                    JNE 0x62ad5a
+                    JNE 0x62aa9c
 			bits.setN(set3BitsMask, pc)
                     NOPL
 	bits[pos/8] |= l
                     MOVQ R9, CX
                     SHRQ $0x3, R9
                     CMPQ R9, SI
-                    JBE 0x62addd
+                    JBE 0x62adb2
 	a := flag << (pos % 8)
                     MOVQ CX, DX
                     ANDQ $0x7, CX
@@ -108,33 +105,34 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVB R11, 0(DI)(R9*1)
 	h := byte(a >> 8)
                     SHRW $0x8, R10
+                    NOPL 0(AX)(AX*1)
 	if h != 0 {
                     TESTL R10, R10
-                    JE 0x62abc4
+                    JE 0x62abb7
 		bits[pos/8+1] = h
                     LEAQ 0x1(R9), R11
                     CMPQ R11, SI
-                    JBE 0x62add2
+                    JBE 0x62ada7
                     MOVB R10, 0x1(R9)(DI*1)
 			pc += 3
                     LEAQ 0x3(DX), R9
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
+                    NOPL 0(AX)(AX*1)
                     JMP 0x62aa9c
 		switch numbits {
                     CMPL $0x5, DL
 		case 5:
-                    JA 0x62aca0
-                    NOPL 0(AX)
+                    JA 0x62ac85
 		case 4:
                     CMPL $0x4, DL
-                    JNE 0x62ac3e
+                    JNE 0x62ac2a
 			bits.setN(set4BitsMask, pc)
                     NOPL
 	bits[pos/8] |= l
                     MOVQ R9, CX
                     SHRQ $0x3, R9
+                    NOPL 0(AX)(AX*1)
                     CMPQ R9, SI
-                    JBE 0x62adc7
+                    JBE 0x62ad9c
 	a := flag << (pos % 8)
                     MOVQ CX, DX
                     ANDQ $0x7, CX
@@ -148,15 +146,14 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     SHRW $0x8, R10
 	if h != 0 {
                     TESTL R10, R10
-                    JE 0x62ac2e
+                    JE 0x62ac21
 		bits[pos/8+1] = h
                     LEAQ 0x1(R9), R11
                     CMPQ R11, SI
-                    JBE 0x62adbc
+                    JBE 0x62ad91
                     MOVB R10, 0x1(R9)(DI*1)
 			pc += 4
                     LEAQ 0x4(DX), R9
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
                     JMP 0x62aa9c
 			bits.setN(set5BitsMask, pc)
                     NOPL
@@ -164,7 +161,7 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVQ R9, CX
                     SHRQ $0x3, R9
                     CMPQ R9, SI
-                    JBE 0x62adb1
+                    JBE 0x62ad86
 	a := flag << (pos % 8)
                     MOVQ CX, DX
                     ANDQ $0x7, CX
@@ -176,30 +173,29 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVB R11, 0(DI)(R9*1)
 	h := byte(a >> 8)
                     SHRW $0x8, R10
+                    NOPL 0(AX)
 	if h != 0 {
                     TESTL R10, R10
-                    JE 0x62ac8e
+                    JE 0x62ac77
 		bits[pos/8+1] = h
                     LEAQ 0x1(R9), R11
-                    NOPL 0(AX)
                     CMPQ R11, SI
-                    JBE 0x62ada6
+                    JBE 0x62ad7b
                     MOVB R10, 0x1(R9)(DI*1)
 			pc += 5
                     LEAQ 0x5(DX), R9
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
+                    NOPL 0(AX)(AX*1)
                     JMP 0x62aa9c
-                    NOPW
 		case 6:
                     CMPL $0x6, DL
-                    JNE 0x62ad00
+                    JNE 0x62ace5
 			bits.setN(set6BitsMask, pc)
                     NOPL
 	bits[pos/8] |= l
                     MOVQ R9, CX
                     SHRQ $0x3, R9
                     CMPQ R9, SI
-                    JBE 0x62ad9b
+                    JBE 0x62ad70
 	a := flag << (pos % 8)
                     MOVQ CX, DX
                     ANDQ $0x7, CX
@@ -211,29 +207,29 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVB R11, 0(DI)(R9*1)
 	h := byte(a >> 8)
                     SHRW $0x8, R10
+                    NOPL 0(AX)
 	if h != 0 {
                     TESTL R10, R10
-                    JE 0x62acee
+                    JE 0x62acd7
 		bits[pos/8+1] = h
                     LEAQ 0x1(R9), R11
                     CMPQ R11, SI
-                    JBE 0x62ad90
+                    JBE 0x62ad65
                     MOVB R10, 0x1(R9)(DI*1)
 			pc += 6
                     LEAQ 0x6(DX), R9
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
+                    NOPL 0(AX)(AX*1)
                     JMP 0x62aa9c
-                    NOPW
 		case 7:
                     CMPL $0x7, DL
-                    JNE 0x62ad5a
+                    JNE 0x62aa9c
 			bits.setN(set7BitsMask, pc)
                     NOPL
 	bits[pos/8] |= l
                     MOVQ R9, CX
                     SHRQ $0x3, R9
                     CMPQ R9, SI
-                    JBE 0x62ad85
+                    JBE 0x62ad5a
 	a := flag << (pos % 8)
                     MOVQ CX, DX
                     ANDQ $0x7, CX
@@ -245,21 +241,17 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVB R11, 0(DI)(R9*1)
 	h := byte(a >> 8)
                     SHRW $0x8, R10
+                    NOPL 0(AX)
 	if h != 0 {
                     TESTL R10, R10
-                    JE 0x62ad4a
+                    JE 0x62ad33
 		bits[pos/8+1] = h
                     LEAQ 0x1(R9), R11
-                    NOPL 0(AX)
                     CMPQ R11, SI
-                    JBE 0x62ad79
+                    JBE 0x62ad4f
                     MOVB R10, 0x1(R9)(DI*1)
 			pc += 7
                     LEAQ 0x7(DX), R9
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
-                    JMP 0x62aa9c
-                    LEAQ github.com/ethereum/go-ethereum/core/vm.lookup(SB), R12
-		switch numbits {
                     JMP 0x62aa9c
 	return bits
                     MOVQ DI, AX
@@ -271,7 +263,6 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
 		bits[pos/8+1] = h
                     MOVQ R11, AX
                     MOVQ SI, CX
-                    NOPL
                     CALL runtime.panicIndexU(SB)
 	bits[pos/8] |= l
                     MOVQ R9, AX
@@ -317,7 +308,7 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     MOVQ R9, AX
                     MOVQ SI, CX
                     CALL runtime.panicIndexU(SB)
-	bits[pos/8] |= lookup[pos%8]
+	bits[pos/8] |= 1 << (pos % 8)
                     MOVQ R9, AX
                     MOVQ SI, CX
                     CALL runtime.panicIndexU(SB)
@@ -330,14 +321,15 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     LEAQ 0x10(R10), R9
 			for ; numbits >= 16; numbits -= 16 {
                     CMPL $0x10, DL
-                    JB 0x62ae76
+                    JB 0x62ae56
 				bits.set16(pc)
                     NOPL
 	bits[pos/8] |= a
                     MOVQ R9, CX
                     SHRQ $0x3, R9
+                    NOPW 0(AX)(AX*1)
                     CMPQ R9, SI
-                    JBE 0x62aedd
+                    JBE 0x62aebd
 	a := byte(0xFF << (pos % 8))
                     MOVQ CX, R10
                     ANDQ $0x7, CX
@@ -350,14 +342,14 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
 	bits[pos/8+1] = 0xFF
                     LEAQ 0x1(R9), R12
                     CMPQ R12, SI
-                    JBE 0x62aed2
+                    JBE 0x62aeb2
                     MOVB $0xff, 0x1(R9)(DI*1)
 	bits[pos/8+2] = ^a
                     LEAQ 0x2(R9), R12
                     NOPL 0(AX)
                     CMPQ R12, SI
-                    JA 0x62ae09
-                    JMP 0x62aec7
+                    JA 0x62adde
+                    JMP 0x62aea7
 	bits[pos/8+1] = ^a
                     NOTL R11
                     MOVB R11, 0x1(R9)(DI*1)
@@ -367,14 +359,14 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
                     LEAQ 0x8(R10), R9
 			for ; numbits >= 8; numbits -= 8 {
                     CMPL $0x8, DL
-                    JB 0x62aad0
+                    JB 0x62aac9
 				bits.set8(pc)
                     NOPL
 	bits[pos/8] |= a
                     MOVQ R9, CX
                     SHRQ $0x3, R9
                     CMPQ R9, SI
-                    JBE 0x62aebc
+                    JBE 0x62ae9c
 	a := byte(0xFF << (pos % 8))
                     MOVQ CX, R10
                     ANDQ $0x7, CX
@@ -387,7 +379,7 @@ func codeBitmapInternal(code, bits bitvec) bitvec {
 	bits[pos/8+1] = ^a
                     LEAQ 0x1(R9), R12
                     CMPQ R12, SI
-                    JA 0x62ae67
+                    JA 0x62ae47
                     MOVQ R12, AX
                     MOVQ SI, CX
                     CALL runtime.panicIndexU(SB)
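
To make the Go source lines interleaved in the diff above easier to follow, here is a minimal sketch of the bitmap helpers and the analysis loop in the new (LSB-first) layout. The bodies of `setN`, `set8` and `set16` are pieced together from the source lines visible in the diff; the simplified loop (one computed mask instead of the per-case `switch`), the `isCode` helper, the `codeBitmap` wrapper and the bitmap padding are assumptions, not the actual go-ethereum code.

```go
package main

import "fmt"

// EVM opcode values for PUSH1 and PUSH32.
const (
	push1  = 0x60
	push32 = 0x7f
)

// bitvec holds one bit per code byte: 0 = opcode, 1 = PUSH data.
type bitvec []byte

// setN ORs a run of up to 7 one-bits (given as flag) into the bitmap at pos.
func (bits bitvec) setN(flag uint16, pos uint64) {
	a := flag << (pos % 8)
	bits[pos/8] |= byte(a)
	if h := byte(a >> 8); h != 0 {
		bits[pos/8+1] = h // the run spilled into the next byte
	}
}

// set8 marks 8 consecutive positions starting at pos.
func (bits bitvec) set8(pos uint64) {
	a := byte(0xFF << (pos % 8))
	bits[pos/8] |= a
	bits[pos/8+1] = ^a
}

// set16 marks 16 consecutive positions starting at pos.
func (bits bitvec) set16(pos uint64) {
	a := byte(0xFF << (pos % 8))
	bits[pos/8] |= a
	bits[pos/8+1] = 0xFF
	bits[pos/8+2] = ^a
}

// isCode reports whether pos is an opcode (hypothetical helper).
func (bits bitvec) isCode(pos uint64) bool {
	return (bits[pos/8]>>(pos%8))&1 == 0
}

// codeBitmap marks the data bytes of every PUSHn in code.
// Assumption: 4 spare bytes keep a trailing PUSH32 inside the slice.
func codeBitmap(code []byte) bitvec {
	bits := make(bitvec, len(code)/8+1+4)
	for pc := uint64(0); pc < uint64(len(code)); {
		op := code[pc]
		pc++
		if op < push1 || op > push32 {
			continue
		}
		numbits := uint64(op-push1) + 1
		for ; numbits >= 16; numbits -= 16 {
			bits.set16(pc)
			pc += 16
		}
		for ; numbits >= 8; numbits -= 8 {
			bits.set8(pc)
			pc += 8
		}
		if numbits > 0 { // simplified: the real loop switches on numbits
			bits.setN(uint16(1)<<numbits-1, pc)
			pc += numbits
		}
	}
	return bits
}

func main() {
	// PUSH1 0x5b, JUMPDEST: byte 1 is data, byte 2 is a real JUMPDEST.
	bits := codeBitmap([]byte{0x60, 0x5b, 0x5b})
	fmt.Printf("%08b\n", bits[0])                // 00000010
	fmt.Println(bits.isCode(1), bits.isCode(2)) // false true
}
```

The JUMPDEST and STOP benchmarks never reach any of the set helpers, so their speed-up comes purely from the tighter opcode-range check at the top of the loop (the dropped `LEAQ ...lookup(SB)` and `JMP` in the first hunks).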

@holiman
Contributor

holiman commented Dec 16, 2021

Why should reversing the bits be faster? (and with that I'm not being snarky and saying it isn't, I'm wondering what the theory-behind-the-scenes is)

@fjl fjl changed the title from "Reverse bit order in bytes of code bitmap" to "core/vm: reverse bit order in bytes of code bitmap" Dec 16, 2021
@chfast
Contributor Author

chfast commented Dec 16, 2021

> Why should reversing the bits be faster? (and with that I'm not being snarky and saying it isn't, I'm wondering what the theory-behind-the-scenes is)

It is a bit more "natural" for some bit-manip CPU instructions.

E.g. in

func (bits bitvec) set1_x(pos uint64) {
	bits[pos/8] |= 0x80 >> (pos%8)
}

vs

func (bits bitvec) set1(pos uint64) {
	bits[pos/8] |= 1 << (pos%8)
}

the core bit manipulation part

        MOVL    $-128, BX
        SHRB    CL, BL
        ORL     BX, DX

is replaced with a BTS instruction

        BTSL    DX, CX
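
A small runnable sketch of the two layouts, for anyone who wants to see the difference in the bytes themselves. The names `set1MSB`, `set1LSB` and `layoutsAreMirrored` are made up for this example; the two shift expressions are the ones quoted above.

```go
package main

import (
	"fmt"
	"math/bits"
)

type bitvec []byte

// set1MSB is the old layout: position 0 maps to the most significant bit.
func (bv bitvec) set1MSB(pos uint64) {
	bv[pos/8] |= 0x80 >> (pos % 8)
}

// set1LSB is the new layout: position 0 maps to the least significant bit.
func (bv bitvec) set1LSB(pos uint64) {
	bv[pos/8] |= 1 << (pos % 8)
}

// layoutsAreMirrored checks that, for every position in an n-byte bitmap,
// each byte of the old layout is the bit-reversal of the new one.
func layoutsAreMirrored(n int) bool {
	for pos := uint64(0); pos < uint64(n*8); pos++ {
		msb, lsb := make(bitvec, n), make(bitvec, n)
		msb.set1MSB(pos)
		lsb.set1LSB(pos)
		for i := range msb {
			if bits.Reverse8(msb[i]) != lsb[i] {
				return false
			}
		}
	}
	return true
}

func main() {
	msb, lsb := make(bitvec, 1), make(bitvec, 1)
	msb.set1MSB(2)
	lsb.set1LSB(2)
	fmt.Printf("MSB-first: %08b\n", msb[0]) // MSB-first: 00100000
	fmt.Printf("LSB-first: %08b\n", lsb[0]) // LSB-first: 00000100
	fmt.Println(layoutsAreMirrored(4))      // true
}
```

Both layouts carry the same information; only the bit numbering within each byte differs. In the LSB-first form the bit index is the shift count directly, which is the form that maps onto a single BTS.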

@chfast
Contributor Author

chfast commented Dec 16, 2021

One more set of benchmark results, from a Skylake laptop.

name                           old speed      new speed      delta
JumpdestOpAnalysis/PUSH1-8     1.29GB/s ± 3%  1.31GB/s ± 1%   +1.41%  (p=0.001 n=30+17)
JumpdestOpAnalysis/PUSH2-8     1.41GB/s ± 5%  1.44GB/s ± 3%   +2.40%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH3-8     1.52GB/s ± 5%  1.73GB/s ± 5%  +14.10%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH4-8     2.13GB/s ± 5%  2.28GB/s ± 5%   +6.87%  (p=0.000 n=30+19)
JumpdestOpAnalysis/PUSH5-8     2.66GB/s ± 5%  2.69GB/s ± 4%     ~     (p=0.211 n=30+20)
JumpdestOpAnalysis/PUSH6-8     2.96GB/s ± 4%  3.09GB/s ± 2%   +4.35%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH7-8     2.71GB/s ± 4%  3.40GB/s ± 2%  +25.53%  (p=0.000 n=29+20)
JumpdestOpAnalysis/PUSH8-8     2.66GB/s ± 3%  3.02GB/s ± 7%  +13.60%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH9-8     2.93GB/s ± 5%  3.16GB/s ± 4%   +7.74%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH10-8    2.96GB/s ± 5%  2.99GB/s ± 3%     ~     (p=0.055 n=30+20)
JumpdestOpAnalysis/PUSH11-8    2.81GB/s ± 4%  3.06GB/s ± 1%   +8.62%  (p=0.000 n=30+19)
JumpdestOpAnalysis/PUSH12-8    3.21GB/s ± 5%  3.39GB/s ± 4%   +5.42%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH13-8    3.43GB/s ± 4%  3.43GB/s ± 3%     ~     (p=0.659 n=30+20)
JumpdestOpAnalysis/PUSH14-8    3.57GB/s ± 5%  3.68GB/s ± 2%   +3.22%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH15-8    3.56GB/s ± 4%  3.84GB/s ± 3%   +7.91%  (p=0.000 n=30+19)
JumpdestOpAnalysis/PUSH16-8    4.98GB/s ± 3%  4.81GB/s ± 6%   -3.30%  (p=0.046 n=30+20)
JumpdestOpAnalysis/PUSH17-8    5.14GB/s ± 5%  5.15GB/s ± 2%     ~     (p=0.205 n=30+20)
JumpdestOpAnalysis/PUSH18-8    4.72GB/s ± 4%  4.54GB/s ± 3%   -3.70%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH19-8    4.40GB/s ± 4%  4.32GB/s ± 6%   -1.83%  (p=0.048 n=30+20)
JumpdestOpAnalysis/PUSH20-8    4.84GB/s ± 5%  4.89GB/s ± 6%     ~     (p=0.083 n=30+20)
JumpdestOpAnalysis/PUSH21-8    4.99GB/s ± 5%  4.63GB/s ± 8%   -7.10%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH22-8    5.12GB/s ± 3%  4.88GB/s ± 7%   -4.73%  (p=0.000 n=30+20)
JumpdestOpAnalysis/PUSH23-8    4.98GB/s ± 4%  5.04GB/s ± 2%   +1.22%  (p=0.008 n=30+17)
JumpdestOpAnalysis/PUSH24-8    5.50GB/s ± 4%  5.39GB/s ± 5%   -1.90%  (p=0.040 n=30+20)
JumpdestOpAnalysis/PUSH25-8    5.36GB/s ± 5%  5.33GB/s ± 3%     ~     (p=0.806 n=30+20)
JumpdestOpAnalysis/PUSH26-8    5.11GB/s ± 4%  5.10GB/s ± 3%     ~     (p=0.837 n=30+20)
JumpdestOpAnalysis/PUSH27-8    4.95GB/s ± 5%  5.07GB/s ± 5%   +2.43%  (p=0.003 n=30+20)
JumpdestOpAnalysis/PUSH28-8    5.30GB/s ± 5%  5.32GB/s ± 2%     ~     (p=0.350 n=30+20)
JumpdestOpAnalysis/PUSH29-8    5.43GB/s ± 5%  5.37GB/s ± 3%     ~     (p=0.073 n=30+20)
JumpdestOpAnalysis/PUSH30-8    5.49GB/s ± 7%  5.47GB/s ± 4%     ~     (p=0.603 n=30+20)
JumpdestOpAnalysis/PUSH31-8    5.44GB/s ± 4%  5.50GB/s ± 4%     ~     (p=0.204 n=30+20)
JumpdestOpAnalysis/PUSH32-8    6.50GB/s ± 5%  6.27GB/s ± 6%   -3.44%  (p=0.016 n=30+20)
JumpdestOpAnalysis/JUMPDEST-8  1.65GB/s ± 4%  1.90GB/s ±15%  +14.63%  (p=0.000 n=30+20)
JumpdestOpAnalysis/STOP-8      1.66GB/s ± 4%  1.90GB/s ±15%  +14.53%  (p=0.011 n=30+20)
[Geo mean]                     3.53GB/s       3.64GB/s        +3.05%

@chfast chfast marked this pull request as ready for review December 16, 2021 11:58
@holiman
Contributor

holiman commented Dec 17, 2021

On my laptop (master vs this pr - both commits)

name                           old speed      new speed      delta
JumpdestOpAnalysis/PUSH1-6     1.03GB/s ± 4%  1.07GB/s ±10%     ~     (p=0.063 n=10+10)
JumpdestOpAnalysis/PUSH2-6     1.15GB/s ± 5%  1.20GB/s ± 4%   +4.11%  (p=0.028 n=10+9)
JumpdestOpAnalysis/PUSH3-6     1.40GB/s ± 6%  1.32GB/s ± 7%   -5.78%  (p=0.004 n=10+10)
JumpdestOpAnalysis/PUSH4-6     1.77GB/s ± 5%  1.86GB/s ± 2%   +5.24%  (p=0.000 n=10+8)
JumpdestOpAnalysis/PUSH5-6     2.16GB/s ± 7%  2.23GB/s ± 4%   +3.54%  (p=0.019 n=10+10)
JumpdestOpAnalysis/PUSH6-6     2.36GB/s ± 6%  2.52GB/s ± 5%   +6.82%  (p=0.000 n=10+9)
JumpdestOpAnalysis/PUSH7-6     2.54GB/s ± 7%  2.67GB/s ± 4%   +5.28%  (p=0.001 n=10+10)
JumpdestOpAnalysis/PUSH8-6     1.99GB/s ± 2%  2.57GB/s ± 5%  +29.06%  (p=0.000 n=9+10)
JumpdestOpAnalysis/PUSH9-6     2.36GB/s ± 7%  2.59GB/s ± 6%   +9.69%  (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH10-6    2.41GB/s ± 6%  2.41GB/s ± 1%     ~     (p=0.460 n=10+8)
JumpdestOpAnalysis/PUSH11-6    2.22GB/s ± 6%  2.49GB/s ± 9%  +12.04%  (p=0.000 n=10+10)
JumpdestOpAnalysis/PUSH12-6    2.60GB/s ± 5%  2.81GB/s ± 3%   +8.15%  (p=0.000 n=10+9)
JumpdestOpAnalysis/PUSH13-6    2.83GB/s ± 4%  2.82GB/s ± 5%     ~     (p=1.000 n=9+10)
JumpdestOpAnalysis/PUSH14-6    2.86GB/s ± 8%  3.08GB/s ± 1%   +7.54%  (p=0.000 n=10+8)
JumpdestOpAnalysis/PUSH15-6    2.92GB/s ± 5%  3.15GB/s ± 4%   +7.73%  (p=0.000 n=9+9)
JumpdestOpAnalysis/PUSH16-6    4.05GB/s ± 5%  4.03GB/s ± 6%     ~     (p=0.529 n=10+10)
JumpdestOpAnalysis/PUSH17-6    4.14GB/s ± 6%  4.11GB/s ± 1%     ~     (p=0.897 n=10+8)
JumpdestOpAnalysis/PUSH18-6    3.81GB/s ± 5%  3.74GB/s ± 6%     ~     (p=0.661 n=10+9)
JumpdestOpAnalysis/PUSH19-6    3.64GB/s ± 7%  3.26GB/s ±32%     ~     (p=0.105 n=10+10)
JumpdestOpAnalysis/PUSH20-6    3.90GB/s ± 5%  3.93GB/s ± 9%     ~     (p=0.400 n=9+10)
JumpdestOpAnalysis/PUSH21-6    4.03GB/s ± 8%  3.88GB/s ± 7%   -3.53%  (p=0.043 n=10+9)
JumpdestOpAnalysis/PUSH22-6    4.25GB/s ± 2%  4.14GB/s ±10%     ~     (p=0.546 n=9+9)
JumpdestOpAnalysis/PUSH23-6    4.25GB/s ± 3%  4.10GB/s ± 7%   -3.53%  (p=0.006 n=8+10)
JumpdestOpAnalysis/PUSH24-6    4.52GB/s ± 4%  4.46GB/s ± 8%     ~     (p=0.604 n=9+10)
JumpdestOpAnalysis/PUSH25-6    4.29GB/s ± 7%  4.31GB/s ± 9%     ~     (p=0.684 n=10+10)
JumpdestOpAnalysis/PUSH26-6    4.11GB/s ± 5%  4.07GB/s ±11%     ~     (p=0.853 n=10+10)
JumpdestOpAnalysis/PUSH27-6    4.04GB/s ± 6%  4.06GB/s ± 8%     ~     (p=0.796 n=10+10)
JumpdestOpAnalysis/PUSH28-6    4.27GB/s ± 9%  4.21GB/s ± 3%     ~     (p=0.143 n=10+10)
JumpdestOpAnalysis/PUSH29-6    4.37GB/s ± 4%  4.35GB/s ± 9%     ~     (p=1.000 n=10+10)
JumpdestOpAnalysis/PUSH30-6    4.38GB/s ± 5%  4.42GB/s ± 5%     ~     (p=0.684 n=10+10)
JumpdestOpAnalysis/PUSH31-6    4.38GB/s ± 3%  4.46GB/s ± 6%     ~     (p=0.165 n=10+10)
JumpdestOpAnalysis/PUSH32-6    5.15GB/s ± 1%  5.15GB/s ± 7%     ~     (p=0.958 n=6+10)
JumpdestOpAnalysis/JUMPDEST-6  1.36GB/s ± 5%  1.55GB/s ± 5%  +14.30%  (p=0.000 n=10+10)
JumpdestOpAnalysis/STOP-6      1.37GB/s ± 4%  1.56GB/s ± 5%  +13.45%  (p=0.000 n=9+10)

So in summary: the worst cases were improved or not changed. LGTM

Contributor

@holiman holiman left a comment

LGTM

@holiman holiman added this to the 1.10.14 milestone Dec 17, 2021
@holiman holiman merged commit 81ec6b1 into ethereum:master Dec 17, 2021
@holiman holiman deleted the analysis_reversed branch December 17, 2021 09:32
sidhujag pushed a commit to syscoin/go-ethereum that referenced this pull request Dec 19, 2021
* core/vm: reverse bit order in bytes of code bitmap

This bit order is more natural for bit manipulation operations and we
can eliminate some small number of CPU instructions.

* core/vm: drop lookup table
JacekGlen pushed a commit to JacekGlen/go-ethereum that referenced this pull request May 26, 2022
* core/vm: reverse bit order in bytes of code bitmap

This bit order is more natural for bit manipulation operations and we
can eliminate some small number of CPU instructions.

* core/vm: drop lookup table
gzliudan pushed a commit to gzliudan/XDPoSChain that referenced this pull request Feb 23, 2024
* core/vm: reverse bit order in bytes of code bitmap

This bit order is more natural for bit manipulation operations and we
can eliminate some small number of CPU instructions.

* core/vm: drop lookup table
gzliudan pushed a commit to gzliudan/XDPoSChain that referenced this pull request Feb 26, 2024
* core/vm: reverse bit order in bytes of code bitmap

This bit order is more natural for bit manipulation operations and we
can eliminate some small number of CPU instructions.

* core/vm: drop lookup table
gzliudan pushed a commit to gzliudan/XDPoSChain that referenced this pull request Feb 27, 2024
* core/vm: reverse bit order in bytes of code bitmap

This bit order is more natural for bit manipulation operations and we
can eliminate some small number of CPU instructions.

* core/vm: drop lookup table
gzliudan pushed a commit to gzliudan/XDPoSChain that referenced this pull request Mar 1, 2024
* core/vm: reverse bit order in bytes of code bitmap

This bit order is more natural for bit manipulation operations and we
can eliminate some small number of CPU instructions.

* core/vm: drop lookup table