Improve branch prediction #2140

Open
wants to merge 1 commit into
base: master

saleyn
Contributor

@saleyn saleyn commented Feb 25, 2024

This PR adds some unlikely compiler hints to improve branch prediction.
For some reason, wherever SIMDJSON_TRY is used, the compiler complained about uninitialized variables, so explicit initialization was added.
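
For readers unfamiliar with such hints, here is a minimal sketch of what an "unlikely" annotation can look like on GCC or Clang. The macro name `SKETCH_UNLIKELY`, the `sketch_error` enum, and `read_byte` are hypothetical names made up for illustration; they are not taken from this PR or from simdjson.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical macro name. On GCC/Clang it tells the optimizer that the
// condition is expected to be false; elsewhere it is a plain pass-through.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

// Hypothetical error type for this sketch only.
enum class sketch_error { success, out_of_bounds };

// Reads one byte from buf; the bounds failure is marked as the cold path,
// so the compiler can lay out the success path as the fall-through.
inline sketch_error read_byte(const std::uint8_t* buf, std::size_t len,
                              std::size_t pos, std::uint8_t& out) {
  if (SKETCH_UNLIKELY(pos >= len)) {
    return sketch_error::out_of_bounds;
  }
  out = buf[pos];
  return sketch_error::success;
}
```

The hint does not change behavior, only code layout and possibly inlining decisions, which is why any benefit has to be measured rather than assumed.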

The performance results:

  1. master branch:
$ ./build/benchmark/dom/parse jsonexamples/twitter.json
number of iterations 200

jsonexamples/twitter.json
=========================
     9867 blocks -     631515 bytes - 55262 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages (excluding allocation)
|    Speed        :  25.0304 ns per block ( 94.64%) -   0.3911 ns per byte -   4.4696 ns per structural -   2.5567 GB/s
|    Cycles       : 100.0264 per block    ( 95.21%) -   1.5630 per byte    -  17.8615 per structural    -    3.996 GHz est. frequency
|    Instructions : 288.1751 per block    (100.00%) -   4.5030 per byte    -  51.4587 per structural    -    2.881 per cycle
|    Misses       :    5267 branch misses ( 96.13%) - 37 cache misses (  2.83%) - 83909.00 cache references
|- Stage 1
|    Speed        :  11.9173 ns per block ( 45.06%) -   0.1862 ns per byte -   2.1280 ns per structural -   5.3700 GB/s
|    Cycles       :  47.6119 per block    ( 45.32%) -   0.7440 per byte    -   8.5019 per structural    -    3.995 GHz est. frequency
|    Instructions : 140.0335 per block    ( 48.59%) -   2.1882 per byte    -  25.0054 per structural    -    2.941 per cycle
|    Misses       :    3298 branch misses ( 60.19%) - 33 cache misses (  2.52%) - 29513.00 cache references
|- Stage 2
|    Speed        :  13.1131 ns per block ( 49.58%) -   0.2049 ns per byte -   2.3416 ns per structural -   4.8803 GB/s
|    Cycles       :  52.4146 per block    ( 49.89%) -   0.8190 per byte    -   9.3595 per structural    -    3.997 GHz est. frequency
|    Instructions : 148.1416 per block    ( 51.41%) -   2.3148 per byte    -  26.4533 per structural    -    2.826 per cycle
|    Misses       :    1969 branch misses ( 35.94%) - 4 cache misses (  0.31%) - 54396.00 cache references

4048.6 documents parsed per second (best)
  2. This PR:
$ ./build/benchmark/dom/parse jsonexamples/twitter.json
number of iterations 200

jsonexamples/twitter.json
=========================
     9867 blocks -     631515 bytes - 55262 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages (excluding allocation)
|    Speed        :  23.6726 ns per block ( 77.56%) -   0.3699 ns per byte -   4.2272 ns per structural -   2.7034 GB/s
|    Cycles       :  99.4666 per block    ( 88.68%) -   1.5543 per byte    -  17.7615 per structural    -    4.202 GHz est. frequency
|    Instructions : 286.1744 per block    (100.00%) -   4.4717 per byte    -  51.1015 per structural    -    2.877 per cycle
|    Misses       :    4955 branch misses ( 84.24%) - 47 cache misses (  1.25%) - 78865.00 cache references
|- Stage 1
|    Speed        :  11.4512 ns per block ( 37.52%) -   0.1789 ns per byte -   2.0448 ns per structural -   5.5886 GB/s
|    Cycles       :  48.0296 per block    ( 42.82%) -   0.7505 per byte    -   8.5765 per structural    -    4.194 GHz est. frequency
|    Instructions : 140.0335 per block    ( 48.93%) -   2.1882 per byte    -  25.0054 per structural    -    2.916 per cycle
|    Misses       :    3086 branch misses ( 52.46%) - 3 cache misses (  0.08%) - 27509.00 cache references
|- Stage 2
|    Speed        :  12.2011 ns per block ( 39.97%) -   0.1907 ns per byte -   2.1787 ns per structural -   5.2451 GB/s
|    Cycles       :  51.2163 per block    ( 45.66%) -   0.8003 per byte    -   9.1456 per structural    -    4.198 GHz est. frequency
|    Instructions : 146.1409 per block    ( 51.07%) -   2.2836 per byte    -  26.0960 per structural    -    2.853 per cycle
|    Misses       :    1951 branch misses ( 33.17%) - 14 cache misses (  0.37%) - 51053.00 cache references

4280.8 documents parsed per second (best)

@saleyn saleyn force-pushed the unlikely branch 4 times, most recently from 10bc5bb to 9f45a68 (February 25, 2024 05:58)
@lemire
Member

lemire commented Feb 25, 2024

Sounds promising. I will review shortly.

@saleyn
Contributor Author

saleyn commented Feb 25, 2024

Maybe you have an idea why, once these unlikely checks are added, the compiler thinks those variables may be uninitialized unless they are explicitly initialized. I am having a hard time figuring out the reason.

@lemire
Member

lemire commented Feb 25, 2024

@saleyn My interpretation is that the compiler fails to convince itself that the variable is initialized. I would happily ignore this static analysis, but then we will get endless "bug reports" about it, so we better silence these warnings ourselves.

@lemire
Member

lemire commented Feb 25, 2024

Silencing a warning in GCC can be done without changing the code, although that can be more trouble than it is worth.
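
As a rough illustration of the "without changing the code" route, GCC's diagnostic pragmas can disable `-Wmaybe-uninitialized` around a region. The helper names below (`try_parse_digit`, `digit_or_zero`) are hypothetical and exist only for this sketch. Because this warning is produced late, after inlining and optimization, the pragma may not always take effect where one expects, which may be part of the trouble alluded to above.

```cpp
#include <cstdint>

// Hypothetical helper: writes `out` only on success, like a get()-style accessor.
inline bool try_parse_digit(char c, std::uint32_t& out) {
  if (c < '0' || c > '9') { return false; }
  out = static_cast<std::uint32_t>(c - '0');
  return true;
}

// Clang does not know -Wmaybe-uninitialized, so the pragmas are GCC-only.
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wmaybe-uninitialized"
#endif
inline std::uint32_t digit_or_zero(char c) {
  std::uint32_t value;  // deliberately left uninitialized
  const bool ok = try_parse_digit(c, value);
  return ok ? value : 0u;  // `value` is only read when it was written
}
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC diagnostic pop
#endif
```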

@saleyn
Contributor Author

saleyn commented Feb 25, 2024

I just pushed another version that places the initialization only inside the get() functions. Hopefully this solves the issue, although I don't understand why adding the unlikely check suddenly makes the compiler fail to detect initialization in the original code.
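
The approach described here can be sketched as follows, under stated assumptions: `sketch_value`, `sketch_error`, and `SKETCH_TRY` are hypothetical stand-ins, not the actual simdjson types or the real `SIMDJSON_TRY` macro. The point is only that the out-parameter is written on every path inside `get()`, so callers can leave their locals uninitialized without the compiler having to prove anything across the early return.

```cpp
#include <cstdint>

// Hypothetical error type for this sketch only.
enum class sketch_error { success, incorrect_type };

// Hypothetical TRY-style macro: return early if the expression yields an error.
#define SKETCH_TRY(expr)                              \
  do {                                                \
    const sketch_error sketch_err_ = (expr);          \
    if (sketch_err_ != sketch_error::success) {       \
      return sketch_err_;                             \
    }                                                 \
  } while (0)

struct sketch_value {
  bool is_number;
  std::int64_t number;

  // The out-parameter is assigned before any early return, so it is
  // initialized on the error path as well as on the success path.
  sketch_error get(std::int64_t& out) const {
    out = 0;
    if (!is_number) { return sketch_error::incorrect_type; }
    out = number;
    return sketch_error::success;
  }
};

// Caller with uninitialized locals, relying on get() to write them.
inline sketch_error sum(const sketch_value& a, const sketch_value& b,
                        std::int64_t& total) {
  std::int64_t x, y;
  SKETCH_TRY(a.get(x));
  SKETCH_TRY(b.get(y));
  total = x + y;
  return sketch_error::success;
}
```

The extra store on the error path is the cost of this design; the alternative is to initialize at every call site.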

@lemire
Member

lemire commented Feb 25, 2024

@saleyn We can manually silence the maybe-uninitialized warnings/errors, but as I wrote, that's not always trivial.

@saleyn
Contributor Author

saleyn commented Feb 26, 2024

I am travelling until mid-March and am unlikely to be able to look into this. I think the actions are failing because somewhere else I failed to initialize the argument in a get(... arg) function. If you have time to correct this, maybe the actions would pass?

@lemire
Member

lemire commented Feb 27, 2024

On an x64 processor (Ice Lake) and GCC 12, I find that this PR is maybe slightly worse on bench_ondemand results, but the difference is so small that I would not say it is conclusive. It is more likely that the difference is too small and I cannot measure it accurately. I also tried on macOS, and the results were similarly not conclusive.

Your results appear to indicate better performance for this PR in the DOM tests, but your clock speed is not the same. You do report fewer instructions, which is a good thing.
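
To make the clock-speed point concrete, here is a small sketch (the function names are made up for illustration) that relates the reported figures: time per block is cycles per block divided by the frequency in GHz. The numbers in the usage note are the "All Stages" values from the original report at the top of this thread.

```cpp
// ns per block = cycles per block / frequency in GHz
// (a GHz is one cycle per nanosecond).
inline double ns_per_block(double cycles_per_block, double ghz) {
  return cycles_per_block / ghz;
}

// Signed relative change from `before` to `after`; negative means a reduction.
inline double relative_change(double before, double after) {
  return (after - before) / before;
}
```

With the reported values, `ns_per_block(100.0264, 3.996)` and `ns_per_block(99.4666, 4.202)` come out close to the reported 25.03 ns and 23.67 ns. `relative_change(100.0264, 99.4666)` is about -0.6% in cycles, whereas `relative_change(25.0304, 23.6726)` is about -5.4% in time, so most of the wall-clock gain in that report tracks the roughly 5% higher estimated frequency rather than the code change.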

Let me run DOM tests on this system... (x64/Ice Lake) and GCC 12.

This PR (repeated 3 times)

$ ./build/benchmark/dom/parsepr jsonexamples/twitter.json -n 4000
number of iterations 4000

jsonexamples/twitter.json
=========================
     9867 blocks -     631515 bytes - 55262 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages (excluding allocation)
|    Speed        :  18.8499 ns per block ( 87.92%) -   0.2945 ns per byte -   3.3660 ns per structural -   3.3950 GB/s
|    Cycles       :  58.3621 per block    ( 98.17%) -   0.9120 per byte    -  10.4216 per structural    -    3.096 GHz est. frequency
|    Instructions : 206.4364 per block    (100.00%) -   3.2258 per byte    -  36.8628 per structural    -    3.537 per cycle
|    Misses       :     484 branch misses ( 97.91%) - 1 cache misses ( 50.53%) - 26604.00 cache references
|- Stage 1
|    Speed        :   6.4073 ns per block ( 29.88%) -   0.1001 ns per byte -   1.1441 ns per structural -   9.9881 GB/s
|    Cycles       :  19.8528 per block    ( 33.39%) -   0.3102 per byte    -   3.5451 per structural    -    3.098 GHz est. frequency
|    Instructions :  69.8881 per block    ( 33.85%) -   1.0921 per byte    -  12.4798 per structural    -    3.520 per cycle
|    Misses       :      93 branch misses ( 18.81%) - 0 cache misses (  0.00%) - 11215.00 cache references
|- Stage 2
|    Speed        :  12.4110 ns per block ( 57.89%) -   0.1939 ns per byte -   2.2162 ns per structural -   5.1564 GB/s
|    Cycles       :  38.4054 per block    ( 64.60%) -   0.6001 per byte    -   6.8579 per structural    -    3.094 GHz est. frequency
|    Instructions : 136.5482 per block    ( 66.15%) -   2.1337 per byte    -  24.3831 per structural    -    3.555 per cycle
|    Misses       :     385 branch misses ( 77.88%) - 1 cache misses ( 50.53%) - 15451.00 cache references

5376.0 documents parsed per second (best)

$ ./build/benchmark/dom/parsepr jsonexamples/twitter.json -n 4000
number of iterations 4000

jsonexamples/twitter.json
=========================
     9867 blocks -     631515 bytes - 55262 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages (excluding allocation)
|    Speed        :  18.7682 ns per block ( 88.68%) -   0.2933 ns per byte -   3.3514 ns per structural -   3.4098 GB/s
|    Cycles       :  58.1223 per block    ( 98.25%) -   0.9082 per byte    -  10.3788 per structural    -    3.097 GHz est. frequency
|    Instructions : 206.4364 per block    (100.00%) -   3.2258 per byte    -  36.8628 per structural    -    3.552 per cycle
|    Misses       :     468 branch misses ( 95.76%) - 0 cache misses (  0.00%) - 26693.00 cache references
|- Stage 1
|    Speed        :   6.4103 ns per block ( 30.29%) -   0.1002 ns per byte -   1.1447 ns per structural -   9.9833 GB/s
|    Cycles       :  19.8643 per block    ( 33.58%) -   0.3104 per byte    -   3.5471 per structural    -    3.099 GHz est. frequency
|    Instructions :  69.8881 per block    ( 33.85%) -   1.0921 per byte    -  12.4798 per structural    -    3.518 per cycle
|    Misses       :      96 branch misses ( 19.64%) - 0 cache misses (  0.00%) - 11551.00 cache references
|- Stage 2
|    Speed        :  12.3142 ns per block ( 58.19%) -   0.1924 ns per byte -   2.1989 ns per structural -   5.1969 GB/s
|    Cycles       :  38.1207 per block    ( 64.44%) -   0.5957 per byte    -   6.8071 per structural    -    3.096 GHz est. frequency
|    Instructions : 136.5482 per block    ( 66.15%) -   2.1337 per byte    -  24.3831 per structural    -    3.582 per cycle
|    Misses       :     374 branch misses ( 76.53%) - 0 cache misses (  0.00%) - 15159.00 cache references

5399.4 documents parsed per second (best)

$ ./build/benchmark/dom/parsepr jsonexamples/twitter.json -n 4000
number of iterations 4000

jsonexamples/twitter.json
=========================
     9867 blocks -     631515 bytes - 55262 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages (excluding allocation)
|    Speed        :  18.7989 ns per block ( 87.33%) -   0.2938 ns per byte -   3.3569 ns per structural -   3.4042 GB/s
|    Cycles       :  58.2127 per block    ( 98.10%) -   0.9096 per byte    -  10.3949 per structural    -    3.097 GHz est. frequency
|    Instructions : 206.4364 per block    (100.00%) -   3.2258 per byte    -  36.8628 per structural    -    3.546 per cycle
|    Misses       :     468 branch misses ( 94.60%) - 0 cache misses (  0.00%) - 27888.00 cache references
|- Stage 1
|    Speed        :   6.4219 ns per block ( 29.83%) -   0.1003 ns per byte -   1.1467 ns per structural -   9.9654 GB/s
|    Cycles       :  19.8962 per block    ( 33.53%) -   0.3109 per byte    -   3.5528 per structural    -    3.098 GHz est. frequency
|    Instructions :  69.8881 per block    ( 33.85%) -   1.0921 per byte    -  12.4798 per structural    -    3.513 per cycle
|    Misses       :      98 branch misses ( 19.81%) - 0 cache misses (  0.00%) - 11224.00 cache references
|- Stage 2
|    Speed        :  12.3586 ns per block ( 57.41%) -   0.1931 ns per byte -   2.2069 ns per structural -   5.1783 GB/s
|    Cycles       :  38.2584 per block    ( 64.47%) -   0.5978 per byte    -   6.8317 per structural    -    3.096 GHz est. frequency
|    Instructions : 136.5482 per block    ( 66.15%) -   2.1337 per byte    -  24.3831 per structural    -    3.569 per cycle
|    Misses       :     371 branch misses ( 74.99%) - 0 cache misses (  0.00%) - 17065.00 cache references

5390.6 documents parsed per second (best)

Main branch (repeated 3 times)

$ ./build/benchmark/dom/parse jsonexamples/twitter.json  -n 4000
number of iterations 4000

jsonexamples/twitter.json
=========================
     9867 blocks -     631515 bytes - 55262 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages (excluding allocation)
|    Speed        :  18.6728 ns per block ( 89.77%) -   0.2918 ns per byte -   3.3344 ns per structural -   3.4272 GB/s
|    Cycles       :  57.8292 per block    ( 98.45%) -   0.9036 per byte    -  10.3264 per structural    -    3.097 GHz est. frequency
|    Instructions : 207.5327 per block    (100.00%) -   3.2429 per byte    -  37.0586 per structural    -    3.589 per cycle
|    Misses       :     462 branch misses ( 97.47%) - 8 cache misses ( 70.45%) - 26667.00 cache references
|- Stage 1
|    Speed        :   6.3749 ns per block ( 30.65%) -   0.0996 ns per byte -   1.1384 ns per structural -  10.0387 GB/s
|    Cycles       :  19.7597 per block    ( 33.64%) -   0.3088 per byte    -   3.5284 per structural    -    3.100 GHz est. frequency
|    Instructions :  69.8881 per block    ( 33.68%) -   1.0921 per byte    -  12.4798 per structural    -    3.537 per cycle
|    Misses       :      84 branch misses ( 17.72%) - 5 cache misses ( 44.03%) - 11294.00 cache references
|- Stage 2
|    Speed        :  12.2561 ns per block ( 58.92%) -   0.1915 ns per byte -   2.1885 ns per structural -   5.2216 GB/s
|    Cycles       :  37.9380 per block    ( 64.59%) -   0.5928 per byte    -   6.7745 per structural    -    3.095 GHz est. frequency
|    Instructions : 137.6446 per block    ( 66.32%) -   2.1508 per byte    -  24.5789 per structural    -    3.628 per cycle
|    Misses       :     368 branch misses ( 77.64%) - 4 cache misses ( 35.22%) - 15199.00 cache references

5427.0 documents parsed per second (best)

$ ./build/benchmark/dom/parse jsonexamples/twitter.json  -n 4000
number of iterations 4000

jsonexamples/twitter.json
=========================
     9867 blocks -     631515 bytes - 55262 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages (excluding allocation)
|    Speed        :  18.9860 ns per block ( 87.36%) -   0.2967 ns per byte -   3.3903 ns per structural -   3.3707 GB/s
|    Cycles       :  58.7981 per block    ( 98.28%) -   0.9188 per byte    -  10.4994 per structural    -    3.097 GHz est. frequency
|    Instructions : 207.5327 per block    (100.00%) -   3.2429 per byte    -  37.0586 per structural    -    3.530 per cycle
|    Misses       :     459 branch misses ( 96.12%) - 4 cache misses ( 17.35%) - 27450.00 cache references
|- Stage 1
|    Speed        :   6.4357 ns per block ( 29.61%) -   0.1006 ns per byte -   1.1492 ns per structural -   9.9440 GB/s
|    Cycles       :  19.9421 per block    ( 33.33%) -   0.3116 per byte    -   3.5610 per structural    -    3.099 GHz est. frequency
|    Instructions :  69.8881 per block    ( 33.68%) -   1.0921 per byte    -  12.4798 per structural    -    3.505 per cycle
|    Misses       :      93 branch misses ( 19.48%) - 0 cache misses (  0.00%) - 11112.00 cache references
|- Stage 2
|    Speed        :  12.5173 ns per block ( 57.60%) -   0.1956 ns per byte -   2.2352 ns per structural -   5.1126 GB/s
|    Cycles       :  38.7468 per block    ( 64.77%) -   0.6055 per byte    -   6.9189 per structural    -    3.095 GHz est. frequency
|    Instructions : 137.6446 per block    ( 66.32%) -   2.1508 per byte    -  24.5789 per structural    -    3.552 per cycle
|    Misses       :     361 branch misses ( 75.60%) - 4 cache misses ( 17.35%) - 16219.00 cache references

5337.5 documents parsed per second (best)

$ ./build/benchmark/dom/parse jsonexamples/twitter.json  -n 4000
number of iterations 4000

jsonexamples/twitter.json
=========================
     9867 blocks -     631515 bytes - 55262 structurals (  8.8 %)
special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)

All Stages (excluding allocation)
|    Speed        :  18.6361 ns per block ( 88.83%) -   0.2912 ns per byte -   3.3278 ns per structural -   3.4340 GB/s
|    Cycles       :  57.7138 per block    ( 98.46%) -   0.9018 per byte    -  10.3058 per structural    -    3.097 GHz est. frequency
|    Instructions : 207.5327 per block    (100.00%) -   3.2429 per byte    -  37.0586 per structural    -    3.596 per cycle
|    Misses       :     475 branch misses ( 95.89%) - 0 cache misses (  0.00%) - 26636.00 cache references
|- Stage 1
|    Speed        :   6.3687 ns per block ( 30.36%) -   0.0995 ns per byte -   1.1372 ns per structural -  10.0486 GB/s
|    Cycles       :  19.7399 per block    ( 33.68%) -   0.3085 per byte    -   3.5249 per structural    -    3.100 GHz est. frequency
|    Instructions :  69.8881 per block    ( 33.68%) -   1.0921 per byte    -  12.4798 per structural    -    3.540 per cycle
|    Misses       :      99 branch misses ( 19.99%) - 0 cache misses (  0.00%) - 11500.00 cache references
|- Stage 2
|    Speed        :  12.2440 ns per block ( 58.36%) -   0.1913 ns per byte -   2.1864 ns per structural -   5.2267 GB/s
|    Cycles       :  37.9011 per block    ( 64.66%) -   0.5922 per byte    -   6.7679 per structural    -    3.095 GHz est. frequency
|    Instructions : 137.6446 per block    ( 66.32%) -   2.1508 per byte    -  24.5789 per structural    -    3.632 per cycle
|    Misses       :     368 branch misses ( 74.29%) - 0 cache misses (  0.00%) - 15271.00 cache references

5437.7 documents parsed per second (best)

Analysis (DOM)

Just looking at stage 2 (DOM), I get that the main branch speed is in the interval 5.11 GB/s to 5.23 GB/s. The PR is in the interval 5.16 GB/s to 5.20 GB/s, using three trials. So this PR is anywhere between 2% better and 1% worse.

When in doubt, I tend to go with the instruction count, and the PR seems to reduce it by 0.8% in stage 2 (136.5482 vs. 137.6446 instructions per block).

So I tend to believe that this PR is a win, although a small win.
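
The "2% better to 1% worse" bracket can be reproduced with a short sketch. The `interval` struct and both function names are hypothetical; the inputs in the usage note are the three stage 2 GB/s readings per branch listed above.

```cpp
#include <algorithm>
#include <array>

struct interval {
  double lo;
  double hi;
};

// Smallest and largest of three trial measurements.
inline interval range_of(const std::array<double, 3>& samples) {
  const auto mm = std::minmax_element(samples.begin(), samples.end());
  return {*mm.first, *mm.second};
}

// Worst-case and best-case ratio of candidate over baseline:
// worst = slowest candidate vs. fastest baseline, best = the reverse.
inline interval ratio_bounds(interval baseline, interval candidate) {
  return {candidate.lo / baseline.hi, candidate.hi / baseline.lo};
}
```

Feeding in main = {5.2216, 5.1126, 5.2267} and PR = {5.1564, 5.1969, 5.1783} gives a ratio between roughly 0.987 and 1.016. With only three trials per branch this is a crude bracket rather than a confidence interval.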

kostya ondemand (this PR)

kostya<simdjson_ondemand>/manual_time   50436659 ns     61317425 ns           14 best_branch_miss=320.223k best_bytes_per_sec=2.73101G best_cache_miss=1.47762M best_cache_ref=6.24866M best_cycles=158.335M best_cycles_per_byte=1.15316 best_docs_per_sec=19.8901 best_frequency=3.1493G best_instructions=514.22M best_instructions_per_byte=3.74509 best_instructions_per_cycle=3.24766 best_items_per_sec=10.4281M branch_miss=321.055k bytes=137.305M bytes_per_second=2.53536G/s cache_miss=1.48088M cache_ref=6.24934M cycles=158.806M cycles_per_byte=1.15659 docs_per_sec=19.8268/s frequency=3.14863G/s instructions=514.22M instructions_per_byte=3.74509 instructions_per_cycle=3.23803 items=524.288k items_per_second=10.395M/s [BEST: throughput=  2.73 GB/s doc_throughput=    19 docs/s instructions=   514219519 cycles=   158335209 branch_miss=  320223 cache_miss= 1477619 cache_ref=   6248662 items=    524288 avg_time=  50436659 ns]
kostya<simdjson_ondemand>/manual_time   50520943 ns     61601632 ns           14 best_branch_miss=317k best_bytes_per_sec=2.72754G best_cache_miss=1.50372M best_cache_ref=6.33852M best_cycles=158.467M best_cycles_per_byte=1.15412 best_docs_per_sec=19.8649 best_frequency=3.14792G best_instructions=514.22M best_instructions_per_byte=3.74509 best_instructions_per_cycle=3.24496 best_items_per_sec=10.4149M branch_miss=318.11k bytes=137.305M bytes_per_second=2.53113G/s cache_miss=1.50425M cache_ref=6.3387M cycles=159.002M cycles_per_byte=1.15802 docs_per_sec=19.7938/s frequency=3.14725G/s instructions=514.22M instructions_per_byte=3.74509 instructions_per_cycle=3.23404 items=524.288k items_per_second=10.3776M/s [BEST: throughput=  2.73 GB/s doc_throughput=    19 docs/s instructions=   514219518 cycles=   158466991 branch_miss=  317000 cache_miss= 1503725 cache_ref=   6338515 items=    524288 avg_time=  50520942 ns]
kostya<simdjson_ondemand>/manual_time   50732221 ns     61651891 ns           14 best_branch_miss=317.674k best_bytes_per_sec=2.71335G best_cache_miss=1.48926M best_cache_ref=6.25722M best_cycles=159.359M best_cycles_per_byte=1.16062 best_docs_per_sec=19.7615 best_frequency=3.14916G best_instructions=514.22M best_instructions_per_byte=3.74509 best_instructions_per_cycle=3.22681 best_items_per_sec=10.3607M branch_miss=317.764k bytes=137.305M bytes_per_second=2.52059G/s cache_miss=1.48845M cache_ref=6.25745M cycles=159.726M cycles_per_byte=1.16329 docs_per_sec=19.7113/s frequency=3.14841G/s instructions=514.22M instructions_per_byte=3.74509 instructions_per_cycle=3.21938 items=524.288k items_per_second=10.3344M/s [BEST: throughput=  2.71 GB/s doc_throughput=    19 docs/s instructions=   514219519 cycles=   159358691 branch_miss=  317674 cache_miss= 1489255 cache_ref=   6257220 items=    524288 avg_time=  50732221 ns]

kostya ondemand (main branch)

kostya<simdjson_ondemand>/manual_time   50533850 ns     61401809 ns           14 best_branch_miss=315.181k best_bytes_per_sec=2.72971G best_cache_miss=1.47801M best_cache_ref=6.25266M best_cycles=158.413M best_cycles_per_byte=1.15373 best_docs_per_sec=19.8806 best_frequency=3.14935G best_instructions=510.55M best_instructions_per_byte=3.71836 best_instructions_per_cycle=3.2229 best_items_per_sec=10.4232M branch_miss=314.899k bytes=137.305M bytes_per_second=2.53049G/s cache_miss=1.47842M cache_ref=6.25278M cycles=159.101M cycles_per_byte=1.15874 docs_per_sec=19.7887/s frequency=3.14841G/s instructions=510.55M instructions_per_byte=3.71836 instructions_per_cycle=3.20896 items=524.288k items_per_second=10.375M/s [BEST: throughput=  2.73 GB/s doc_throughput=    19 docs/s instructions=   510549503 cycles=   158413320 branch_miss=  315181 cache_miss= 1478014 cache_ref=   6252663 items=    524288 avg_time=  50533850 ns]
kostya<simdjson_ondemand>/manual_time   49928598 ns     60663634 ns           14 best_branch_miss=313.717k best_bytes_per_sec=2.76004G best_cache_miss=1.47957M best_cache_ref=6.21029M best_cycles=156.674M best_cycles_per_byte=1.14107 best_docs_per_sec=20.1015 best_frequency=3.14939G best_instructions=510.55M best_instructions_per_byte=3.71836 best_instructions_per_cycle=3.25867 best_items_per_sec=10.539M branch_miss=314.029k bytes=137.305M bytes_per_second=2.56116G/s cache_miss=1.48521M cache_ref=6.21166M cycles=157.202M cycles_per_byte=1.14491 docs_per_sec=20.0286/s frequency=3.14853G/s instructions=510.55M instructions_per_byte=3.71836 instructions_per_cycle=3.24773 items=524.288k items_per_second=10.5008M/s [BEST: throughput=  2.76 GB/s doc_throughput=    20 docs/s instructions=   510549502 cycles=   156674234 branch_miss=  313717 cache_miss= 1479566 cache_ref=   6210292 items=    524288 avg_time=  49928598 ns]
kostya<simdjson_ondemand>/manual_time   50223191 ns     60923453 ns           14 best_branch_miss=313.433k best_bytes_per_sec=2.74307G best_cache_miss=1.48189M best_cache_ref=6.19918M best_cycles=157.645M best_cycles_per_byte=1.14814 best_docs_per_sec=19.978 best_frequency=3.14943G best_instructions=510.55M best_instructions_per_byte=3.71836 best_instructions_per_cycle=3.23859 best_items_per_sec=10.4742M branch_miss=313.453k bytes=137.305M bytes_per_second=2.54614G/s cache_miss=1.48057M cache_ref=6.20006M cycles=158.128M cycles_per_byte=1.15165 docs_per_sec=19.9111/s frequency=3.1485G/s instructions=510.55M instructions_per_byte=3.71836 instructions_per_cycle=3.22871 items=524.288k items_per_second=10.4392M/s [BEST: throughput=  2.74 GB/s doc_throughput=    19 docs/s instructions=   510549502 cycles=   157645430 branch_miss=  313433 cache_miss= 1481890 cache_ref=   6199176 items=    524288 avg_time=  50223190 ns]

Analysis (ondemand)

Focusing solely on one ondemand benchmark (kostya), this PR seems an overall negative.
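
As a rough worked check using the figures listed above (best_instructions, and the lowest best_cycles of the three runs per branch, both in millions), the relative increase of the PR over main is:

```latex
\[
\frac{I_{\text{PR}} - I_{\text{main}}}{I_{\text{main}}}
  = \frac{514.22 - 510.55}{510.55} \approx 0.0072,
\qquad
\frac{C_{\text{PR}} - C_{\text{main}}}{C_{\text{main}}}
  = \frac{158.335 - 156.674}{156.674} \approx 0.0106
\]
```

That is about 0.7% more instructions and about 1% more cycles on this one benchmark. The cycle figure is close to the run-to-run spread visible in the three trials.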

Recommendation (tentative)

I think that if this PR focused solely on the DOM code, it would be easier to consider it. As things stand, I don't have the data to convince myself that there are gains with On Demand benchmarks. It is possible that this PR creates a slight performance regression on On Demand.

Of course, maybe there is a methodological issue.

Risks: I should stress that results may vary depending on the exact compiler and processor. We cannot rule out that this PR could be a net negative when using different systems, even with the DOM parser. So some caution is required.
