investigate loop unrolling #317

Draft · wants to merge 2 commits into base: master

Conversation

@aspic-fish (Contributor) commented Sep 20, 2023

Recently I learned that all Intel processors since Sandy Bridge can do two `_mm_loadu_si128` loads at the same time, using ports 2 and 3. So I tried 2 sequential `_mm_loadu_si128` calls per iteration and it was a success. Then I also tried 4 and 8: 4 gave me an additional boost, but 8 did not.
Alas, when I pulled the upstream commits, I got a significant performance penalty for the esperanto file with MSVC. So I dropped them and started re-adding them one by one, and I found the one that causes it:
7761599 SSE UTF16 => latin1 (#311)
There seems to be nothing special about it; it just added a new dependency together with 2 other SSE implementations.
I also checked with GCC and there was no penalty.
Could it be an MSVC bug?
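A minimal sketch of this kind of 2x load unrolling (illustrative only, not the PR's actual kernel; the function name and the trivial "all ASCII?" body are placeholders for the real conversion work):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

// Illustrative only: a 2x-unrolled scan issuing two independent
// _mm_loadu_si128 loads per iteration so they can dispatch on
// ports 2 and 3 in the same cycle.
bool is_ascii_unrolled2(const uint8_t *data, size_t len) {
  __m128i acc = _mm_setzero_si128();
  size_t i = 0;
  for (; i + 32 <= len; i += 32) {
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(data + i));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(data + i + 16));
    acc = _mm_or_si128(acc, _mm_or_si128(a, b));  // OR keeps any high bit seen
  }
  uint8_t tail = 0;
  for (; i < len; i++) { tail |= data[i]; }  // scalar tail
  return _mm_movemask_epi8(acc) == 0 && (tail & 0x80) == 0;
}
```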

inlined version ======================================================================

"the commit" is 7761599,
current branch: sse_convert_latin1_to_utf8_perf
command: `benchmark -P convert_latin1_to_utf8+westmere -F *.latin1.txt`
arch: Sandy Bridge

======================================================================

windows 10

msvc VS 17.5.5

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 12.641 GB/s | 3.714 GB/s | 5.845 GB/s | 4.246 GB/s |
| current branch before the commit | 16.111 GB/s | 4.638 GB/s | 8.743 GB/s | 5.972 GB/s |
| current branch after the commit | 12.450 GB/s | 4.522 GB/s | 8.237 GB/s | 5.869 GB/s |

msvc VS 17.7.4

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 13.043 GB/s | 3.670 GB/s | 6.059 GB/s | 4.233 GB/s |
| current branch before the commit | 13.927 GB/s | 4.546 GB/s | 8.518 GB/s | 5.895 GB/s |
| current branch after the commit | 12.450 GB/s | 4.570 GB/s | 8.667 GB/s | 5.986 GB/s |

LLVM(clang-cl) 16.0.5

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.738 GB/s | 3.648 GB/s | 6.535 GB/s | 4.462 GB/s |
| current branch before the commit | 6.178 GB/s | 3.015 GB/s | 4.625 GB/s | 3.648 GB/s |
| current branch after the commit | 6.178 GB/s | 3.019 GB/s | 4.657 GB/s | 3.657 GB/s |

Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230627)

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.738 GB/s | 3.660 GB/s | 6.826 GB/s | 4.477 GB/s |
| current branch before the commit | 16.111 GB/s | 4.986 GB/s | 9.144 GB/s | 6.693 GB/s |
| current branch after the commit | 16.111 GB/s | 4.980 GB/s | 9.102 GB/s | 6.710 GB/s |

mingw-w64-ucrt-x86_64-gcc 13.1.0-7

build error.

======================================================================

wsl2 ubuntu 22.04

gcc 11.4.0

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 12.839 GB/s | 3.717 GB/s | 6.921 GB/s | 4.583 GB/s |
| current branch before the commit | 14.673 GB/s | 4.384 GB/s | 8.704 GB/s | 5.920 GB/s |
| current branch after the commit | 14.673 GB/s | 4.389 GB/s | 8.667 GB/s | 5.920 GB/s |

clang 14.0.0-1ubuntu1.1

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.104 GB/s | 3.624 GB/s | 6.622 GB/s | 4.376 GB/s |
| current branch before the commit | 5.954 GB/s | 3.006 GB/s | 4.646 GB/s | 3.652 GB/s |
| current branch after the commit | 5.911 GB/s | 3.010 GB/s | 4.657 GB/s | 3.662 GB/s |

Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230721)

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.256 GB/s | 3.743 GB/s | 6.780 GB/s | 4.590 GB/s |
| current branch before the commit | 13.927 GB/s | 4.963 GB/s | 9.315 GB/s | 7.170 GB/s |
| current branch after the commit | 13.695 GB/s | 4.975 GB/s | 9.358 GB/s | 7.246 GB/s |

The situation got even funnier when I removed all the loops except this one, and got the opposite result. And it's quite consistent between benchmarks.
msvc VS 17.5.5

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| before the commit | 14.673 GB/s | 4.102 GB/s | 7.144 GB/s | 4.896 GB/s |
| after the commit | 15.503 GB/s | 4.205 GB/s | 7.608 GB/s | 5.108 GB/s |
macros version ======================================================================

"the commit" is 7761599,
current branch: sse_convert_latin1_to_utf8_perf
command: `benchmark -P convert_latin1_to_utf8+westmere -F *.latin1.txt`
arch: Sandy Bridge

======================================================================

windows 10

msvc VS 17.7.4

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 13.043 GB/s | 3.670 GB/s | 6.059 GB/s | 4.233 GB/s |
| current branch before the commit | 13.695 GB/s | 4.546 GB/s | 8.555 GB/s | 5.844 GB/s |
| current branch after the commit | 13.043 GB/s | 4.546 GB/s | 8.446 GB/s | 5.895 GB/s |

LLVM(clang-cl) 16.0.5

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.738 GB/s | 3.648 GB/s | 6.535 GB/s | 4.462 GB/s |
| current branch before the commit | 17.118 GB/s | 5.253 GB/s | 9.537 GB/s | 7.548 GB/s |
| current branch after the commit | 17.118 GB/s | 5.259 GB/s | 9.402 GB/s | 7.548 GB/s |

Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230627)

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.738 GB/s | 3.660 GB/s | 6.826 GB/s | 4.477 GB/s |
| current branch before the commit | 17.118 GB/s | 5.208 GB/s | 9.402 GB/s | 7.486 GB/s |
| current branch after the commit | 17.118 GB/s | 5.196 GB/s | 9.228 GB/s | 7.465 GB/s |

mingw-w64-ucrt-x86_64-gcc 13.1.0-7

build error.

======================================================================

wsl2 ubuntu 22.04

gcc 11.4.0

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 12.839 GB/s | 3.717 GB/s | 6.921 GB/s | 4.583 GB/s |
| current branch before the commit | 14.940 GB/s | 4.471 GB/s | 8.704 GB/s | 5.895 GB/s |
| current branch after the commit | 14.673 GB/s | 4.570 GB/s | 8.743 GB/s | 6.012 GB/s |

clang 14.0.0-1ubuntu1.1

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.104 GB/s | 3.624 GB/s | 6.622 GB/s | 4.376 GB/s |
| current branch before the commit | 14.673 GB/s | 4.929 GB/s | 9.492 GB/s | 7.022 GB/s |
| current branch after the commit | 14.415 GB/s | 4.924 GB/s | 9.492 GB/s | 7.040 GB/s |

Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230721)

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.256 GB/s | 3.743 GB/s | 6.780 GB/s | 4.590 GB/s |
| current branch before the commit | 14.167 GB/s | 5.110 GB/s | 9.676 GB/s | 7.445 GB/s |
| current branch after the commit | 14.167 GB/s | 5.092 GB/s | 9.771 GB/s | 7.486 GB/s |

I'm going to continue the investigation in a couple of days.
Plan:

- Try building it as a shared lib: I suspect that would keep MSVC from seeing the rest of the code, preventing some of its smart cross-module optimisations, so the results should be more stable.
- Try more compilers.
- Check how adding/removing other SSE implementations affects performance.
- Try unrolling other implementations as well.

For now, I suggest considering unrolling as unstable.

@aqrit commented Sep 20, 2023

Could you add some clarity?
The PR is for sse_convert_latin1_to_utf8, but you're discussing a performance regression in sse_convert_utf16_to_latin1?

This PR:
I suspect the branch for the 'ASCII fast path' will interfere with unrolling attempts.

The bottleneck is probably latency from the random lookups and the throughput of the shuffle port.
(I imagine there are 6 shuffles per 16 bytes of input.)

The CPU can have 20+ loads "in-flight" at once. The total number of loads issued per cycle is not that interesting for unrolling (we always have the same number of loads). It mostly matters for scheduling instructions at the assembly level, and for ballparking whether we're I/O bound (e.g. do we want to replace a lookup with a calculation, are we spilling and reloading registers, etc.).

@aspic-fish (Contributor, Author)

The tables contain results for sse_convert_latin1_to_utf8 only. But somehow commit 7761599 causes an sse_convert_latin1_to_utf8 regression.
I just check out the branch sse_convert_latin1_to_utf8_perf, rebuild everything, and run the benchmark; those results go in the "before the commit" row.
Then I do `git rebase 7761599262953df2a1d9c3427d0d27d1cb044615`, rebuild everything, and run the benchmark again; those results go in the "after the commit" row.

@aspic-fish (Contributor, Author)

@aqrit,

> The bottleneck is probably latency from the random lookups and the throughput of the shuffle port.
> (I imagine there are 6 shuffles per 16 bytes of input.)

According to uops.info, from Nehalem to Ivy Bridge there are 2 ports for shuffles.
And I use 2 `_mm_shuffle_epi8` per `_mm_load_si128` to split the input into 2 vectors; they are pipelined and should both go in 1 cycle.
And 2 more to pack the vectors into UTF-8 right after the lookup: +1 cycle per shuffle.
I don't think shuffle is a candidate for the bottleneck, but the LUT surely is. Google says an L1 cache hit takes 4 cycles in the best case.
There are 2 lookups per load, so at least 8 cycles, or many more in the case of a cache miss.

But `_mm_load_si128` is a bottleneck too, a smaller one; its latency is 6 cycles. My original thought was that it might not actually get pipelined automatically, in which case 2 sequential calls would give 32 bytes instead of 16 for the same latency.

> The CPU can have 20+ loads "in-flight" at once.

But x86-64 has only 16 SSE registers; how could it be 20+?
Could you provide keywords for googling?

PS: I'm actually new to all this SIMD stuff, so excuse me if I say something stupid :)

@lemire (Member) commented Sep 21, 2023

> x86-64 has only 16 SSE registers; how could it be 20+?

These are the named registers, but the CPU has many more physical registers.

You can examine the issue experimentally...
https://lemire.me/blog/2022/06/07/memory-level-parallelism-intel-ice-lake-versus-amazon-graviton-3/
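A minimal sketch of that kind of experiment (an illustration only, not code from simdutf or from the post): chase one dependent pointer chain versus several independent chains through a large array, and watch the cost per load drop as more independent loads are in flight.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

// Build a random single-cycle permutation (Sattolo's algorithm), so that
// repeatedly following next[i] is one long dependent chain of cache misses.
std::vector<uint32_t> make_cycle(size_t n, uint64_t seed) {
  std::vector<uint32_t> next(n);
  for (size_t i = 0; i < n; i++) next[i] = static_cast<uint32_t>(i);
  std::mt19937_64 rng(seed);
  for (size_t i = n - 1; i > 0; i--) std::swap(next[i], next[rng() % i]);
  return next;
}

// Chase `lanes` independent pointers at once: more lanes expose more
// memory-level parallelism, so throughput keeps improving well beyond
// what the 16 named registers would suggest.
double ns_per_load(const std::vector<uint32_t> &next, int lanes, size_t steps) {
  std::vector<uint32_t> p(lanes);
  for (int l = 0; l < lanes; l++) p[l] = static_cast<uint32_t>(l * (next.size() / lanes));
  auto t0 = std::chrono::steady_clock::now();
  for (size_t s = 0; s < steps; s++)
    for (int l = 0; l < lanes; l++) p[l] = next[p[l]];  // independent chains
  auto t1 = std::chrono::steady_clock::now();
  volatile uint32_t sink = 0;  // keep the chased values alive
  for (int l = 0; l < lanes; l++) sink = sink ^ p[l];
  (void)sink;
  return std::chrono::duration<double, std::nano>(t1 - t0).count() /
         static_cast<double>(steps * lanes);
}

int main() {
  auto next = make_cycle(size_t(1) << 24, 42);  // 16M entries, ~64 MiB
  for (int lanes : {1, 2, 4, 8, 16, 32})
    std::printf("%2d lanes: %5.1f ns/load\n", lanes,
                ns_per_load(next, lanes, size_t(1) << 20));
}
```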

@aqrit commented Sep 21, 2023

A Nehalem would not have a problem with the shuffles. However, a Haswell or Skylake might. On what CPU are you performing the benchmark?

The terms would be register renaming and out-of-order execution, I guess.

uops.info Code Analyzer is probably a good place to start, if we want to micro-optimize this.

@lemire (Member) commented Sep 21, 2023

@aqrit A fun one is this PR: #318

The westmere kernel (which is currently just scalar code, but subject to autovectorization) is faster than a reasonable hand-coded AVX2 routine. It is still fine because the differences are small... but it is clear that we could micro-optimize better.

@aspic-fish (Contributor, Author)

I think I roughly get it, but I struggle to draw parallels with the actual code. Fortunately, for this PR I'm not trying to tune this exact implementation; I'm mostly raising the MSVC issue, checking whether it's present in other compilers, and looking at how unrolling behaves across compilers.

I additionally checked clang and icx; neither has this issue, at least for this PR. But the results also show that unrolling is not necessarily consistent between compilers on the same OS: clang showed a performance degradation on both Windows and Ubuntu, unlike the rest. So, if we use unrolling, we have to decide how to handle such situations.

Tomorrow I'll try to find out whether building it separately helps with MSVC.
P.S. I got a build error with MinGW; should I open an issue?

@aspic-fish (Contributor, Author) commented Sep 24, 2023

It doesn't. But I found that clang can actually benefit from unrolling; just don't use inlined functions in the loop body. For some reason they drop performance a lot.

UPD: I replaced the inlined functions with macros, and now everything looks much better. The unroll factors are likely not the most optimal ones; there were some better ones during development. I'm going to write a script to brute-force them.

Commit: replace inline functions in loop body with macros (might also help with msvc)
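Roughly the shape of that change, as a hypothetical sketch (the macro name and the trivial copy body are made up; the real loop bodies do the latin1/utf8 conversion work): the 16-byte step is spelled as a macro, so the unrolled loop expands to one flat body instead of calls into an inline helper.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

// Hypothetical illustration of "macros instead of inline functions":
// each expansion is a flat block, so there is no helper-function call
// boundary for the compiler to mishandle when unrolling.
#define PROCESS_16_BYTES(src, dst)                                           \
  do {                                                                       \
    __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src)); \
    _mm_storeu_si128(reinterpret_cast<__m128i *>(dst), chunk);               \
    (src) += 16;                                                             \
    (dst) += 16;                                                             \
  } while (0)

void copy_unrolled4(const uint8_t *src, uint8_t *dst, size_t len) {
  const uint8_t *end = src + (len & ~static_cast<size_t>(63));
  while (src < end) {  // 4x unroll: 64 bytes per iteration
    PROCESS_16_BYTES(src, dst);
    PROCESS_16_BYTES(src, dst);
    PROCESS_16_BYTES(src, dst);
    PROCESS_16_BYTES(src, dst);
  }
  // the remaining < 64 bytes would go through a scalar tail (omitted)
}
```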
@aqrit commented Oct 1, 2023

> cache miss

If we're willing to do 4 lookups per 16 bytes of input, then we'd only use 2 cache lines for tables.
https://gist.github.com/aqrit/5c914da98006874d0401983eb687e30e

Note: I haven't actually studied the utf16->utf8 function, so I don't know what it is doing...
