investigate loop unrolling #317

Draft · wants to merge 2 commits into base: master

Conversation

@aspic-fish (Contributor) commented Sep 20, 2023

Recently I learned that all Intel processors since Sandy Bridge can do two `_mm_loadu_si128` loads at the same time, using ports 2 and 3. So I tried 2 sequential `_mm_loadu_si128` calls per iteration and it was a success. Then I also tried 4 and 8: 4 gave me an additional boost, but 8 did not.
Alas, when I pulled the upstream commits, I got a significant performance penalty for the esperanto file with MSVC. So I dropped them and started re-adding them one by one, and I found the one that causes it:
7761599 SSE UTF16 => latin1 (#311)
There seems to be nothing special about it; it just added a new dependency together with 2 other SSE implementations.
I also checked with GCC and there was no penalty.
Could it be an MSVC bug?
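A minimal sketch of this kind of 2x load unrolling (illustrative only, not the PR's actual kernel; the function name and the trivial "all ASCII?" body are placeholders for the real conversion work):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

// Illustrative only: a 2x-unrolled scan issuing two independent
// _mm_loadu_si128 loads per iteration so they can dispatch on
// ports 2 and 3 in the same cycle.
bool is_ascii_unrolled2(const uint8_t *data, size_t len) {
  __m128i acc = _mm_setzero_si128();
  size_t i = 0;
  for (; i + 32 <= len; i += 32) {
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(data + i));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(data + i + 16));
    acc = _mm_or_si128(acc, _mm_or_si128(a, b));  // OR keeps any high bit seen
  }
  uint8_t tail = 0;
  for (; i < len; i++) { tail |= data[i]; }  // scalar tail
  return _mm_movemask_epi8(acc) == 0 && (tail & 0x80) == 0;
}
```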

inlined version ======================================================================

"the commit" is 7761599,
current branch: sse_convert_latin1_to_utf8_perf
command: `benchmark -P convert_latin1_to_utf8+westmere -F *.latin1.txt`
arch: Sandy Bridge

======================================================================

windows 10

msvc VS 17.5.5

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 12.641 GB/s | 3.714 GB/s | 5.845 GB/s | 4.246 GB/s |
| current branch before the commit | 16.111 GB/s | 4.638 GB/s | 8.743 GB/s | 5.972 GB/s |
| current branch after the commit | 12.450 GB/s | 4.522 GB/s | 8.237 GB/s | 5.869 GB/s |

msvc VS 17.7.4

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 13.043 GB/s | 3.670 GB/s | 6.059 GB/s | 4.233 GB/s |
| current branch before the commit | 13.927 GB/s | 4.546 GB/s | 8.518 GB/s | 5.895 GB/s |
| current branch after the commit | 12.450 GB/s | 4.570 GB/s | 8.667 GB/s | 5.986 GB/s |

LLVM(clang-cl) 16.0.5

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.738 GB/s | 3.648 GB/s | 6.535 GB/s | 4.462 GB/s |
| current branch before the commit | 6.178 GB/s | 3.015 GB/s | 4.625 GB/s | 3.648 GB/s |
| current branch after the commit | 6.178 GB/s | 3.019 GB/s | 4.657 GB/s | 3.657 GB/s |

Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230627)

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.738 GB/s | 3.660 GB/s | 6.826 GB/s | 4.477 GB/s |
| current branch before the commit | 16.111 GB/s | 4.986 GB/s | 9.144 GB/s | 6.693 GB/s |
| current branch after the commit | 16.111 GB/s | 4.980 GB/s | 9.102 GB/s | 6.710 GB/s |

mingw-w64-ucrt-x86_64-gcc 13.1.0-7

build error.

======================================================================

wsl2 ubuntu 22.04

gcc 11.4.0

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 12.839 GB/s | 3.717 GB/s | 6.921 GB/s | 4.583 GB/s |
| current branch before the commit | 14.673 GB/s | 4.384 GB/s | 8.704 GB/s | 5.920 GB/s |
| current branch after the commit | 14.673 GB/s | 4.389 GB/s | 8.667 GB/s | 5.920 GB/s |

clang 14.0.0-1ubuntu1.1

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.104 GB/s | 3.624 GB/s | 6.622 GB/s | 4.376 GB/s |
| current branch before the commit | 5.954 GB/s | 3.006 GB/s | 4.646 GB/s | 3.652 GB/s |
| current branch after the commit | 5.911 GB/s | 3.010 GB/s | 4.657 GB/s | 3.662 GB/s |

Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230721)

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.256 GB/s | 3.743 GB/s | 6.780 GB/s | 4.590 GB/s |
| current branch before the commit | 13.927 GB/s | 4.963 GB/s | 9.315 GB/s | 7.170 GB/s |
| current branch after the commit | 13.695 GB/s | 4.975 GB/s | 9.358 GB/s | 7.246 GB/s |

The situation got even funnier when I removed all the loops except this one, and got the opposite result. And it's quite consistent between benchmarks.
msvc VS 17.5.5

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| before the commit | 14.673 GB/s | 4.102 GB/s | 7.144 GB/s | 4.896 GB/s |
| after the commit | 15.503 GB/s | 4.205 GB/s | 7.608 GB/s | 5.108 GB/s |
macros version ======================================================================

"the commit" is 7761599,
current branch: sse_convert_latin1_to_utf8_perf
command: `benchmark -P convert_latin1_to_utf8+westmere -F *.latin1.txt`
arch: Sandy Bridge

======================================================================

windows 10

msvc VS 17.7.4

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 13.043 GB/s | 3.670 GB/s | 6.059 GB/s | 4.233 GB/s |
| current branch before the commit | 13.695 GB/s | 4.546 GB/s | 8.555 GB/s | 5.844 GB/s |
| current branch after the commit | 13.043 GB/s | 4.546 GB/s | 8.446 GB/s | 5.895 GB/s |

LLVM(clang-cl) 16.0.5

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.738 GB/s | 3.648 GB/s | 6.535 GB/s | 4.462 GB/s |
| current branch before the commit | 17.118 GB/s | 5.253 GB/s | 9.537 GB/s | 7.548 GB/s |
| current branch after the commit | 17.118 GB/s | 5.259 GB/s | 9.402 GB/s | 7.548 GB/s |

Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230627)

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.738 GB/s | 3.660 GB/s | 6.826 GB/s | 4.477 GB/s |
| current branch before the commit | 17.118 GB/s | 5.208 GB/s | 9.402 GB/s | 7.486 GB/s |
| current branch after the commit | 17.118 GB/s | 5.196 GB/s | 9.228 GB/s | 7.465 GB/s |

mingw-w64-ucrt-x86_64-gcc 13.1.0-7

build error.

======================================================================

wsl2 ubuntu 22.04

gcc 11.4.0

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 12.839 GB/s | 3.717 GB/s | 6.921 GB/s | 4.583 GB/s |
| current branch before the commit | 14.940 GB/s | 4.471 GB/s | 8.704 GB/s | 5.895 GB/s |
| current branch after the commit | 14.673 GB/s | 4.570 GB/s | 8.743 GB/s | 6.012 GB/s |

clang 14.0.0-1ubuntu1.1

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.104 GB/s | 3.624 GB/s | 6.622 GB/s | 4.376 GB/s |
| current branch before the commit | 14.673 GB/s | 4.929 GB/s | 9.492 GB/s | 7.022 GB/s |
| current branch after the commit | 14.415 GB/s | 4.924 GB/s | 9.492 GB/s | 7.040 GB/s |

Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230721)

| | esperanto | french | german | portuguese |
|---|---|---|---|---|
| master branch | 11.256 GB/s | 3.743 GB/s | 6.780 GB/s | 4.590 GB/s |
| current branch before the commit | 14.167 GB/s | 5.110 GB/s | 9.676 GB/s | 7.445 GB/s |
| current branch after the commit | 14.167 GB/s | 5.092 GB/s | 9.771 GB/s | 7.486 GB/s |

I'm going to continue the investigation in a couple of days.
Plan:

- Try building it as a shared lib: I suspect that would keep MSVC from seeing the rest of the code, preventing some of its smart cross-module optimisations, so the results should be more stable.
- Try more compilers.
- Check how adding/removing other SSE implementations affects performance.
- Try unrolling other implementations as well.

For now, I suggest considering unrolling as unstable.

@aqrit commented Sep 20, 2023

Could you add some clarity?
The PR is for sse_convert_latin1_to_utf8, but you're discussing a performance regression in sse_convert_utf16_to_latin1?

This PR:
I suspect the branch for the 'ASCII fast path' will interfere with unrolling attempts.

The bottleneck is probably latency from the random lookups and the throughput of the shuffle port.
(I imagine there are 6 shuffles per 16 bytes of input.)

The CPU can have 20+ loads "in-flight" at once. The total number of loads issued per cycle is not that interesting for unrolling (we always have the same number of loads). It mostly matters for scheduling instructions at the assembly level, and for ballparking whether we're I/O bound (e.g. do we want to replace a lookup with a calculation, are we spilling and reloading registers, etc.).

@aspic-fish (Contributor, Author)

The tables contain results for sse_convert_latin1_to_utf8 only. But somehow commit 7761599 causes an sse_convert_latin1_to_utf8 regression.
I just check out the branch sse_convert_latin1_to_utf8_perf, rebuild everything, and run the benchmark; those results go in the "before the commit" row.
Then I do `git rebase 7761599262953df2a1d9c3427d0d27d1cb044615`, rebuild everything, and run the benchmark again; those results go in the "after the commit" row.

@aspic-fish (Contributor, Author)

@aqrit,

> The bottleneck is probably latency from the random lookups and the throughput of the shuffle port.
> (I imagine there are 6 shuffles per 16 bytes of input.)

According to uops.info, from Nehalem to Ivy Bridge there are 2 ports for shuffles.
And I use 2 `_mm_shuffle_epi8` per `_mm_load_si128` to split the input into 2 vectors; they are pipelined and should both go in 1 cycle.
And 2 more to pack the vectors into UTF-8 right after the lookup: +1 cycle per shuffle.
I don't think shuffle is a candidate for the bottleneck, but the LUT surely is. Google says an L1 cache hit takes 4 cycles in the best case.
There are 2 lookups per load, so at least 8 cycles, or many more in the case of a cache miss.

But `_mm_load_si128` is a bottleneck too, a smaller one; its latency is 6 cycles. My original thought was that it might not actually get pipelined automatically, in which case 2 sequential calls would give 32 bytes instead of 16 for the same latency.

> The CPU can have 20+ loads "in-flight" at once.

But x86-64 has only 16 SSE registers; how could it be 20+?
Could you provide keywords for googling?

PS: I'm actually new to all this SIMD stuff, so excuse me if I say something stupid :)

@lemire (Member) commented Sep 21, 2023

> x86-64 has only 16 SSE registers; how could it be 20+?

These are the named registers, but the CPU has many more physical registers.

You can examine the issue experimentally...
https://lemire.me/blog/2022/06/07/memory-level-parallelism-intel-ice-lake-versus-amazon-graviton-3/
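A minimal sketch of that kind of experiment (an illustration only, not code from simdutf or from the post): chase one dependent pointer chain versus several independent chains through a large array, and watch the cost per load drop as more independent loads are in flight.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

// Build a random single-cycle permutation (Sattolo's algorithm), so that
// repeatedly following next[i] is one long dependent chain of cache misses.
std::vector<uint32_t> make_cycle(size_t n, uint64_t seed) {
  std::vector<uint32_t> next(n);
  for (size_t i = 0; i < n; i++) next[i] = static_cast<uint32_t>(i);
  std::mt19937_64 rng(seed);
  for (size_t i = n - 1; i > 0; i--) std::swap(next[i], next[rng() % i]);
  return next;
}

// Chase `lanes` independent pointers at once: more lanes expose more
// memory-level parallelism, so throughput keeps improving well beyond
// what the 16 named registers would suggest.
double ns_per_load(const std::vector<uint32_t> &next, int lanes, size_t steps) {
  std::vector<uint32_t> p(lanes);
  for (int l = 0; l < lanes; l++) p[l] = static_cast<uint32_t>(l * (next.size() / lanes));
  auto t0 = std::chrono::steady_clock::now();
  for (size_t s = 0; s < steps; s++)
    for (int l = 0; l < lanes; l++) p[l] = next[p[l]];  // independent chains
  auto t1 = std::chrono::steady_clock::now();
  volatile uint32_t sink = 0;  // keep the chased values alive
  for (int l = 0; l < lanes; l++) sink = sink ^ p[l];
  (void)sink;
  return std::chrono::duration<double, std::nano>(t1 - t0).count() /
         static_cast<double>(steps * lanes);
}

int main() {
  auto next = make_cycle(size_t(1) << 24, 42);  // 16M entries, ~64 MiB
  for (int lanes : {1, 2, 4, 8, 16, 32})
    std::printf("%2d lanes: %5.1f ns/load\n", lanes,
                ns_per_load(next, lanes, size_t(1) << 20));
}
```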

@aqrit commented Sep 21, 2023

A Nehalem would not have a problem with the shuffles. However, a Haswell or Skylake might. On what CPU are you performing the benchmark?

The terms would be register renaming and out-of-order execution, I guess.

uops.info Code Analyzer is probably a good place to start, if we want to micro-optimize this.

@lemire (Member) commented Sep 21, 2023

@aqrit A fun one is this PR: #318

The westmere kernel (which is currently just scalar code, but subject to autovectorization) is faster than a reasonable hand-coded AVX2 routine. It is still fine because the differences are small... but it is clear that we could micro-optimize better.

@aspic-fish (Contributor, Author)

I think I roughly get it, but I struggle to draw parallels with the actual code. Fortunately, for this PR I'm not trying to tune this exact implementation; I'm mostly raising the MSVC issue, checking whether it's present in other compilers, and looking at how unrolling behaves across compilers.

I additionally checked clang and icx; neither has this issue, at least for this PR. But the results also show that unrolling is not necessarily consistent between compilers on the same OS: clang showed a performance degradation on both Windows and Ubuntu, unlike the rest. So, if we use unrolling, we have to decide how to handle such situations.

Tomorrow I'll try to find out whether building it separately helps with MSVC.
P.S. I got a build error with MinGW; should I open an issue?

@aspic-fish (Contributor, Author) commented Sep 24, 2023

It doesn't. But I found that clang can actually benefit from unrolling; just don't use inlined functions in the loop body. For some reason they drop performance a lot.

UPD: I replaced the inlined functions with macros, and now everything looks much better. The unroll factors are likely not the most optimal ones; there were some better ones during development. I'm going to write a script to brute-force them.

Commit: replace inline functions in loop body with macros (might also help with msvc)
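Roughly the shape of that change, as a hypothetical sketch (the macro name and the trivial copy body are made up; the real loop bodies do the latin1/utf8 conversion work): the 16-byte step is spelled as a macro, so the unrolled loop expands to one flat body instead of calls into an inline helper.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

// Hypothetical illustration of "macros instead of inline functions":
// each expansion is a flat block, so there is no helper-function call
// boundary for the compiler to mishandle when unrolling.
#define PROCESS_16_BYTES(src, dst)                                           \
  do {                                                                       \
    __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src)); \
    _mm_storeu_si128(reinterpret_cast<__m128i *>(dst), chunk);               \
    (src) += 16;                                                             \
    (dst) += 16;                                                             \
  } while (0)

void copy_unrolled4(const uint8_t *src, uint8_t *dst, size_t len) {
  const uint8_t *end = src + (len & ~static_cast<size_t>(63));
  while (src < end) {  // 4x unroll: 64 bytes per iteration
    PROCESS_16_BYTES(src, dst);
    PROCESS_16_BYTES(src, dst);
    PROCESS_16_BYTES(src, dst);
    PROCESS_16_BYTES(src, dst);
  }
  // the remaining < 64 bytes would go through a scalar tail (omitted)
}
```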
@aqrit commented Oct 1, 2023

> cache miss

If we're willing to do 4 lookups per 16 bytes of input, then we'd only use 2 cache lines for tables.
https://gist.github.com/aqrit/5c914da98006874d0401983eb687e30e

Note: I haven't actually studied the utf16->utf8 function, so I don't know what it is doing...
