investigate loop unrolling #317
Could you add some clarity? This PR : The bottleneck is probably latency from the random lookups and the throughput of the shuffle port. The CPU can have 20+ loads "in-flight" at once. The total number of loads issued per cycle is not that interesting for unrolling (we always have the same number of loads). It mostly matters for scheduling instructions at the assembly level, and for ballparking whether we're IO bound (e.g. do we want to replace a lookup with a calculation, are we spilling and reloading registers, etc.). |
Tables contain results for |
according to uops from But
but PS: I'm actually new to all this SIMD stuff, so excuse me if I say something stupid) |
These are named registers, but we have many more registers. You can examine the issue experimentally... |
A The terms would be register renaming and out-of-order execution, I guess. uops.info Code Analyzer is probably a good place to start, if we want to micro-optimize this. |
I think I get it roughly. But I struggle to draw parallels with actual code in general. But fortunately, for this PR I'm not into tuning this exact implementation but rather introducing the I additionally checked Tomorrow I'll try to find out whether building separately helps with |
It doesn't. But I found that UPD: I replaced inlined functions with macros. Now everything looks much better. Unroll factors are likely not the most optimal, there were some better ones during development. I'm gonna write a script to brute force them. |
Replacing inline functions in the loop body with macros might also help with msvc.
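To make the "macros instead of inline functions" idea concrete, here is a minimal sketch of a loop body unrolled by a factor of 2, with each 16-byte step written as a macro. It is not the kernel from this PR: the function name `utf8_length_from_latin1_sketch` and the `SKETCH_*` macros are hypothetical, and the step only counts non-ASCII bytes (each of which needs two bytes in UTF-8) rather than performing the conversion. A scalar fallback is included as an assumption, for builds without SSE2.

```cpp
#include <bitset>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

#if SKETCH_HAS_SSE2
// One 16-byte step as a macro: the body is pasted into the loop textually,
// so it does not depend on the compiler's inlining decisions.
// The movemask has one bit per byte whose high bit is set (byte >= 0x80).
#define SKETCH_COUNT_STEP(ptr, acc)                                               \
  do {                                                                            \
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(ptr));    \
    (acc) += std::bitset<16>(static_cast<unsigned>(_mm_movemask_epi8(v))).count(); \
  } while (0)
#endif

// Hypothetical helper: number of UTF-8 bytes needed for a Latin-1 buffer.
inline std::size_t utf8_length_from_latin1_sketch(const char *buf, std::size_t len) {
  std::size_t extra = 0;  // bytes >= 0x80 each need one extra output byte
  std::size_t i = 0;
#if SKETCH_HAS_SSE2
  // Unroll factor 2: two independent unaligned loads per iteration.
  for (; i + 32 <= len; i += 32) {
    SKETCH_COUNT_STEP(buf + i, extra);
    SKETCH_COUNT_STEP(buf + i + 16, extra);
  }
  // Leftover single 16-byte block, if any.
  for (; i + 16 <= len; i += 16) {
    SKETCH_COUNT_STEP(buf + i, extra);
  }
#endif
  // Scalar tail (and the whole buffer when SSE2 is unavailable).
  for (; i < len; ++i) {
    extra += static_cast<unsigned char>(buf[i]) >> 7;
  }
  return len + extra;
}
```

Changing the unroll factor means adding or removing `SKETCH_COUNT_STEP` lines and adjusting the stride, which is also what a brute-force script over unroll factors would have to generate. Whether a given factor helps depends on the compiler and the CPU, so it needs to be measured.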
If we're willing to do 4 lookups per 16 bytes of input, then we'd only use 2 cache lines for tables. Note: I haven't actually studied the utf16->utf8 function, so I don't know what that is doing... |
Recently I've learned that all Intel processors since Sandy Bridge can do two `_mm_loadu_si128` at the same time on ports 2 and 3. So I tried 2 sequential `_mm_loadu_si128` and it was a success. Then I also tried 4 and 8. 4 gave me an additional boost, but not 8.

Alas, when I pulled upstream commits, I got a significant performance penalty for the `esperanto` file with `msvc`. So I dropped them and started adding them one by one. And I found which one causes it.

7761599 SSE UTF16 => latin1 (#311)

It seems there's nothing special here. It just added a new dependency with 2 other `sse` implementations.

So I also checked with `gcc` and there was no penalty.

Could it be a `msvc` bug?

inlined version
======================================================================
"the commit" is 7761599,
current branch is sse_convert_latin1_to_utf8_perf
command
benchmark -P convert_latin1_to_utf8+westmere -F *.latin1.txt
arch:
Sandy Bridge
======================================================================
windows 10
msvc VS 17.5.5
msvc VS 17.7.4
LLVM(clang-cl) 16.0.5
Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230627)
mingw-w64-ucrt-x86_64-gcc 13.1.0-7
build error.
======================================================================
wsl2 ubuntu 22.04
gcc 11.4.0
clang 14.0.0-1ubuntu1.1
Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230721)
The situation got even funnier when I removed all the loops except this one, and got the opposite result. And it's quite consistent between benchmarks.
msvc VS 17.5.5
macros version
======================================================================
"the commit" is 7761599,
current branch is sse_convert_latin1_to_utf8_perf
command
benchmark -P convert_latin1_to_utf8+westmere -F *.latin1.txt
arch:
Sandy Bridge
======================================================================
windows 10
msvc VS 17.7.4
LLVM(clang-cl) 16.0.5
Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230627)
mingw-w64-ucrt-x86_64-gcc 13.1.0-7
build error.
======================================================================
wsl2 ubuntu 22.04
gcc 11.4.0
clang 14.0.0-1ubuntu1.1
Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230721)
I'm going to continue the investigation in a couple of days.
plan:
*
I suspect that building it as a shared lib might help, as it would prevent `msvc` from accessing the rest of the code. Supposedly, that wouldn't allow it to perform some smart optimisations, and thus results should be more stable.
`sse` implementations affects performance

For now, I suggest considering unrolling as unstable.