Cut down the LUT size by 2 KB #234

Draft: easyaspi314 wants to merge 2 commits into master
Conversation

@easyaspi314 (Contributor) commented May 3, 2023

This encodes the length in the last element of shufutf8, similar to how it is done in the UTF-16 table, but it is further minimized by storing 0xF0 | length, since the last byte will always be zero.

This optimization could probably be applied to the UTF-16 tables as well, saving another 512 bytes, but I'm nervous about the mask == 0 case.
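
A minimal sketch of the scheme, with illustrative names and shapes (the real simdutf tables and indexing differ). The key point is that 0xF0 has its high bit set, so _mm_shuffle_epi8 writes zero to that output lane and the shuffled result is unaffected by the repurposed byte:

```cpp
#include <cstdint>
#include <immintrin.h>

// Hypothetical table shape: 16-byte pshufb patterns, with the length
// of the consumed input packed into the otherwise-unused last byte.
extern const uint8_t shufutf8[][16];

inline __m128i shuffle_and_length(uint32_t idx, __m128i in, int& consumed) {
  const uint8_t* entry = shufutf8[idx];
  // The low nibble of the last byte holds the number of input bytes consumed.
  consumed = entry[15] & 0x0F;
  __m128i pattern = _mm_loadu_si128(reinterpret_cast<const __m128i*>(entry));
  // Indices with the high bit set (like 0xF0 | length) shuffle in a zero,
  // so lane 15 of the output stays 0 as before.
  return _mm_shuffle_epi8(in, pattern);
}
```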

@lemire (Member) commented May 3, 2023

This PR increases the length of the data dependency chain, so it risks lowering the instructions-per-cycle performance.

Let us try it out. Build the library and benchmarks into the build subdirectory and grab the data files from https://github.com/lemire/unicode_lipsum (putting unicode_lipsum directly as a subdirectory).

Current main branch:

./build/benchmarks/benchmark -P convert_utf8_to_utf16+westmere -F unicode_lipsum/lipsum/*.utf8.txt | grep GB
   2.252 GB/s (3.6 %)    1.262 Gc/s     1.78 byte/char
   3.196 GB/s (2.3 %)    1.074 Gc/s     2.98 byte/char
   1.613 GB/s (2.6 %)    0.403 Gc/s     4.00 byte/char
   2.249 GB/s (3.0 %)    1.262 Gc/s     1.78 byte/char
   1.914 GB/s (4.1 %)    0.713 Gc/s     2.69 byte/char
   3.095 GB/s (3.6 %)    1.067 Gc/s     2.90 byte/char
   2.130 GB/s (3.2 %)    0.868 Gc/s     2.45 byte/char
  21.916 GB/s (6.8 %)   21.916 Gc/s     1.00 byte/char
   2.261 GB/s (4.5 %)    1.251 Gc/s     1.81 byte/char

Your PR:

$ ./build/benchmarks/benchmark -P convert_utf8_to_utf16+westmere -F unicode_lipsum/lipsum/*.utf8.txt |grep GB
   1.834 GB/s (3.8 %)    1.028 Gc/s     1.78 byte/char
   2.995 GB/s (4.7 %)    1.006 Gc/s     2.98 byte/char
   1.617 GB/s (3.2 %)    0.404 Gc/s     4.00 byte/char
   1.828 GB/s (2.7 %)    1.025 Gc/s     1.78 byte/char
   1.764 GB/s (4.6 %)    0.657 Gc/s     2.69 byte/char
   2.814 GB/s (3.9 %)    0.970 Gc/s     2.90 byte/char
   1.762 GB/s (4.1 %)    0.718 Gc/s     2.45 byte/char
  22.801 GB/s (6.5 %)   22.801 Gc/s     1.00 byte/char
   1.885 GB/s (2.8 %)    1.043 Gc/s     1.81 byte/char

As you'd expect, the performance penalty does not strike everywhere. From first principles, you'd expect it to show up mostly in "UTF-8 rich" inputs (where the size of the read offset is hard to predict). And indeed, that is what we see in these results: there is a 20% performance regression in several cases.

We do want small tables, but ideally they should not create longer dependency chains. The "read offset" is on a critical path: we want it as soon as possible, upfront, not in the latter part of the process. E.g., it should be immediately available from the mask and not require an intermediate access.
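
To make the critical-path point concrete, here is a hedged sketch with hypothetical table names and shapes (not the actual simdutf identifiers):

```cpp
#include <cstdint>

extern const uint16_t bigindex[4096][2];  // hypothetical: mask -> {shuffle index, size}
extern const uint8_t  shufutf8[1024][16]; // hypothetical shuffle table

// Before: the read offset comes straight from the mask-indexed lookup,
// so the next iteration's pointer update waits on a single load.
inline int consumed_before(uint32_t mask) {
  return bigindex[mask][1];
}

// After: the read offset is only known once a second, dependent load
// (the shuffle entry itself) completes, lengthening the chain that
// every loop iteration must wait on.
inline int consumed_after(uint32_t mask) {
  uint32_t idx = bigindex[mask][0];
  return shufutf8[idx][15] & 0x0F;
}
```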

@easyaspi314 (Contributor, Author) commented May 3, 2023

Hmm. Well, I did some tests, and apparently, with the exception of mask == 0x888 (which lands in the rarer 4-byte path), the first and second halves of the size LUT are the same.

So that saves 2 KB without creating a data dependency, at the cost of a single mask and a branch on a rarely taken path. (One downside is slightly worse cache locality.)

For a more extreme solution, packing nibbles is an option, but that requires extra branches.
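
A hedged sketch of the half-table idea described above (the names and the folded layout are placeholders):

```cpp
#include <cstdint>

extern const uint8_t half_sizes[2048];  // former 4096-entry size LUT, halved
extern const uint8_t size_for_0x888;    // the single entry that differs

inline int lookup_size(uint32_t mask) {
  // mask == 0x888 is the only index where the two halves disagree, and
  // it lands on the rarer 4-byte path, so a branch there is cheap.
  if (mask == 0x888)
    return size_for_0x888;
  // Fold the top half of the table onto the bottom half.
  return half_sizes[mask & 0x7FF];
}
```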

@lemire (Member) commented May 3, 2023

@easyaspi314 Sure!!! Saving 2 kB would be huge if it is performance neutral. Can you try it out?

I recommend you run benchmarks too.

@easyaspi314 (Contributor, Author) commented May 3, 2023

Well, an initial test on an ARM Cortex-X1 (yes, it is my phone) with Clang 16 shows about 5% overhead, and it can probably be tuned to be less.

I can get you exact benchmarks and x86 benchmarks once I clean it up.

Also, perhaps we could have a build option that uses either the last byte of the shuffle mask or the dedicated LUT, since there is no harm in leaving the length in the table. People who compile with MinSizeRel would much rather save the 2 KB than keep the 10-20% performance.
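
One hypothetical shape for such a toggle (the macro name is made up to illustrate the proposal; it is not an existing simdutf build option):

```cpp
#include <cstdint>

extern const uint8_t sizes[4096];  // dedicated length LUT (speed-optimized path)

inline int consumed_bytes(uint32_t mask, const uint8_t* entry) {
#ifdef SIMDUTF_PREFER_SMALL_TABLES
  (void)mask;
  // Size-optimized: read the length from the shuffle entry itself, so
  // the dedicated sizes[] table can be dropped from the binary.
  return entry[15] & 0x0F;
#else
  (void)entry;
  // Speed-optimized: the length is available directly off the mask.
  return sizes[mask];
#endif
}
```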

@lemire (Member) commented May 3, 2023

We care a lot about ARM.

> Well, an initial test on an ARM Cortex-X1 (yes, it is my phone) with Clang 16 shows about 5% overhead, and it can probably be tuned to be less.

We really want performance neutrality. It is damn easy to sacrifice performance but difficult to get it back.

Measuring 5% is difficult. That being said, we are not bound by instruction counts on this path; we are terribly limited by data dependencies. Even a few cycles of extra latency in this code matter a whole lot.

> Also, perhaps we could have a build option that uses either the last byte of the shuffle mask or the dedicated LUT, since there is no harm in leaving the length in the table. People who compile with MinSizeRel would much rather save the 2 KB than keep the 10-20% performance.

[screenshot from 2023-05-03]

4 kB represents ~1.3% of the library size. So you are saving ~1% in size for a ~20% drop in speed on some benchmarks. Is that good?

@clausecker (Collaborator) commented May 3, 2023

These 5% are more than compensated for by the reduced cache pressure. Remember, in real-world applications, transcoding strings is only a small part of what the application does. Not trashing 4 kB of L1 cache is definitely worth a 5% perf loss, in my opinion.

Given that most of the library is never accessed on any particular machine (as we have multiple kernels for different ISA levels), and given that this affects the D$ rather than the I$ (i.e., the more valuable resource, of which our library otherwise consumes very little), I believe this saving to be very significant, especially in applications that only call this one routine.

Second commit: This version only cuts half of the table, avoiding the need for a dependency on the index.
@lemire (Member) commented May 4, 2023

Running tests.

@lemire (Member) commented May 4, 2023

> is definitely worth a 5% perf loss

Glancing at the code, I am not sure that this should cause a 5% perf loss. It is fairly difficult to measure a 5% difference, as we all know.

@easyaspi314 (Contributor, Author) commented:

Cortex-X1, original, clang 16.0.2

   1.780 GB/s (3.2 %)    0.997 Gc/s     1.78 byte/char 
   2.899 GB/s (2.2 %)    0.974 Gc/s     2.98 byte/char 
   1.226 GB/s (4.7 %)    0.306 Gc/s     4.00 byte/char 
   1.774 GB/s (6.5 %)    0.995 Gc/s     1.78 byte/char 
   1.665 GB/s (3.2 %)    0.620 Gc/s     2.69 byte/char 
   2.521 GB/s (3.5 %)    0.869 Gc/s     2.90 byte/char 
   1.627 GB/s (2.8 %)    0.663 Gc/s     2.45 byte/char 
  24.559 GB/s (19.4 %)   24.559 Gc/s     1.00 byte/char - jittery af
   1.796 GB/s (5.2 %)    0.994 Gc/s     1.81 byte/char 

Half LUT: there seem to be some gains and losses, but that may just be jitter. As expected, the path with the branch (I was wrong earlier: it is the 2-byte path) is a tiny bit slower.

   1.635 GB/s (2.5 %)    0.916 Gc/s     1.78 byte/char -> -8%
   3.333 GB/s (4.6 %)    1.120 Gc/s     2.98 byte/char -> +13%
   1.314 GB/s (5.0 %)    0.328 Gc/s     4.00 byte/char -> +6%
   1.620 GB/s (2.9 %)    0.909 Gc/s     1.78 byte/char -> -9%
   1.711 GB/s (2.4 %)    0.637 Gc/s     2.69 byte/char -> +3%
   2.654 GB/s (1.1 %)    0.915 Gc/s     2.90 byte/char -> +5%
   1.581 GB/s (1.8 %)    0.645 Gc/s     2.45 byte/char -> -5%
  23.485 GB/s (10.0 %)   23.485 Gc/s     1.00 byte/char -> jittery af
   1.635 GB/s (1.1 %)    0.905 Gc/s     1.81 byte/char -> -6%

Will edit with SSE4 and AVX2 benches soon

@lemire (Member) commented May 4, 2023

Here are my results (compare with the above) on this PR...

Before...

./build/benchmarks/benchmark -P convert_utf8_to_utf16+westmere -F unicode_lipsum/lipsum/*.utf8.txt | grep GB
   2.252 GB/s (3.6 %)    1.262 Gc/s     1.78 byte/char
   3.196 GB/s (2.3 %)    1.074 Gc/s     2.98 byte/char
   1.613 GB/s (2.6 %)    0.403 Gc/s     4.00 byte/char
   2.249 GB/s (3.0 %)    1.262 Gc/s     1.78 byte/char
   1.914 GB/s (4.1 %)    0.713 Gc/s     2.69 byte/char
   3.095 GB/s (3.6 %)    1.067 Gc/s     2.90 byte/char
   2.130 GB/s (3.2 %)    0.868 Gc/s     2.45 byte/char
  21.916 GB/s (6.8 %)   21.916 Gc/s     1.00 byte/char
   2.261 GB/s (4.5 %)    1.251 Gc/s     1.81 byte/char

After:

$ ./build/benchmarks/benchmark -P convert_utf8_to_utf16+westmere -F unicode_lipsum/lipsum/*.utf8.txt |grep GB
   2.073 GB/s (3.4 %)    1.162 Gc/s     1.78 byte/char
   2.969 GB/s (5.5 %)    0.997 Gc/s     2.98 byte/char
   1.585 GB/s (3.2 %)    0.396 Gc/s     4.00 byte/char
   2.066 GB/s (3.3 %)    1.159 Gc/s     1.78 byte/char
   1.873 GB/s (3.0 %)    0.698 Gc/s     2.69 byte/char
   2.948 GB/s (3.7 %)    1.016 Gc/s     2.90 byte/char
   2.139 GB/s (2.7 %)    0.872 Gc/s     2.45 byte/char
  22.419 GB/s (5.4 %)   22.419 Gc/s     1.00 byte/char
   2.071 GB/s (2.3 %)    1.146 Gc/s     1.81 byte/char

Binary size, before...

$ ls -al build/src/*.a
-rw-r--r--  1 lemire  staff  382216  3 May 12:12 build/src/libsimdutf.a

After...

$ ls -al build/src/*.a
-rw-r--r--  1 lemire  staff  381736  3 May 21:55 build/src/libsimdutf.a

So this PR appears to shave 480 bytes from the library on my machine right now. (So 0.1%.)

@easyaspi314 (Contributor, Author) commented May 4, 2023

> So this PR appears to shave 480 bytes from the library on my machine right now. (So 0.1%.)

The .a file isn't really indicative of the actual binary size; it is recommended to run size build/src/CMakeFiles/simdutf.dir/simdutf.cpp.o, which counts the actual code and data size instead of including things like symbol tables (although parts of it will be cut out at link time due to -ffunction-sections).

However, it is true that the majority of the library is code that will be stripped out by the linker; still, as long as the utf8-to-utf16/32 code is included, there will be a size reduction.

I'm having trouble getting a stable benchmark on my laptop (in one scenario, SSE4 was faster than AVX2 :trollface:); I might have to do a clean Linux boot with turbo boost and power management off. Windows doesn't keep a stable frequency.

Edit: also, on a crappy Cortex-A53 tablet with GCC 11 there is basically no difference, but the A53 is more about single-instruction latency than dependency chains, since it is strictly in-order.

@easyaspi314 changed the title from "Cut down the LUT size by 4 KB" to "Cut down the LUT size by 2 KB" on May 4, 2023
@lemire (Member) commented May 4, 2023

My concern is that, on my naive view, your PR should be performance neutral... but it seems that it is not: I see a measurable impact (up to 10%). So I suggest that micro-optimization is in order, or at least an analysis.

We should understand why it would make things measurably slower.

@easyaspi314 (Contributor, Author) commented May 4, 2023

I think there is a micro-optimization issue; GCC on x64 seems to get wildly different performance depending on where the branches are.

@easyaspi314 marked this pull request as a draft on May 9, 2023 at 00:52
@easyaspi314 (Contributor, Author) commented:

Drafting this for now, as I want to improve the intrinsics first before touching the table so I can get a better performance analysis.
