Cut down the LUT size by 2 KB #234
base: master
Conversation
This encodes the length in the last element of shufutf8, similar to how it is done in the UTF-16 table, but minimized further to 0xF0 | length, since the last byte is always zero anyway.
This PR lengthens the data-dependency chain, so it risks lowering instructions-per-cycle performance. Let us try it out. Build the library and benchmarks into the build subdirectory and grab the data files from https://github.com/lemire/unicode_lipsum (placing unicode_lipsum directly as a subdirectory). Current main branch:
Your PR:
As you'd expect, the performance impact does not show up everywhere. From first principles, you'd expect it mostly in "UTF-8 rich" inputs (where the size of the read offset is hard to predict). And indeed, that is what we see in these results: there is a 20% performance regression in several cases. We do want small tables, but ideally, they should not create longer dependency chains. The "read offset" is on a critical path. We want it as soon as possible, upfront, not in the latter part of the process. E.g., it should be immediately available from the mask, without an intermediate access.
Hmm. Well, I did some tests, and apparently, with the exception of […], that saves 2 KB without creating a data dependency, at the cost of a single mask and a branch on a rarely taken path. (Although one downside is slightly worse cache locality.) For a more extreme solution, packing nibbles is an option, but that has extra branches.
@easyaspi314 Sure!!! Saving 2 kB if it is performance neutral would be huge. Can you try it out? I recommend you run benchmarks too.
Well, an initial test on an ARM Cortex-X1 (yes, it is my phone) + Clang 16 shows about 5% overhead, and it can probably be tuned to be less. I can get you exact benchmarks and x86 benchmarks once I clean it up. Also, perhaps we could have a build option that uses either the last byte in the shuffle mask or the dedicated LUT, since there is no harm in leaving the length in the table. People who compile with MinSizeRel would much rather take 2 KB over 10-20% performance.
We care a lot about ARM.
We really want performance neutrality. It is damn easy to sacrifice performance but difficult to get it back. Measuring 5% is difficult. This being said, we are not bound by instruction counts in this path, but we are terribly limited by data dependency. Even just a few cycles of extra latency in this code matters a whole lot.
4 kB represents ~1.3% of the library size. So you are saving ~1% in size for a ~20% drop in speed for some benchmarks. Is that good?
These 5% are more than compensated for by reducing cache pressure. Remember, in real-world applications, transcoding strings is only a small part of what an application does. Not thrashing 4 kB of L1 cache is definitely worth a 5% perf loss in my opinion. Given that most of the library is never accessed on any particular machine (as we have multiple kernels for different ISA levels) and given that this affects D$ instead of I$ (i.e. the more valuable resource, of which our library otherwise consumes very little), I believe this saving to be very significant. Especially in applications that only call this one routine.
This version only cuts half of the table, preventing the need for a dependency on the index.
Running tests.
Glancing at the code, I am not sure that this should cause a 5% perf loss. It is fairly difficult to measure a 5% difference, as we all know.
Cortex-X1, original, clang 16.0.2
Half LUT: There seem to be some gains and losses, but this could also be jitter. As expected, the path with the branch (I was wrong earlier: it is the 2-byte path) is a tiny bit slower.
Will edit with SSE4 and AVX2 benches soon.
Here are my results (compare with above) on this PR. Before:
After:
Binary size, before...
After...
So this PR appears to shave 480 bytes from the library on my machine right now. (So 0.1%.)
However, it is true that the majority of the library is code that will be stripped out by the linker, but as long as the UTF-8 to UTF-16/32 code is included, there will be a size reduction. I'm having trouble getting a stable benchmark on my laptop (in one scenario SSE4 was faster than AVX2); I might have to do a clean Linux boot with turbo boost and power management off. Windows doesn't keep a stable frequency. Edit: Also, on a crappy Cortex-A53 tablet with GCC 11 there is basically no difference, but the A53 is more about single-instruction latency than dependency chains since it is strictly in-order.
My concern is that according to my naive view, your PR should be performance neutral... but it seems that it is not. I see a measurable impact (up to 10%). So I suggest that microoptimization is in order, or, at least, an analysis. We should understand why it would make things measurably slower. |
I think that there is a microoptimization issue, GCC x64 seems to get wildly different performance depending on where the branches are. |
Drafting this for now, as I want to improve the intrinsics first before touching the table so I can get a better performance analysis. |
This optimization could probably be applied to the UTF-16 tables as well, saving 512 more bytes, but I'm nervous about the mask == 0 case.