Support cpuid-based selection of top-level functions #861
Replies: 11 comments
-
The problem with supporting instruction sets like BMI2 is that compression/decompression programs depend heavily on bit manipulation, so the time-critical functions are exactly the ones that would benefit most from these instructions. This means that making the code optional would most likely cancel out most of the speed gain from the faster bit manipulation instructions.
-
@mtl1979 That's why I'm suggesting compiling the top-level functions for it, and switching at that level, rather than trying to optimize a specific internal function with it.
-
@joshtriplett That would increase the library size by a factor of 6 or more...
-
@mtl1979 I don't think it would require 6 different variations, only variations that actually make a difference in practice. I tested every single CPU feature available on my Kaby Lake laptop, and the only one that made a difference like this was bmi2. (Also, if someone is looking to optimize for code size rather than performance, they could always disable this.)
-
@joshtriplett Basically, for 64-bit Intel you would need a version with just SSE2, plus versions with SSE3, SSSE3, SSE4, SSE4.2, and AVX2. For AMD processors it would get even more complicated.
-
I wonder which functions would make the most difference...
-
@nmoinvaz I would assume macros/code that do both shifting and masking...
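As a rough illustration of the "shifting and masking" pattern being discussed, here is a minimal sketch of an inflate-style bit buffer reader. The names `peek_bits`, `drop_bits`, and `read_field` are hypothetical and are not taken from zlib-ng. The variable-count shift and the mask are the kind of operations a compiler may lower to BMI2 instructions when given `-mbmi2`; whether it actually does depends on the compiler and version.

```c
#include <stdint.h>

/* Hypothetical helpers, modelled loosely on typical bit-buffer macros. */

/* Keep the low n bits of the buffer (n is clamped to 0..31). */
static inline uint32_t peek_bits(uint64_t hold, unsigned n) {
    n &= 31u;
    return (uint32_t)(hold & ((UINT64_C(1) << n) - 1));
}

/* Discard the low n bits of the buffer (n is clamped to 0..31). */
static inline uint64_t drop_bits(uint64_t hold, unsigned n) {
    return hold >> (n & 31u);
}

/* Read one n-bit field and advance the buffer: a shift plus a mask. */
uint32_t read_field(uint64_t *hold, unsigned n) {
    uint32_t value = peek_bits(*hold, n);
    *hold = drop_bits(*hold, n);
    return value;
}
```

A decoder calls something like `read_field` for nearly every symbol, which is why these few instructions end up on the hot path.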
-
I'm not suggesting changing the approach that zlib-ng currently uses for other instruction sets. I'm just suggesting this approach for bmi2, because its usage is so intertwined with other functions.
-
@joshtriplett I'm saying it doesn't make sense to make an exception for one small instruction set when it can already get enabled whenever a larger instruction set is enabled. BMI2 is quite a recent instruction set and is only supported on some recent Intel processors. Zen-based AMD processors claim to recognize the instructions, but actually emulate them in microcode instead of having real hardware support.
-
So I recently arrived at this same conclusion independently (I tried to narrow down why -march=native had such a measurable impact). I deduced that the thing that made the most difference was the shifting operations on the ALU that don't affect flags (e.g. SHLX, SARX). The reason for this is twofold:
-
We could theoretically extend the use of functable for deciding whether we want code compiled with BMI2 or not... It would make both the build system and the affected source files a little harder to read, and until I see real benchmarks of the final code, I can't really say if it's worth the trouble... We delayed adding support for AVX512 for quite a while due to a lack of hardware to test and benchmark the actual code, and it still seems that not all processors that support it benefit from using it.
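To make the functable idea concrete, here is a minimal sketch of a function table whose entry starts out as a stub and replaces itself with the best variant on first use. Everything here (`example_functable`, `low_bits_generic`, `low_bits_bmi2`, `low_bits_stub`) is hypothetical and is not zlib-ng's actual table layout. It assumes GCC or Clang on x86 for the `target("bmi2")` attribute and `__builtin_cpu_supports`, and falls back to the generic variant elsewhere.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#  define EXAMPLE_HAVE_BMI2 1
#else
#  define EXAMPLE_HAVE_BMI2 0
#endif

typedef uint32_t (*low_bits_fn)(uint64_t value, unsigned n);

/* Hypothetical table with a single entry; a real one would hold many. */
struct example_functable {
    low_bits_fn low_bits;
};

/* Baseline variant: keep the low n bits (n clamped to 0..31). */
uint32_t low_bits_generic(uint64_t value, unsigned n) {
    return (uint32_t)(value & ((UINT64_C(1) << (n & 31u)) - 1));
}

#if EXAMPLE_HAVE_BMI2
/* Same source, but the compiler is allowed to use BMI2 instructions. */
__attribute__((target("bmi2")))
uint32_t low_bits_bmi2(uint64_t value, unsigned n) {
    return (uint32_t)(value & ((UINT64_C(1) << (n & 31u)) - 1));
}
#endif

static uint32_t low_bits_stub(uint64_t value, unsigned n);

/* The table initially points at the stub. */
struct example_functable example_table = { low_bits_stub };

/* First call: pick a variant, patch the table, then forward the call.
 * Not thread-safe as written; a real table would need to handle that. */
static uint32_t low_bits_stub(uint64_t value, unsigned n) {
    low_bits_fn chosen = low_bits_generic;
#if EXAMPLE_HAVE_BMI2
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2"))
        chosen = low_bits_bmi2;
#endif
    example_table.low_bits = chosen;
    return chosen(value, n);
}
```

Callers always go through `example_table.low_bits(...)`, so after the first call the cost is one indirect call. The concern raised above still applies: an indirect call per tiny operation is expensive, which is an argument for switching at the level of larger functions.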
-
Optimizing the whole zlib-ng library for a more capable processor provides a noticeable speed increase. I tried various different flags, and managed to figure out that the primary benefit comes from allowing `-mbmi2`:

I think it'd be beneficial to compile multiple copies of some functions, such as `inflate.c` and `inflate_fast.c`, with different compiler options and a prefix on functions, and then use the cached CPUID result to determine which function to call. (Macros similar to `PREFIX` would work for this, applied to non-static functions and to calls to non-static functions.)

Based on extensive experimentation with compiler options, I'd suggest adding a copy compiled with `-mbmi2` (called when the corresponding CPUID feature exists); that's the only option that appears to give a substantial improvement.
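The proposal above could be sketched roughly as follows. This is only an illustration under several assumptions: the routine `sum_fields` and the macro `DEFINE_SUM_FIELDS` are hypothetical stand-ins for a larger top-level function, and the second copy is produced with the GCC/Clang `target("bmi2")` attribute inside one file, whereas the proposal describes compiling whole source files twice with different options and a function prefix.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#  define HAVE_BMI2_COPY 1
#  define ATTR_BMI2 __attribute__((target("bmi2")))
#else
#  define HAVE_BMI2_COPY 0
#endif
#define ATTR_NONE /* expands to nothing */

/* One shared body, stamped out under different names and attributes.
 * The loop reads `count` fields from `hold`, each widths[i] bits wide
 * (clamped to 0..31), and returns the sum of the field values. */
#define DEFINE_SUM_FIELDS(name, attrs)                                 \
    attrs uint32_t name(uint64_t hold, const uint8_t *widths,          \
                        size_t count) {                                \
        uint32_t sum = 0;                                              \
        for (size_t i = 0; i < count; i++) {                           \
            unsigned n = widths[i] & 31u;                              \
            sum += (uint32_t)(hold & ((UINT64_C(1) << n) - 1));        \
            hold >>= n;                                                \
        }                                                              \
        return sum;                                                    \
    }

DEFINE_SUM_FIELDS(generic_sum_fields, ATTR_NONE)
#if HAVE_BMI2_COPY
DEFINE_SUM_FIELDS(bmi2_sum_fields, ATTR_BMI2)
#endif

typedef uint32_t (*sum_fields_fn)(uint64_t, const uint8_t *, size_t);

/* Cached result of the CPU feature check; not thread-safe as written. */
static sum_fields_fn cached_sum_fields;

/* Public entry point: the switch happens once per top-level call,
 * not once per shift or mask inside the loop. */
uint32_t sum_fields(uint64_t hold, const uint8_t *widths, size_t count) {
    if (!cached_sum_fields) {
        cached_sum_fields = generic_sum_fields;
#if HAVE_BMI2_COPY
        __builtin_cpu_init();
        if (__builtin_cpu_supports("bmi2"))
            cached_sum_fields = bmi2_sum_fields;
#endif
    }
    return cached_sum_fields(hold, widths, count);
}
```

Because the selection sits outside the loop, the compiler is free to use BMI2 instructions throughout the BMI2 copy of the body, which is the point of switching at the top level rather than per helper.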