
Add AVX2/AVX/SSE2 SIMD accelerated 1D/3D LUTS #1687

Merged

Conversation

markreidvfx
Contributor

I'm still messing around with this but wanted to share a work in progress for some feedback.
This is based on work I've done with 3D LUTs here (#1681), and most of the code is ported from that project.

Here are some of the current performance results

ocioperf.exe --transform tests/data/files/clf/lut1d_32f_example.clf

[lut1d benchmark chart]

ocioperf.exe --transform tests/data/files/clf/lut3d_preview_tier_test.clf

[lut3d benchmark chart]

Supporting additional x86 SIMD instruction sets adds more complexity to the build system. Some of the following things need to be considered:

  • Is the instruction set enabled in the build?
  • Is the instruction set supported by the compiler, and what flags turn it on?
  • Is the instruction set supported by the CPU?

The really tricky bit is that the SIMD instruction sets a CPU supports vary between models and brands. If a CPU encounters an instruction it doesn't support, the program will simply crash. So you can't just turn on the AVX/AVX2 compiler flags for the whole build if you want to run on a wide variety of systems. Instead, each implementation is a separate compilation unit, and the compiler flags are only applied to that unit.
The cpuid instruction can be used at runtime to determine which instructions your CPU supports, and the best implementation is then chosen; this is currently done with a function pointer.
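
A minimal sketch of that runtime-dispatch idea (this is not the PR's actual CPUInfo code; it assumes GCC/Clang's __builtin_cpu_supports for feature detection, and the apply_lut_* kernels are hypothetical names):

#include <cstddef>

// Each of these would live in its own translation unit, compiled with the
// matching -mavx2 / -msse2 flags, so the rest of the library stays generic.
void apply_lut_avx2(const float * in, float * out, std::size_t count);   // hypothetical
void apply_lut_sse2(const float * in, float * out, std::size_t count);   // hypothetical
void apply_lut_scalar(const float * in, float * out, std::size_t count); // hypothetical fallback

using ApplyFunc = void (*)(const float *, float *, std::size_t);

// Query the CPU once and cache the best implementation in a function pointer.
ApplyFunc chooseApplyFunc()
{
#if defined(__GNUC__) || defined(__clang__)
    if (__builtin_cpu_supports("avx2")) return apply_lut_avx2;
    if (__builtin_cpu_supports("sse2")) return apply_lut_sse2;
#endif
    return apply_lut_scalar;
}

static const ApplyFunc g_applyLut = chooseApplyFunc();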

Some of the things I'm currently thinking of doing:

  • Cleanup and move CMake SIMD detection to separate cmake files
  • Put CPUInfo class in platform.cpp?
  • Change the file naming, was thinking something like this
    • ops/lut3d/x86/avx2.cpp
    • ops/lut3d/x86/avx.cpp
    • ops/lut3d/x86/sse2.cpp

The pull request is also very large, so I was thinking I might break it into separate smaller requests:
perhaps one for the AVX2/AVX build support/tests, one for lut3d, and one for lut1d.


@doug-walker
Collaborator

@markreidvfx, it's going to take some time for me to do a proper review, but I just wanted to say thank you so much for this PR! I especially appreciate the comments explaining the packing being used with the intrinsics and the many unit tests. It's really great to have you contributing to the project!

I think the naming is fine as is (leaving lut1d or lut3d in the module names) and it's fine to leave it in one big PR, if that's easiest.

@markreidvfx
Contributor Author

markreidvfx commented Sep 18, 2022

Thanks @doug-walker :) I tried to add a lot of tests for the packing/unpacking because it can be tricky to get right, and I'm hoping the infrastructure could be used for adding SIMD acceleration to other ops in the future.

[Resolved review thread on CMakeLists.txt]

#include "CPUInfo.h"

typedef void (Lut1DOpCPUApplyFunc)(const float *, const float *, const float *, int, const void *, void *, long);
Contributor


You mentioned that you were using a function pointer; just wondering if you thought of using std::function?
I think it is fair to use a function pointer here (for speed). The overhead from std::function might be unnecessary.

Contributor Author


std::function isn't something I've used before, so I'm not sure what the benefit would be? I typically avoid fancy C++ features when focusing on performance, haha. I'll take a look.
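
For illustration only (not code from this PR, and the apply_sse2 kernel name is made up), the two options being discussed look roughly like this; the std::function version adds type erasure and extra indirection on each call:

#include <functional>

void apply_sse2(const float * in, float * out, long count); // hypothetical kernel

// Raw function pointer: a single indirect call, no allocation.
using ApplyPtr = void (*)(const float *, float *, long);
ApplyPtr applyPtr = apply_sse2;

// std::function: type-erased wrapper; more flexible (it can hold stateful
// lambdas) but with extra indirection and possible heap allocation.
std::function<void(const float *, float *, long)> applyFn = apply_sse2;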

[Resolved review thread on CMakeLists.txt]
@cedrik-fuoco-adsk
Contributor

Thank you for the PR @markreidvfx, the implementation looks great! I commented on a few minor things and asked some questions.

  1. Cleanup and move CMake SIMD detection to separate cmake files
     → That would be preferable, as I mentioned in the comments section.

  2. Put CPUInfo class in platform.cpp?
     → I don't necessarily have a strong opinion, but it is indeed a platform-specific thing, so that could make sense. (see comment in PR)

  3. Change the file naming, was thinking something like this
     • ops/lut3d/x86/avx2.cpp
     • ops/lut3d/x86/avx.cpp
     • ops/lut3d/x86/sse2.cpp
     → I like the folder structure as it feels more organized, but I do think that both work and have different arguments going for them. I'm leaning a bit more toward the folder side as I find it more structured, organized, and a simpler filename.

It is going to conflict a bit with the work done in "Adsk Contrib - Add support for neon intrinsic" (#1775), but it shouldn't be too major.

@markreidvfx
Contributor Author

Thanks for reviewing the pull request! It's been a while since I looked at this code; I'll take a deeper dive when I get a chance.

@doug-walker
Collaborator

Awesome, please keep us posted, thanks Mark!

Collaborator

@remia left a comment


Impressive work @markreidvfx. I haven't fully reviewed yet and probably lack the required experience to really contribute here, but I realised I'd left a couple of pending notes sitting for a while now. I will try to take a closer look later.

[Resolved review thread on CMakeLists.txt]
[Resolved review thread on src/OpenColorIO/CPUInfo.cpp]
[Resolved review thread on src/OpenColorIO/SSE2.h]
[Resolved review thread on src/OpenColorIO/AVX.h]
@markreidvfx
Contributor Author

Thanks again everyone for taking the time to review!
@cedrik-fuoco-adsk I'll move the cmake code to a separate file.
I'm going to leave the rest of the folder/file structure as it is for now, unless anyone objects.

@lgritz
Collaborator

lgritz commented May 5, 2023

I don't want to derail this PR, it's really none of my business, but I thought it might be helpful to give a data point about how we handled SIMD in OIIO and OSL.

I think I'm reasonably savvy about hardware features? But I stumble over the intrinsics constantly, can't remember what they mean without looking each one up, and generally find code littered with intrinsics to be nearly impossible to maintain (not to mention that it must be repeated for each ISA you want to code). And code that uses the intrinsics is unreviewable by anybody not intimately familiar with the instruction sets and what each intrinsic does.

So the approach I took in OIIO is to hide it all behind intuitive vector classes in a single header file. This one header is the only place in the entire code base where a CPU-specific intrinsic can be found. The implementations -- be they SSE, AVX, NEON, as well as non-SIMD reference/fallback code -- are within each function or method, separated by appropriate #if guards. This means that places that use the intrinsics (i.e. use the classes) look generic, need no separate implementations for each CPU ISA, and are totally straightforward to read, understand, and review, without any knowledge of CPU intrinsics at all (as long as you trust the underlying class implementations).
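
To make the idea concrete, here is a minimal sketch (an illustration, not OIIO's actual simd.h) of a 4-wide float wrapper where the intrinsics live only behind #if guards and callers write generic, templatable code:

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

struct vfloat4
{
#if defined(__SSE2__)
    __m128 v;
    vfloat4(float a) : v(_mm_set1_ps(a)) {}
    vfloat4(__m128 a) : v(a) {}
    friend vfloat4 operator+(vfloat4 a, vfloat4 b) { return vfloat4(_mm_add_ps(a.v, b.v)); }
    friend vfloat4 operator-(vfloat4 a, vfloat4 b) { return vfloat4(_mm_sub_ps(a.v, b.v)); }
    friend vfloat4 operator*(vfloat4 a, vfloat4 b) { return vfloat4(_mm_mul_ps(a.v, b.v)); }
#else
    // Non-SIMD reference/fallback path for other architectures.
    float v[4];
    vfloat4(float a) { for (int i = 0; i < 4; ++i) v[i] = a; }
    friend vfloat4 operator+(vfloat4 a, vfloat4 b)
    { vfloat4 r(0.0f); for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i]; return r; }
    friend vfloat4 operator-(vfloat4 a, vfloat4 b)
    { vfloat4 r(0.0f); for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] - b.v[i]; return r; }
    friend vfloat4 operator*(vfloat4 a, vfloat4 b)
    { vfloat4 r(0.0f); for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i]; return r; }
#endif
};

// Callers never see an intrinsic: the same templated code works for plain
// float, vfloat4, or a hypothetical vfloat8, given the right overloads.
template <typename T>
T lerp(T a, T b, T t) { return a + (b - a) * t; }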

Here is an example of how 4-wide SIMD is used to accelerate an "over" operation. Clear, yes? And no separate code needed for each ISA.

Here is an example that is even better, a fast implementation of exp2. Where are the intrinsics? They're all hidden behind the templating, because all the right functions and operators are overloaded, so you can say fast_exp2(float) or fast_exp2(vfloat4) or fast_exp2(vfloat8), or whatever. No fuss, no coding it separately for every ISA.

Last example, this time from OSL (which uses OIIO's simd.h, and please excuse the use of the old name, "float4" instead of the new name "vfloat4"), of how OSL implements Perlin noise with OIIO's simd classes. This works for both SSE and NEON, as well as fully non-SIMD on other architectures.

Honestly, there's nothing special about OIIO's simd.h. There are other implementations of SIMD vector classes out there that are roughly equivalent. But I want to make a case for the improved readability of restricting the literal reference to the intrinsic names to just one place in the code base, and wrap them with classes that make all the other scattered uses very intuitive, readable, maintainable even by non-experts, and templatable.

@doug-walker
Collaborator

Thank you @lgritz , for bringing that to our attention! I agree, it's a much more readable and maintainable approach.

@markreidvfx
Contributor Author

Sorry it took me so long to get back to this.
I'm very comfortable with Intel intrinsics, which is the main reason I did it this way. Personally, I like the simplicity of just using the intrinsics directly without needing to know the inner workings of some complex wrapper. I also like being able to freely modify one platform with zero worry of affecting another. There are a few other reasons, but I can totally see how people can be turned off by this approach.

I'll take a deeper look at the OIIO simd header when I get some time, but at first glance it looks pretty straightforward. I'm not doing anything too fancy, but I'm also not sure how this would affect performance without porting and measuring. Everything will need to be reworked if this is the route we wish to take.

@lgritz out of curiosity, how does OIIO do single binary builds that support multiple x86 simd instruction sets dynamically?

@lgritz
Collaborator

lgritz commented May 30, 2023

OIIO doesn't currently do single binary builds that support multiple ISAs. It's chosen at build time. OSL does have something relevant, though, where certain functions that are worth building with ISA-specific instructions are put into a secondary library and compiled separately for several ISAs, then the specific one is incorporated at runtime via dlopen'ing the one that corresponds to the hardware found.
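
For context, that runtime selection could look roughly like the sketch below; this is a hedged illustration using POSIX dlopen/dlsym, not OSL's actual mechanism, and the library and symbol names are invented:

#include <dlfcn.h>
#include <string>

using KernelFn = void (*)(const float *, float *, long);

// Pick the shared library that matches the host CPU, then pull the kernel
// out of it at runtime. Names here are purely illustrative.
KernelFn loadBestKernel(bool hasAvx2)
{
    const std::string lib = hasAvx2 ? "libkernels_avx2.so" : "libkernels_sse2.so";
    void * handle = dlopen(lib.c_str(), RTLD_NOW | RTLD_LOCAL);
    if (!handle)
        return nullptr;
    return reinterpret_cast<KernelFn>(dlsym(handle, "apply_kernel"));
}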

Collaborator

@doug-walker left a comment


This is really awesome, thanks again @markreidvfx !

As mentioned, we plan to include this in OCIO 2.3.0. Do you agree that we should remove the "Draft" flag from the PR? Is there anything else you think needs to be added right now?


__m256 next_r = _mm256_min_ps(lut_max, _mm256_add_ps(prev_r, one_f));
__m256 next_g = _mm256_min_ps(lut_max, _mm256_add_ps(prev_g, one_f));
__m256 next_b = _mm256_min_ps(lut_max, _mm256_add_ps(prev_b, one_f));
Collaborator


If the unit tests are passing, then an input value of NaN is filtered to zero somewhere, as desired. It was more obvious in the previous SSE implementation where that happened. Where does that happen here?

Contributor Author


Values are scaled and clamped before being passed to interp_tetrahedral.

The trick to clamping the NaNs to zero is to use max_ps(value, zero) before min_ps(value, max_value). It is also important for the second arg of the max_ps intrinsic to be the min/zero arg and not the input pixel value.

Here is a small test program showing it working on every possible float value.
https://godbolt.org/z/3439cvPe8
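
As a quick illustration of the argument-order point (a sketch, not the code from this PR): with SSE, maxps returns its second operand when the first is NaN, so putting the clamp bound second turns NaN inputs into that bound.

#include <emmintrin.h>

// Clamp v to [lo, hi]; NaN lanes come out as lo because _mm_max_ps(v, lo)
// returns the second operand (lo) whenever the first operand (v) is NaN.
// Writing _mm_max_ps(lo, v) instead would let the NaN through.
static inline __m128 clamp_nan_to_lo(__m128 v, __m128 lo, __m128 hi)
{
    v = _mm_max_ps(v, lo);
    return _mm_min_ps(v, hi);
}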

On a side note, I've noticed that using this technique can cause issues when using sse2neon.h on clang. I believe it to be a bug in clang, but I haven't reported it to them yet.
DLTcollab/sse2neon#606

Collaborator


Oh you're right, it happens in the caller now.

I looked at your DLTcollab link; we are using different instructions for our min/max implementation in NEON. Please see this PR in SSE.h on lines 35-54.

Contributor Author


I took a quick look at the PR and see it's using vmaxnmq_f32, which is the problem. The fmaxnm instruction only handles quiet NaNs and not the so-called signalling NaNs. I'll continue this discussion on that PR.

Collaborator


I don't think we need/want to suppress Signalling NaNs, do we? My understanding is that arithmetic operations only generate Quiet NaNs and Signaling NaNs are only set programmatically (e.g. for debugging).

Contributor Author


I personally think it's a good idea to clamp them all to zero regardless in the LUT case, especially since pixel values are user-supplied and are being used to calculate memory offsets.

[Resolved review thread on src/OpenColorIO/ops/lut1d/Lut1DOpCPU.cpp]

###############################################################################
# Check if compiler supports X86 SIMD extensions

Collaborator


In other cases, we have cmake try to compile a small sample program that uses the feature. Perhaps that would be more reliable than using check_cxx_compiler_flag? Cedrik offered to add this in a separate PR.

Contributor Author


Cool, that does sound like it might be more reliable. If it can be done in a separate PR that would be great.

@markreidvfx
Contributor Author

I think it's good to remove the Draft.

There is a better SSE2 fallback I'd like to add for CPUs that don't have the F16C extension, but I can do that in a later pull request.

@markreidvfx markreidvfx marked this pull request as ready for review August 20, 2023 04:02
@doug-walker doug-walker merged commit 9cc2486 into AcademySoftwareFoundation:main Aug 23, 2023
22 checks passed
brkglvn01 pushed a commit to brkglvn01/OpenColorIO that referenced this pull request Oct 23, 2023
…ion#1687)

* Add AVX2/AVX/SSE2 accelerated pack/unpacking function templates
* Add AVX2/AVX/SSE2 accelerated Lut3D Tetrahedral implementations
* Add AVX2/AVX/SSE2 accelerated linear Lut1D implementations
* Fix a bunch of typos
* Remove USE_SSE code that is no longer needed
* Use alignas specifier
* Move x86 simd checking code to seperate file
* Fix cacheID test, compare lengths and everything but the cacheID hash
* Remove debug gather code
* fixed outBD typo

Signed-off-by: Mark Reid <mindmark@gmail.com>
Co-authored-by: Doug Walker <doug.walker@autodesk.com>
Signed-off-by: Brooke <beg9562@rit.edu>
doug-walker added a commit to autodesk-forks/OpenColorIO that referenced this pull request Dec 6, 2023