
Add AVX2/AVX/SSE2 SIMD accelerated 1D/3D LUTS #1687

Merged

Conversation

markreidvfx
Contributor

I'm still messing around with this but wanted to share a work in progress for some feedback.
This is based on work I've done with 3D LUTs here (#1681), and most of the code is ported from that project.

Here are some of the current performance results

ocioperf.exe --transform tests/data/files/clf/lut1d_32f_example.clf

[lut1d benchmark chart]

ocioperf.exe --transform tests/data/files/clf/lut3d_preview_tier_test.clf

[lut3d benchmark chart]

Supporting additional x86 SIMD instruction sets adds more complexity to the build system. Some of the following things need to be considered:

  • Is the instruction set enabled in the build?
  • Is the instruction set supported by the compiler, and what flags turn it on?
  • Is the instruction set supported by the CPU?

The really tricky bit is that the SIMD instruction sets a CPU supports vary between models and brands. If a CPU encounters an instruction it doesn't support, the program will simply crash. So you can't just turn on the AVX/AVX2 compiler flags for the whole build if you want to run on a wide variety of systems. Instead, each implementation is a separate compilation unit, and the compiler flags are only applied to that unit.
The cpuid instruction can be used at runtime to determine which instructions your CPU supports, and the best implementation is then chosen; this is currently done with a function pointer.
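
A minimal sketch of that runtime-dispatch idea (this is not the PR's actual CPUInfo code; it assumes GCC/Clang's __builtin_cpu_supports for feature detection, and the apply_lut_* kernels are hypothetical names):

#include <cstddef>

// Each of these would live in its own translation unit, compiled with the
// matching -mavx2 / -msse2 flags, so the rest of the library stays generic.
void apply_lut_avx2(const float * in, float * out, std::size_t count);   // hypothetical
void apply_lut_sse2(const float * in, float * out, std::size_t count);   // hypothetical
void apply_lut_scalar(const float * in, float * out, std::size_t count); // hypothetical fallback

using ApplyFunc = void (*)(const float *, float *, std::size_t);

// Query the CPU once and cache the best implementation in a function pointer.
ApplyFunc chooseApplyFunc()
{
#if defined(__GNUC__) || defined(__clang__)
    if (__builtin_cpu_supports("avx2")) return apply_lut_avx2;
    if (__builtin_cpu_supports("sse2")) return apply_lut_sse2;
#endif
    return apply_lut_scalar;
}

static const ApplyFunc g_applyLut = chooseApplyFunc();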

Some of the things I'm currently thinking of doing:

  • Cleanup and move CMake SIMD detection to separate cmake files
  • Put CPUInfo class in platform.cpp?
  • Change the file naming, was thinking something like this
    • ops/lut3d/x86/avx2.cpp
    • ops/lut3d/x86/avx.cpp
    • ops/lut3d/x86/sse2.cpp

The pull request is also very large, so I was thinking I might break it into separate smaller requests:
perhaps one for the AVX2/AVX build support/tests, one for lut3d, and one for lut1d.


@doug-walker
Collaborator

@markreidvfx, it's going to take some time for me to do a proper review, but I just wanted to say thank you so much for this PR! I especially appreciate the comments explaining the packing being used with the intrinsics and the many unit tests. It's really great to have you contributing to the project!

I think the naming is fine as is (leaving lut1d or lut3d in the module names) and it's fine to leave it in one big PR, if that's easiest.

@markreidvfx
Contributor Author

markreidvfx commented Sep 18, 2022

Thanks @doug-walker :) I tried to add a lot of tests for the packing/unpacking because it can be tricky to get right, and I'm hoping the infrastructure could be used for adding SIMD acceleration to other ops in the future.

[Resolved review thread on CMakeLists.txt]

#include "CPUInfo.h"

typedef void (Lut1DOpCPUApplyFunc)(const float *, const float *, const float *, int, const void *, void *, long);
Contributor


You mentioned that you were using a function pointer; just wondering if you thought of using std::function?
I think it is fair to use a function pointer here (for speed). The overhead from std::function might be unnecessary.

Contributor Author


std::function isn't something I've used before, so I'm not sure what the benefit would be? I typically avoid fancy C++ features when focusing on performance, haha. I'll take a look.
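
For illustration only (not code from this PR, and the apply_sse2 kernel name is made up), the two options being discussed look roughly like this; the std::function version adds type erasure and extra indirection on each call:

#include <functional>

void apply_sse2(const float * in, float * out, long count); // hypothetical kernel

// Raw function pointer: a single indirect call, no allocation.
using ApplyPtr = void (*)(const float *, float *, long);
ApplyPtr applyPtr = apply_sse2;

// std::function: type-erased wrapper; more flexible (it can hold stateful
// lambdas) but with extra indirection and possible heap allocation.
std::function<void(const float *, float *, long)> applyFn = apply_sse2;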

[Resolved review thread on CMakeLists.txt]
@cedrik-fuoco-adsk
Contributor

Thank you for the PR @markreidvfx, the implementation looks great! I commented on a few minor things and asked some questions.

  1. Cleanup and move CMake SIMD detection to separate cmake files
     → That would be preferable, as I mentioned in the comments section.

  2. Put CPUInfo class in platform.cpp?
     → I don't necessarily have a strong opinion, but it is indeed a platform-specific thing, so that could make sense. (see comment in PR)

  3. Change the file naming, was thinking something like this
     • ops/lut3d/x86/avx2.cpp
     • ops/lut3d/x86/avx.cpp
     • ops/lut3d/x86/sse2.cpp
     → I like the folder structure as it feels more organized, but I do think that both work and have different arguments going for them. I'm leaning a bit more toward the folder side as I find it more structured, organized, and a simpler filename.

It is going to conflict a bit with the work done in "Adsk Contrib - Add support for neon intrinsic" (#1775), but it shouldn't be too major.

@markreidvfx
Contributor Author

Thanks for reviewing the pull request! It's been a while since I looked at this code; I'll take a deeper dive when I get a chance.

@doug-walker
Collaborator

Awesome, please keep us posted, thanks Mark!

Collaborator

@remia left a comment


Impressive work @markreidvfx. I haven't fully reviewed yet and probably lack the required experience to really contribute here, but I realised I'd left a couple of pending notes sitting for a while now. I will try to take a closer look later.

[Resolved review thread on CMakeLists.txt]
[Resolved review thread on src/OpenColorIO/CPUInfo.cpp]
[Resolved review thread on src/OpenColorIO/SSE2.h]
[Resolved review thread on src/OpenColorIO/AVX.h]
@markreidvfx
Contributor Author

Thanks again everyone for taking the time to review!
@cedrik-fuoco-adsk I'll move the cmake code to a separate file.
I'm going to leave the rest of the folder/file structure as it is for now, unless anyone objects.

@lgritz
Collaborator

lgritz commented May 5, 2023

I don't want to derail this PR, it's really none of my business, but I thought it might be helpful to give a data point about how we handled SIMD in OIIO and OSL.

I think I'm reasonably savvy about hardware features? But I stumble over the intrinsics constantly, can't remember what they mean without looking each one up, and generally find code littered with intrinsics to be nearly impossible to maintain (not to mention that it must be repeated for each ISA you want to code). And code that uses the intrinsics is unreviewable by anybody not intimately familiar with the instruction sets and what each intrinsic does.

So the approach I took in OIIO is to hide it all behind intuitive vector classes in a single header file. This one header is the only place in the entire code base where a CPU-specific intrinsic can be found. The implementations -- be they SSE, AVX, NEON, as well as non-SIMD reference/fallback code -- are within each function or method, separated by appropriate #if guards. This means that places that use the intrinsics (i.e. use the classes) look generic, need no separate implementations for each CPU ISA, and are totally straightforward to read, understand, and review, without any knowledge of CPU intrinsics at all (as long as you trust the underlying class implementations).
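
To make the idea concrete, here is a minimal sketch (an illustration, not OIIO's actual simd.h) of a 4-wide float wrapper where the intrinsics live only behind #if guards and callers write generic, templatable code:

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

struct vfloat4
{
#if defined(__SSE2__)
    __m128 v;
    vfloat4(float a) : v(_mm_set1_ps(a)) {}
    vfloat4(__m128 a) : v(a) {}
    friend vfloat4 operator+(vfloat4 a, vfloat4 b) { return vfloat4(_mm_add_ps(a.v, b.v)); }
    friend vfloat4 operator-(vfloat4 a, vfloat4 b) { return vfloat4(_mm_sub_ps(a.v, b.v)); }
    friend vfloat4 operator*(vfloat4 a, vfloat4 b) { return vfloat4(_mm_mul_ps(a.v, b.v)); }
#else
    // Non-SIMD reference/fallback path for other architectures.
    float v[4];
    vfloat4(float a) { for (int i = 0; i < 4; ++i) v[i] = a; }
    friend vfloat4 operator+(vfloat4 a, vfloat4 b)
    { vfloat4 r(0.0f); for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i]; return r; }
    friend vfloat4 operator-(vfloat4 a, vfloat4 b)
    { vfloat4 r(0.0f); for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] - b.v[i]; return r; }
    friend vfloat4 operator*(vfloat4 a, vfloat4 b)
    { vfloat4 r(0.0f); for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i]; return r; }
#endif
};

// Callers never see an intrinsic: the same templated code works for plain
// float, vfloat4, or a hypothetical vfloat8, given the right overloads.
template <typename T>
T lerp(T a, T b, T t) { return a + (b - a) * t; }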

Here is an example of how 4-wide SIMD is used to accelerate an "over" operation. Clear, yes? And no separate code needed for each ISA.

Here is an example that is even better, a fast implementation of exp2. Where are the intrinsics? They're all hidden behind the templating, because all the right functions and operators are overloaded, so you can say fast_exp2(float) or fast_exp2(vfloat4) or fast_exp2(vfloat8), or whatever. No fuss, no coding it separately for every ISA.

Last example, this time from OSL (which uses OIIO's simd.h, and please excuse the use of the old name, "float4" instead of the new name "vfloat4"), of how OSL implements Perlin noise with OIIO's simd classes. This works for both SSE and NEON, as well as fully non-SIMD on other architectures.

Honestly, there's nothing special about OIIO's simd.h. There are other implementations of SIMD vector classes out there that are roughly equivalent. But I want to make a case for the improved readability of restricting the literal reference to the intrinsic names to just one place in the code base, and wrap them with classes that make all the other scattered uses very intuitive, readable, maintainable even by non-experts, and templatable.

@doug-walker
Collaborator

Thank you @lgritz , for bringing that to our attention! I agree, it's a much more readable and maintainable approach.

@markreidvfx
Contributor Author

Sorry it took me so long to get back to this.
I'm very comfortable with Intel intrinsics, which is the main reason I did it this way. Personally, I like the simplicity of just using the intrinsics directly without needing to know the inner workings of some complex wrapper. I also like being able to freely modify one platform with zero worry of affecting another. There are a few other reasons, but I can totally see how people can be turned off by this approach.

I'll take a deeper look at the OIIO simd header when I get some time, but at first glance it looks pretty straightforward. I'm not doing anything too fancy, but I'm also not sure how this would affect performance without porting and measuring. Everything will need to be reworked if this is the route we wish to take.

@lgritz out of curiosity, how does OIIO do single binary builds that support multiple x86 simd instruction sets dynamically?

@lgritz
Collaborator

lgritz commented May 30, 2023

OIIO doesn't currently do single binary builds that support multiple ISAs. It's chosen at build time. OSL does have something relevant, though, where certain functions that are worth building with ISA-specific instructions are put into a secondary library and compiled separately for several ISAs, then the specific one is incorporated at runtime via dlopen'ing the one that corresponds to the hardware found.
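
For context, that runtime selection could look roughly like the sketch below; this is a hedged illustration using POSIX dlopen/dlsym, not OSL's actual mechanism, and the library and symbol names are invented:

#include <dlfcn.h>
#include <string>

using KernelFn = void (*)(const float *, float *, long);

// Pick the shared library that matches the host CPU, then pull the kernel
// out of it at runtime. Names here are purely illustrative.
KernelFn loadBestKernel(bool hasAvx2)
{
    const std::string lib = hasAvx2 ? "libkernels_avx2.so" : "libkernels_sse2.so";
    void * handle = dlopen(lib.c_str(), RTLD_NOW | RTLD_LOCAL);
    if (!handle)
        return nullptr;
    return reinterpret_cast<KernelFn>(dlsym(handle, "apply_kernel"));
}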

Collaborator

@doug-walker left a comment


This is really awesome, thanks again @markreidvfx !

As mentioned, we plan to include this in OCIO 2.3.0. Do you agree that we should remove the "Draft" flag from the PR? Is there anything else you think needs to be added right now?


__m256 next_r = _mm256_min_ps(lut_max, _mm256_add_ps(prev_r, one_f));
__m256 next_g = _mm256_min_ps(lut_max, _mm256_add_ps(prev_g, one_f));
__m256 next_b = _mm256_min_ps(lut_max, _mm256_add_ps(prev_b, one_f));
Collaborator


If the unit tests are passing, then an input value of NaN is filtered to zero somewhere, as desired. It was more obvious in the previous SSE implementation where that happened. Where does that happen here?

Contributor Author


Values are scaled and clamped before being passed to interp_tetrahedral.

The trick to clamping the NaNs to zero is to use max_ps(value, zero) before min_ps(value, max_value). It is also important for the second arg of the max_ps intrinsic to be the min/zero arg and not the input pixel value.

Here is a small test program showing it working on every possible float value.
https://godbolt.org/z/3439cvPe8
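
As a quick illustration of the argument-order point (a sketch, not the code from this PR): with SSE, maxps returns its second operand when the first is NaN, so putting the clamp bound second turns NaN inputs into that bound.

#include <emmintrin.h>

// Clamp v to [lo, hi]; NaN lanes come out as lo because _mm_max_ps(v, lo)
// returns the second operand (lo) whenever the first operand (v) is NaN.
// Writing _mm_max_ps(lo, v) instead would let the NaN through.
static inline __m128 clamp_nan_to_lo(__m128 v, __m128 lo, __m128 hi)
{
    v = _mm_max_ps(v, lo);
    return _mm_min_ps(v, hi);
}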

On a side note, I've noticed that using this technique can cause issues when using sse2neon.h on clang. I believe it to be a bug in clang, but I haven't reported it to them yet.
DLTcollab/sse2neon#606

Collaborator


Oh you're right, it happens in the caller now.

I looked at your DLTcollab link; we are using different instructions for our min/max implementation in NEON. Please see this PR in SSE.h on lines 35-54.

Contributor Author


I took a quick look at the PR and see it's using vmaxnmq_f32, which is the problem. The fmaxnm instruction only handles quiet NaNs and not the so-called signalling NaNs. I'll continue this discussion on that PR.

Collaborator


I don't think we need/want to suppress Signalling NaNs, do we? My understanding is that arithmetic operations only generate Quiet NaNs and Signaling NaNs are only set programmatically (e.g. for debugging).

Contributor Author


I personally think it's a good idea to clamp them all to zero regardless in the LUT case, especially since pixel values are user-supplied and are being used to calculate memory offsets.

[Resolved review thread on src/OpenColorIO/ops/lut1d/Lut1DOpCPU.cpp]

###############################################################################
# Check if compiler supports X86 SIMD extensions

Collaborator


In other cases, we have cmake try to compile a small sample program that uses the feature. Perhaps that would be more reliable than using check_cxx_compiler_flag? Cedrik offered to add this in a separate PR.

Contributor Author


Cool, that does sound like it might be more reliable. If it can be done in a separate PR that would be great.

@markreidvfx
Contributor Author

I think it's good to remove the Draft.

There is a better SSE2 fallback I'd like to add for CPUs that don't have the F16C extension, but I can do that in a later pull request.

@markreidvfx markreidvfx marked this pull request as ready for review August 20, 2023 04:02
@doug-walker doug-walker merged commit 9cc2486 into AcademySoftwareFoundation:main Aug 23, 2023
22 checks passed
brkglvn01 pushed a commit to brkglvn01/OpenColorIO that referenced this pull request Oct 23, 2023
…ion#1687)

* Add AVX2/AVX/SSE2 accelerated pack/unpacking function templates
* Add AVX2/AVX/SSE2 accelerated Lut3D Tetrahedral implementations
* Add AVX2/AVX/SSE2 accelerated linear Lut1D implementations
* Fix a bunch of typos
* Remove USE_SSE code that is no longer needed
* Use alignas specifier
* Move x86 simd checking code to seperate file
* Fix cacheID test, compare lengths and everything but the cacheID hash
* Remove debug gather code
* fixed outBD typo

Signed-off-by: Mark Reid <mindmark@gmail.com>
Co-authored-by: Doug Walker <doug.walker@autodesk.com>
Signed-off-by: Brooke <beg9562@rit.edu>
doug-walker added a commit to autodesk-forks/OpenColorIO that referenced this pull request Dec 6, 2023