Add AVX2/AVX/SSE2 SIMD accelerated 1D/3D LUTS #1687
Conversation
@markreidvfx, it's going to take some time for me to do a proper review, but I just wanted to say thank you so much for this PR! I especially appreciate the comments explaining the packing being used with the intrinsics and the many unit tests. It's really great to have you contributing to the project! I think the naming is fine as is (leaving lut1d or lut3d in the module names) and it's fine to leave it in one big PR, if that's easiest.
Thanks @doug-walker :) I tried to add a lot of tests for the packing/unpacking because it can be tricky to get right, and I'm hoping the infrastructure could be used for adding SIMD acceleration to other ops in the future.
#include "CPUInfo.h"

typedef void (Lut1DOpCPUApplyFunc)(const float *, const float *, const float *, int, const void *, void *, long);
You mentioned that you were using a function pointer; just wondering if you thought of using std::function?
I think it is fair to use a function pointer here (for speed). The overhead from std::function might be unnecessary.
std::function isn't something I've used before, so I'm not sure what the benefit would be. I typically avoid fancy C++ features when focusing on performance, haha. I'll take a look.
Thank you for the PR @markreidvfx, the implementation looks great! I commented on a few minor things and asked some questions.
It is going to conflict a bit with the work done in Adsk Contrib - Add support for neon intrinsic #1775 but it shouldn't be too major.
Thanks for reviewing the pull request! It's been a while since I looked at this code; I'll take a deeper dive when I get a chance.
Awesome, please keep us posted, thanks Mark!
Impressive work @markreidvfx! I haven't fully reviewed yet and probably lack the required experience to really contribute here, but realised I'd left a couple of notes pending for a while now. I will try to get a closer look later.
Thanks again everyone for taking the time to review!
I don't want to derail this PR, it's really none of my business, but I thought it might be helpful to give a data point about how we handled SIMD in OIIO and OSL.

I think I'm reasonably savvy about hardware features? But I stumble over the intrinsics constantly, can't remember what they mean without looking each one up, and generally find code littered with intrinsics to be nearly impossible to maintain (not to mention that it must be repeated for each ISA you want to code). And code that uses the intrinsics is unreviewable by anybody not intimately familiar with the instruction sets and what each intrinsic does.

So the approach I took in OIIO is to hide it all behind intuitive vector classes in a single header file. This one header is the only place in the entire code base where a CPU-specific intrinsic can be found. The implementations -- be they SSE, AVX, NEON, as well as non-SIMD reference/fallback code -- are within each function or method, separated by the appropriate preprocessor conditionals.

Here is an example of how 4-wide SIMD is used to accelerate an "over" operation. Clear, yes? And no separate code needed for each ISA.

Here is an example that is even better, a fast implementation of exp2. Where are the intrinsics? They're all hidden behind the templating, because all the right functions and operators are overloaded, so the vector code reads just like scalar code.

Last example, this time from OSL (which uses OIIO's simd.h, and please excuse the use of the old name, "float4", instead of the new name "vfloat4"), of how OSL implements Perlin noise with OIIO's simd classes. This works for both SSE and NEON, as well as fully non-SIMD on other architectures.

Honestly, there's nothing special about OIIO's simd.h. There are other implementations of SIMD vector classes out there that are roughly equivalent. But I want to make a case for the improved readability of restricting the literal reference to the intrinsic names to just one place in the code base, and wrapping them with classes that make all the other scattered uses very intuitive, readable, maintainable even by non-experts, and templatable.
Thank you @lgritz, for bringing that to our attention! I agree, it's a much more readable and maintainable approach.
Sorry it took me so long to get back to this. I'll take a deeper look at the OIIO simd header when I get some time, but at first glance it looks pretty straightforward. I'm not doing anything too fancy, but I'm also not sure how this would affect performance without porting and measuring. Everything will need to be reworked if this is the route we wish to take. @lgritz, out of curiosity, how does OIIO do single binary builds that support multiple x86 SIMD instruction sets dynamically?
OIIO doesn't currently do single binary builds that support multiple ISAs. It's chosen at build time. OSL does have something relevant, though, where certain functions that are worth building with ISA-specific instructions are put into a secondary library and compiled separately for several ISAs, then the specific one is incorporated at runtime via dlopen'ing the one that corresponds to the hardware found.
This is really awesome, thanks again @markreidvfx !
As mentioned, we plan to include this in OCIO 2.3.0. Do you agree that we should remove the "Draft" flag from the PR? Is there anything else you think needs to be added right now?
__m256 next_r = _mm256_min_ps(lut_max, _mm256_add_ps(prev_r, one_f));
__m256 next_g = _mm256_min_ps(lut_max, _mm256_add_ps(prev_g, one_f));
__m256 next_b = _mm256_min_ps(lut_max, _mm256_add_ps(prev_b, one_f));
If the unit tests are passing, then an input value of NaN is filtered to zero somewhere, as desired. It was more obvious in the previous SSE implementation where that happened. Where does that happen here?
Values are scaled and clamped before being passed to interp_tetrahedral.

The trick to clamping the NaNs to zero is to use max_ps(value, zero) before min_ps(value, max_value). It is also important for the second arg of the max_ps intrinsic to be the min/zero arg and not the input pixel value.

Here is a small test program showing it working on every possible float value: https://godbolt.org/z/3439cvPe8

On a side note, I've noticed using this technique can cause issues when using sse2neon.h on clang. I believe it to be a bug in clang, but haven't reported it to them yet. DLTcollab/sse2neon#606
Oh you're right, it happens in the caller now.
I looked at your DLTcollab link; we are using different instructions for our min/max implementation in Neon. Please see this PR in SSE.h on lines 35-54.
I took a quick look at the PR and see it's using vmaxnmq_f32, which is the problem. The fmaxnm instruction only handles quiet NaNs and not the so-called signalling NaNs. I'll continue this discussion on that PR.
I don't think we need/want to suppress signalling NaNs, do we? My understanding is that arithmetic operations only generate quiet NaNs and signalling NaNs are only set programmatically (e.g. for debugging).
I personally think it's a good idea to clamp them all to zero regardless in the LUT case, especially since pixel values are user supplied and are being used to calculate memory offsets.
###############################################################################
# Check if compiler supports X86 SIMD extensions
In other cases, we have cmake try to compile a small sample program that uses the feature. Perhaps that would be more reliable than using check_cxx_compiler_flag? Cedrik offered to add this in a separate PR.
Cool, that does sound like it might be more reliable. If it can be done in a separate PR that would be great.
I think it's good to remove the "Draft" flag. There is a better SSE2 fallback I'd like to add for CPUs that don't have the F16C extensions, but I can do that in a later pull request.
Signed-off-by: Mark Reid <mindmark@gmail.com>
…ion#1687)

* Add AVX2/AVX/SSE2 accelerated pack/unpacking function templates
* Add AVX2/AVX/SSE2 accelerated Lut3D tetrahedral implementations
* Add AVX2/AVX/SSE2 accelerated linear Lut1D implementations
* Fix a bunch of typos
* Remove USE_SSE code that is no longer needed
* Use alignas specifier
* Move x86 SIMD checking code to separate file
* Fix cacheID test, compare lengths and everything but the cacheID hash
* Remove debug gather code
* Fixed outBD typo

Signed-off-by: Mark Reid <mindmark@gmail.com>
Co-authored-by: Doug Walker <doug.walker@autodesk.com>
Signed-off-by: Brooke <beg9562@rit.edu>
I'm still messing around with this but wanted to share a work in progress for some feedback.
This is based on work I've done with 3D LUTs in #1681, and most of the code is ported from that project.
Here are some of the current performance results:
ocioperf.exe --transform tests/data/files/clf/lut1d_32f_example.clf
ocioperf.exe --transform tests/data/files/clf/lut3d_preview_tier_test.clf
Supporting additional x86 SIMD instruction sets adds more complexity to the build system. Some of the following things need to be considered.
The really tricky bit is that which SIMD instruction sets a CPU supports varies between models and brands. If a CPU encounters an instruction it doesn't support, the program will simply crash. So you can't just turn on the AVX/AVX2 compiler flags for the whole build if you want it to run on a wide variety of systems. Instead, each implementation is a separate compilation unit, and the ISA-specific compiler flags are only used on that unit.
The cpuid instruction can be used at runtime to determine which instruction sets the CPU supports, and the best implementation can be chosen then; this is currently done with a function pointer.
Some of the things I'm currently thinking of doing:
The pull request is also very large, so I was thinking I might break it into separate smaller requests: perhaps one for the AVX2/AVX build support/tests, one for lut3d, and one for lut1d.