ENH: Speedup clip for floating point #26280
Conversation
numpy/_core/src/umath/clip.cpp
```c
if (x < min_val) x = min_val;
if (x > max_val) x = max_val;
```
```suggestion
if (x > max_val) x = max_val;
else if (x < min_val) x = min_val;
```
(not sure whether this is really faster or more readable, but maybe worth a try)
https://godbolt.org/z/MTxasfG33
The version I provided has better generated code. (3 instructions and branchless)
Nice tool! Is indeed better
Some small comments, but the implementation and performance gain look good!
We discussed this at a recent triage meeting. This seems to make sense. Would you get the same speed improvement by using the
The generated code is the same for -O2 and -O3.
I take this to mean that "long double" should also use this specialized path. I have added it and updated the asv output above accordingly. I also added int32 and int64 benchmarks for comparison's sake. However, "complex floating" types are still using the original path, so the floating point error checks cannot be removed there.
I ran asv on another machine (Intel), and longdouble does get an improvement.
I have an arm64 chip (M2 Max). I just ran the same asv benchmark as above and got the following results:
The idea was to effectively compile the whole file with it, to see if the compiler isn't smart enough to do some of those optimizations then. We also have But maybe it isn't, and to really use it, it seems a bit like the ufunc build needs to be split into a default part and a part that defaults to
Given your godbolt code, you would have to look at something like this:
and I can't make out if the compiler lifts the NaN check out of the loop for the scalars there. It seems to do some surprisingly heavy auto-vectorization there. (I don't think the
So we can compare the following 2 snippets at -O3:

```c
void clip_loop1(const float *input, float *output, size_t nelems, float min_val, float max_val)
{
    for (size_t k=0; k < nelems; k++) {
        output[k] = _NPY_CLIP(input[k], min_val, max_val);
    }
}

// prior to calling this, min_val and max_val have been verified to be non-nan
void clip_loop2(const float *input, float *output, size_t nelems, float min_val, float max_val)
{
    for (size_t k=0; k < nelems; k++) {
        float x = input[k];
        if (x < min_val) x = min_val;
        if (x > max_val) x = max_val;
        output[k] = x;
    }
}
```

We can then compare the SIMD-ified 4-at-a-time loops.

main vectorized loop of `clip_loop1`:
main vectorized loop of `clip_loop2`:
For MSVC, only
I tried annotating the 3 floating point functions like so. Based on inspection of the disassembly of the compiled object file, this does trigger the SIMD-ification. (For whatever reason, NPY_GCC_OPT_3 didn't work.)

```c
NPY_NO_EXPORT void
__attribute__((optimize("O3")))
FLOAT_clip(char **args, npy_intp const *dimensions, npy_intp const *steps,
           void *NPY_UNUSED(func))
{
    ...
}
```

As reported by asv,
Thanks for checking; it seems like the float special path is necessary for lifting the nan handling (at least I presume that is the mechanism here). Also nice to know that O3 doesn't actually help with the loop at all :).
I got
At some point MSVC had a bug where it failed to inline
You would be referring to #20134. But inlining
FWIW, clipping floats is about the one thing that is really useful here, so for a good speedup it seems OK to just duplicate the code in a simple way. Or does anyone disagree? It would be nice if someone more familiar with these tags could chime in, just in case there is a nicer way to do the same thing, but if nobody does, we should maybe just move on soon. The
I have somewhat reduced the amount of code duplication.
Based on the discussion in #26341, I have changed the pointers to

I have also verified that keeping the "contiguous" code block is necessary to keep the good performance.
There's a subtlety that results in different output when considering negative zeros. A negative zero compares equal to a positive zero. Assuming all operands are non-nan for the sake of this discussion, the main branch implements:

```c
x = x > vmin ? x : vmin;
x = x < vmax ? x : vmax;
```

```pycon
>>> np.clip(np.array([-0.0, +0.0]), 0, 1)
array([0., 0.])
```

Whereas this PR implements:

```c
x = x < vmin ? vmin : x;
x = x > vmax ? vmax : x;
```

```pycon
>>> np.clip(np.array([-0.0, +0.0]), 0, 1)
array([-0., 0.])
```

Note that the current documentation of
I have done some limited testing, and to emulate the behavior of the main branch would incur a loss in performance.
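A tiny standalone check (mine, not from the PR) of why the comparisons cannot see the difference: the two zeros compare equal, and only the sign bit distinguishes them.

```python
import numpy as np

a = np.array([-0.0, +0.0])

# the zeros compare equal, so clipping based on < / > may pick either one
print(bool(a[0] == a[1]))   # True
# only the sign bit tells them apart
print(np.signbit(a))        # [ True False]
```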
> There's a subtlety that results in different output when considering negative zeros

Thanks for thinking of that subtlety! Since C99 `fmin`/`fmax` don't guarantee handling of signed zeros (and we can't use those anyway because they don't propagate NaNs), I am happy to ignore it.
(`min`/`max` also don't handle it specially, I think. I.e. it might be nice if we had `min(-0, 0) == -0` and `max(-0, 0) == 0`, but C99 doesn't make it easy and I am not sure if hardware normally supports it quickly.)
I suspect you could restore the old behavior by changing the checks to `<=` and `>=`, and that would only make a difference if you have a lot of boundary values? But honestly, it doesn't matter.
In either case, I am not sure I like a new bench file for this, but it seems OK. One small nitpick: putting the stride branch earlier is nice if the code is more complex, but here it is so simple that I am not sure it is worthwhile, so just mentioning it.
Thanks @pijyoi! I am happy with putting this in, modulo a few nitpicks like the `static inline` and putting braces.
```python
self.dataout = np.full_like(self.array, 128)

def time_clip(self, dtype, size):
    np.clip(self.array, 32, 224, self.dataout)
```
I don't love having a file just for clip, but `bench_ufuncs.py` is a weird monster and I am not sure clip can be integrated with the parametrized tests either. So OK with me.
Now this is where things differ between compilers.

With GCC (and clang) (https://godbolt.org/z/xxe3MGj5a), more assembler instructions are generated for `<=` than for `<`. This hurts the performance, but the generated code handles the negative zero correctly.

With MSVC (and the Intel compiler) (https://godbolt.org/z/neb3T54cG), the same assembler instructions are generated for `<=` as for `<`, so performance stays the same, but the generated code doesn't handle the negative zero correctly.
I guess not, since:

```pycon
>>> np.sign(np.array([-0.0, 0.0]))
array([0., 0.])
```
Well, you should have to use
Hah, interesting. Too much info about instructions: So there seems to be
Thanks @pijyoi also for fixing that striding bug, let's give this a shot. If someone has an idea of how to clean up the code a bit, we can always follow up. Plus, I suspect that eventually this code will be changed a lot anyway (because it is based on using tags, and we rely on C++17 now, so it should be possible to do this more simply).
`np.clip` with floating point values is slower than it should be due to having to check and propagate `nan`s. This PR speeds up the operation by:

- checking that `vmin` and `vmax` are not `nan`s only once.
- `nan`s in the input data will naturally propagate without having to check explicitly.

Sample output of `asv` on my laptop (WSL2 Ubuntu 22.04):

I was not able to get `asv` to run on native Windows. But the gains (from other testing) were even larger, because the `main` branch code was running slower on native Windows than on WSL2 to begin with (due to poorer codegen for MSVC).

https://godbolt.org/z/TPoax4b9v