
ENH: Speedup clip for floating point #26280

Merged · 2 commits merged into numpy:main on May 2, 2024
Conversation

@pijyoi (Contributor) commented Apr 15, 2024

np.clip with floating point values is slower than it should be because it has to check for and propagate NaNs.
This PR speeds up the operation by:

  1. checking that vmin and vmax are not NaN, only once.
  2. writing the clipping operation such that any NaNs in the input data propagate naturally, without an explicit check.
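Point 2 relies on IEEE-754 comparison semantics: any ordered comparison involving NaN is false. A minimal Python model of the trick (illustrative only, not the actual C loop in clip.cpp):

```python
import math

def clip_fast(x, vmin, vmax):
    # vmin/vmax have been pre-checked to be non-NaN, so only x can be NaN.
    # A comparison involving NaN is always false, so a NaN x skips both
    # branches and propagates to the output unchanged.
    if x < vmin:
        x = vmin
    if x > vmax:
        x = vmax
    return x

print(clip_fast(5.0, 0.0, 1.0))           # -> 1.0
print(clip_fast(float("nan"), 0.0, 1.0))  # -> nan, with no isnan() call
```

Because neither branch fires for NaN input, the per-element isnan check of the original loop becomes unnecessary.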

Sample output of asv on my laptop (WSL2 Ubuntu 22.04)

· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For numpy commit bb069125 <speedup-float-clip~1> (round 1/2):
[ 0.00%] ·· Building for virtualenv-py3.10-Cython-build-packaging.....................................................
[ 0.00%] ·· Benchmarking virtualenv-py3.10-Cython-build-packaging
[12.50%] ··· Running (bench_clip.ClipFloat.time_clip--)..
[25.00%] · For numpy commit 3e1b55dc <speedup-float-clip> (round 1/2):
[25.00%] ·· Building for virtualenv-py3.10-Cython-build-packaging.
[25.00%] ·· Benchmarking virtualenv-py3.10-Cython-build-packaging
[37.50%] ··· Running (bench_clip.ClipFloat.time_clip--)..
[50.00%] · For numpy commit 3e1b55dc <speedup-float-clip> (round 2/2):
[50.00%] ·· Benchmarking virtualenv-py3.10-Cython-build-packaging
[62.50%] ··· bench_clip.ClipFloat.time_clip                                                                                        ok
[62.50%] ··· ================== ============= ============
             --                            size
             ------------------ --------------------------
                   dtype             100         100000
             ================== ============= ============
               numpy.float32     2.22±0.01μs   16.3±0.2μs
               numpy.float64     1.97±0.09μs    32.3±2μs
              numpy.longdouble    3.05±0.2μs    384±2μs
             ================== ============= ============

[75.00%] ··· bench_clip.ClipInteger.time_clip                                                                                      ok
[75.00%] ··· ============= ============= ============
             --                       size
             ------------- --------------------------
                 dtype          100         100000
             ============= ============= ============
              numpy.int32   2.27±0.02μs   25.2±0.3μs
              numpy.int64   2.00±0.01μs    76.7±1μs
             ============= ============= ============

[75.00%] · For numpy commit bb069125 <speedup-float-clip~1> (round 2/2):
[75.00%] ·· Building for virtualenv-py3.10-Cython-build-packaging.
[75.00%] ·· Benchmarking virtualenv-py3.10-Cython-build-packaging
[87.50%] ··· bench_clip.ClipFloat.time_clip                                                                                        ok
[87.50%] ··· ================== ============= ============
             --                            size
             ------------------ --------------------------
                   dtype             100         100000
             ================== ============= ============
               numpy.float32     2.25±0.01μs   33.0±0.6μs
               numpy.float64     1.98±0.03μs    63.8±1μs
              numpy.longdouble   2.96±0.02μs    383±3μs
             ================== ============= ============

[100.00%] ··· bench_clip.ClipInteger.time_clip                                                                                      ok
[100.00%] ··· ============= ============= ============
              --                       size
              ------------- --------------------------
                  dtype          100         100000
              ============= ============= ============
               numpy.int32   2.28±0.01μs   24.9±0.1μs
               numpy.int64   2.01±0.03μs    75.4±2μs
              ============= ============= ============

| Change   | Before [bb069125] <speedup-float-clip~1>   | After [3e1b55dc] <speedup-float-clip>   |   Ratio | Benchmark (Parameter)                                           |
|----------|--------------------------------------------|-----------------------------------------|---------|-----------------------------------------------------------------|
| -        | 63.8±1μs                                   | 32.3±2μs                                |    0.51 | bench_clip.ClipFloat.time_clip(<class 'numpy.float64'>, 100000) |
| -        | 33.0±0.6μs                                 | 16.3±0.2μs                              |    0.49 | bench_clip.ClipFloat.time_clip(<class 'numpy.float32'>, 100000) |

I was not able to get asv to run on native Windows.

(numpy-dev) C:\work\repos\numpy>spin bench -c HEAD~1 -t bench_clip
$ cd benchmarks
$ asv continuous --factor 1.05 --bench bench_clip 5aa66561f991dd40914aebedacf5ee462b8dd560 7ef941d96a2814dc0c806457560d2a996b29515b
Error: [WinError 2] The system cannot find the file specified; aborting.

But the gains (from other testing) were even larger because the main branch code was running slower on native Windows than on WSL2 to begin with (due to poorer codegen for MSVC).
https://godbolt.org/z/TPoax4b9v

Comment on lines 229 to 230
if (x < min_val) x = min_val;
if (x > max_val) x = max_val;
Contributor:
Suggested change:

-if (x < min_val) x = min_val;
-if (x > max_val) x = max_val;
+if (x > max_val) x = max_val;
+else if (x < min_val) x = min_val;

(not sure whether this is really faster or more readable, but maybe worth a try)

@pijyoi (author) replied:

https://godbolt.org/z/MTxasfG33
The version I provided has better generated code. (3 instructions and branchless)

Contributor replied:

Nice tool! It is indeed better.

@eendebakpt (Contributor) commented:

Some small comments, but the implementation and performance gain look good!

@ngoldbaum ngoldbaum added the triage review Issue/PR to be discussed at the next triage meeting label Apr 16, 2024
@mattip (Member) commented Apr 17, 2024

We discussed this at a recent triage meeting. This seems to make sense. Would you get the same speed improvement by using the -O3 macro? If we stay with this specialization, you could also remove the floating point error checks from the other loop.

@pijyoi (author) commented Apr 17, 2024

Would you get the same speed improvement by using the -O3 macro?

The generated code is the same for -O2 and -O3.
-O2 : https://godbolt.org/z/53crsc1hG
-O3 : https://godbolt.org/z/Ys1eMx4Ej

If we stay with this specialization you could also remove the floating point error checks from the other loop.

I take this to mean that "long double" should also use this specialized path. I have added it and updated the asv output above accordingly. I also added int32 and int64 benchmarks for comparison's sake.
It turns out that "long double", which is fp128 on Linux, doesn't get any speedup on my machine.

However "complex floating" types are still using the original path and thus the floating point error checks cannot be removed.

@pijyoi (author) commented Apr 18, 2024

I ran asv on another machine (Intel), and longdouble does get an improvement.
(The machine where it didn't have an improvement was an AMD)

| Change | Before [c2a9a1c] <speedup-float-clip~1> | After [96ed0d2] | Ratio | Benchmark (Parameter) |
|--------|-----------------------------------------|-----------------|-------|-----------------------|
| -      | 308±0.7μs                               | 228±0.5μs       | 0.74  | bench_clip.ClipFloat.time_clip(<class 'numpy.longdouble'>, 100000) |
| -      | 47.5±2μs                                | 25.7±0.5μs      | 0.54  | bench_clip.ClipFloat.time_clip(<class 'numpy.float64'>, 100000) |
| -      | 22.6±0.2μs                              | 12.1±0.1μs      | 0.53  | bench_clip.ClipFloat.time_clip(<class 'numpy.float32'>, 100000) |

@j9ac9k commented Apr 18, 2024

I have an arm64 chip (M2 Max). I just ran the same asv benchmark as above and got the following results:

| Change | Before [c2a9a1c] <speedup-float-clip~1> | After [96ed0d2] | Ratio | Benchmark (Parameter) |
|--------|-----------------------------------------|-----------------|-------|-----------------------|
| -      | 1.13±0.03μs                             | 1.08±0μs        | 0.95  | bench_clip.ClipFloat.time_clip(<class 'numpy.float64'>, 100) |
| -      | 90.1±0.9μs                              | 60.3±0.3μs      | 0.67  | bench_clip.ClipFloat.time_clip(<class 'numpy.float32'>, 100000) |
| -      | 90.6±0.3μs                              | 60.3±0.1μs      | 0.67  | bench_clip.ClipFloat.time_clip(<class 'numpy.float64'>, 100000) |
| -      | 92.2±2μs                                | 60.7±0.4μs      | 0.66  | bench_clip.ClipFloat.time_clip(<class 'numpy.longdouble'>, 100000) |

@seberg (Member) commented Apr 18, 2024

The generated code is the same for -O2 and -O3.

The idea was to effectively compile the whole file with it, to see whether the compiler is then smart enough to do some of those optimizations itself. We also have NPY_GCC_OPT_3, which is a function attribute. Unfortunately, that really only works with GCC.

But maybe it isn't, and to really use it, it seems the ufunc build would need to be split into a default part and a part compiled with -O3 for loops where we consider it worthwhile...

@seberg (Member) commented Apr 18, 2024

Given your godbolt code, you would have to look at something like this:

void clip(char *x, size_t n, float min_val, float max_val, char *res, size_t stride)
{
    while (n--) {
        *(float *)res = _NPY_CLIP(*(float *)x, min_val, max_val);
        res +=stride;
        x +=stride;
    }
}

and I can't make out whether the compiler lifts the NaN check out of the loop for the scalars there. It seems to do some surprisingly heavy auto-vectorization. (I don't think the char * is needed, but it matches the current code more closely.)

@pijyoi (author) commented Apr 18, 2024

So we can compare the following 2 snippets at -O3:
https://godbolt.org/z/f7ofbWdqr

void clip_loop1(const float *input, float *output, size_t nelems, float min_val, float max_val)
{
    for (size_t k=0; k < nelems; k++) {
        output[k] = _NPY_CLIP(input[k], min_val, max_val);
    }
}

// prior to calling this, min_val and max_val have been verified to be non-nan
void clip_loop2(const float *input, float *output, size_t nelems, float min_val, float max_val)
{
    for (size_t k=0; k < nelems; k++) {
        float x = input[k];
        if (x < min_val) x = min_val;
        if (x > max_val) x = max_val;
        output[k] = x;
    }
}

We can then compare the SIMD-ified 4-at-a-time main loops.

Main vectorized loop of clip_loop1:
Large portions of the code are masking and selecting.

.L5:
        movups  xmm3, XMMWORD PTR [rcx+rax]
        movaps  xmm0, xmm6
        movaps  xmm2, xmm6
        movaps  xmm4, xmm3
        cmpltps xmm0, xmm3
        cmpordps        xmm4, xmm3
        pandn   xmm0, xmm4
        andps   xmm2, xmm0
        andnps  xmm0, xmm3
        orps    xmm0, xmm2
        movaps  xmm2, xmm0
        cmpltps xmm2, xmm5
        movdqa  xmm8, xmm2
        pandn   xmm2, xmm4
        pand    xmm8, xmm4
        movaps  xmm4, xmm3
        cmpunordps      xmm4, xmm3
        andps   xmm3, xmm4
        andnps  xmm4, xmm5
        orps    xmm4, xmm3
        movaps  xmm3, xmm5
        andps   xmm3, xmm2
        andnps  xmm2, xmm4
        orps    xmm2, xmm3
        movaps  xmm3, xmm0
        movaps  xmm0, xmm8
        andps   xmm3, xmm8
        andnps  xmm0, xmm2
        orps    xmm0, xmm3
        movups  XMMWORD PTR [rsi+rax], xmm0
        add     rax, 16
        cmp     rax, rdi
        jne     .L5

Main vectorized loop of clip_loop2:
In comparison, this code is simple and looks just like the non-vectorized version.

.L47:
        movups  xmm6, XMMWORD PTR [rcx+rax]
        movaps  xmm3, xmm5
        movaps  xmm2, xmm4
        maxps   xmm3, xmm6
        minps   xmm2, xmm3
        movups  XMMWORD PTR [rsi+rax], xmm2
        add     rax, 16
        cmp     rdi, rax
        jne     .L47

For MSVC, only clip_loop2 gets SIMD-dified.

@ngoldbaum ngoldbaum added triaged Issue/PR that was discussed in a triage meeting and removed triage review Issue/PR to be discussed at the next triage meeting labels Apr 18, 2024
@pijyoi (author) commented Apr 19, 2024

I tried annotating the 3 floating point functions like so. Based on inspection of the disassembly of the compiled object file, this does trigger the SIMD-dification. (For whatever reason, NPY_GCC_OPT_3 didn't work).

NPY_NO_EXPORT void
__attribute__((optimize("O3")))
FLOAT_clip(char **args, npy_intp const *dimensions, npy_intp const *steps,
           void *NPY_UNUSED(func))
{
    ...
}

As reported by asv,

  1. turning on -O3 did not improve the original timings of the main branch
  2. turning on -O3 did not further improve the timings of this PR

@seberg (Member) commented Apr 19, 2024

Thanks for checking; it seems the float special path is necessary for lifting the NaN handling (at least I presume that is the mechanism here). Also nice to know that -O3 doesn't actually help with the loop at all :).
One might still wonder whether we can integrate that without having to duplicate the code, but I am not sure, and I am OK with it as is.

@pijyoi (author) commented Apr 20, 2024

I got spin bench to run on native Windows (same AMD machine, so the timings can be compared directly against the results in the top-most post).
As can be seen, main branch clip timings are much worse on Windows (MSVC) than on Linux (GCC), by almost a factor of 4 for float32.
This PR makes the float32 and float64 timings equivalent on MSVC and GCC.

| Change | Before [c2a9a1c] <speedup-float-clip~1> | After [96ed0d2] | Ratio | Benchmark (Parameter) |
|--------|-----------------------------------------|-----------------|-------|-----------------------|
| -      | 16.4±1μs                                | 14.5±0.2μs      | 0.88  | bench_clip.ClipInteger.time_clip(<class 'numpy.int32'>, 100000) |
| -      | 125±3μs                                 | 29.7±0.4μs      | 0.24  | bench_clip.ClipFloat.time_clip(<class 'numpy.float64'>, 100000) |
| -      | 128±3μs                                 | 30.7±0.6μs      | 0.24  | bench_clip.ClipFloat.time_clip(<class 'numpy.longdouble'>, 100000) |
| -      | 124±6μs                                 | 18.3±0.8μs      | 0.15  | bench_clip.ClipFloat.time_clip(<class 'numpy.float32'>, 100000) |

@seberg (Member) commented Apr 20, 2024

clip timings are much worse on Windows

At some point MSVC had a bug where it failed to inline isnan; I guess it is still there. This PR pulls the NaN check out of the loop, so that failure to optimize wouldn't be as big of a hit.

@pijyoi (author) commented Apr 20, 2024

You would be referring to #20134. But in-lining isnan has been in place since VS 2019, which can be observed conveniently with godbolt.

@seberg (Member) commented Apr 24, 2024

FWIW, clipping floats is about the one case that is really useful, so a good speedup there seems worth duplicating the code in a simple way. Or does anyone disagree?

It would be nice if someone more familiar with these tags could chime in, just in case there is a nicer way to do the same thing; but if nobody does, we should maybe just move on soon.

The npy_intp is1 = steps[0] / sizeof(T), os1 = steps[3] / sizeof(T); line seems wrong, but maybe it is better to follow up on that, either way.

@pijyoi (author) commented Apr 27, 2024

I have somewhat reduced the amount of code duplication.

@pijyoi (author) commented Apr 28, 2024

The npy_intp is1 = steps[0] / sizeof(T), os1 = steps[3] / sizeof(T); line seems wrong, but maybe it is better to follow up on that, either way.

Based on the discussion in #26341, I have changed the pointers to char* and just increment the pointers by the byte step size.

I have also verified that keeping the "contiguous" code block is necessary to preserve the performance gain.
So the current logic is that better performance is obtained when the step size happens to equal sizeof(T), which should be the common case.
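The contiguous-vs-strided dispatch described above can be modeled roughly in Python (a hypothetical sketch of the control flow only; the actual code is the C++ loop in clip.cpp operating on char* pointers):

```python
import struct

def clip_loop(buf, nelems, step, vmin, vmax):
    """Model of a strided clip loop over a bytes buffer of float64 values.

    `step` is the byte stride between elements; the contiguous fast path
    applies when it equals the element size (8 bytes for float64).
    """
    itemsize = 8
    out = bytearray(buf)
    if step == itemsize:
        # contiguous fast path: elements are back to back, which the
        # compiler can vectorize well
        for k in range(nelems):
            (x,) = struct.unpack_from("d", buf, k * itemsize)
            if x < vmin: x = vmin
            if x > vmax: x = vmax
            struct.pack_into("d", out, k * itemsize, x)
    else:
        # generic strided path: advance the byte offset by `step`
        off = 0
        for _ in range(nelems):
            (x,) = struct.unpack_from("d", buf, off)
            if x < vmin: x = vmin
            if x > vmax: x = vmax
            struct.pack_into("d", out, off, x)
            off += step
    return bytes(out)
```

Both paths compute the same result; the split exists only so the common contiguous case gets the simpler, vectorizable loop body.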

@pijyoi (author) commented May 2, 2024

There's a subtlety that results in different output when considering negative zeros: a negative zero compares equal to a positive zero.

Assuming all operands are non-nan for the sake of this discussion,

The main branch implements:
(When x equals the boundary value, the boundary value is used)

x = x > vmin ? x : vmin;
x = x < vmax ? x : vmax;
>>> np.clip(np.array([-0.0, +0.0]), 0, 1)
array([0., 0.])

Whereas this PR implements:
(When x equals the boundary value, x is used)

x = x < vmin ? vmin : x;
x = x > vmax ? vmax : x;
>>> np.clip(np.array([-0.0, +0.0]), 0, 1)
array([-0.,  0.])

Note that the current documentation of np.clip (wrongly) claims to implement the behavior of this PR

Returns: An array with the elements of a, but where values < a_min are replaced with a_min, and those > a_max with a_max.

I have done some limited testing, and to emulate the behavior of the main branch would incur a loss in performance.
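The divergence above can be reproduced with plain Python floats, since -0.0 == 0.0 there as well (a small illustration of the two formulations, assuming non-NaN operands as in the discussion):

```python
import math

def clip_main(x, vmin, vmax):
    # main-branch form: on equality, the boundary value wins
    x = x if x > vmin else vmin
    x = x if x < vmax else vmax
    return x

def clip_pr(x, vmin, vmax):
    # PR form: on equality, the input value wins
    x = vmin if x < vmin else x
    x = vmax if x > vmax else x
    return x

# -0.0 compares equal to +0.0, so the two forms pick different zeros;
# copysign exposes the sign that == cannot see:
print(math.copysign(1.0, clip_main(-0.0, 0.0, 1.0)))  # -> 1.0  (+0.0 returned)
print(math.copysign(1.0, clip_pr(-0.0, 0.0, 1.0)))    # -> -1.0 (-0.0 returned)
```

For any input not exactly equal to a boundary value, the two forms return identical results.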

@seberg (Member) left a review comment:

There's a subtlety that results in different output when considering negative zeros

Thanks for thinking of that subtlety! Since not even C99 fmin/fmax guarantees handling of signed zeros (and we can't use those anyway because they don't propagate NaNs), I am happy to ignore it.
(min/max also don't handle it specially, I think. I.e. it might be nice if we had min(-0, 0) == -0 and max(-0, 0) == 0, but C99 doesn't make it easy and I am not sure hardware normally supports it quickly.)

I suspect you could restore the old behavior by changing the checks to <= and >= and that would only make a difference if you have a lot of boundary values?
But honestly, it doesn't matter.


In either case, I am not sure I like having a new bench file for this, but it seems OK. Some nitpicks: putting the stride branch earlier is nice if the code is more complex, but here it is so simple that I am not sure it is worthwhile, so just mentioning it.

Thanks @pijyoi! I am happy with putting this in modulo a few nitpicks like the static inline and putting braces.

        self.dataout = np.full_like(self.array, 128)

    def time_clip(self, dtype, size):
        np.clip(self.array, 32, 224, self.dataout)
@seberg (Member) replied:

I don't love having a file just for clip, but the bench_ufuncs.py is a weird monster and I am not sure clip can be integrated with the parametrized tests either. So OK with me.

@pijyoi (author) commented May 2, 2024

I suspect you could restore the old behavior by changing the checks to <= and >= and that would only make a difference if you have a lot of boundary values?

Now this is where things differ between compilers.

With GCC (and clang) (https://godbolt.org/z/xxe3MGj5a), more assembler instructions are generated. This hurts performance, but the generated code handles the negative zero correctly, i.e. np.clip(np.array([-0.0, 0.0]), 0, 1) -> array([0., 0.]).

With MSVC (and the Intel compiler) (https://godbolt.org/z/neb3T54cG), the same assembler instructions are generated for <= as for <, so performance stays the same. But the generated code doesn't handle the negative zero correctly, i.e. np.clip(np.array([-0.0, 0.0]), 0, 1) -> array([-0., 0.]). Or rather, the compiler treats negative zero the same as positive zero, and hence generates the same code.

But honestly, it doesn't matter.

I guess not, since

>>> np.sign(np.array([-0.0, 0.0]))
array([0., 0.])

@seberg (Member) commented May 2, 2024

np.sign(np.array([-0.0, 0.0]))

Well, you would have to use np.signbit, although I guess we could make np.sign return -0. also.
I would care about it if I thought it was viable, but as you found nicely, MSVC seems sloppy here, and the available hardware instructions mean you probably end up with bad performance trade-offs if you want to get it right...

Now this is where things differ between compilers ...

Hah, interesting. Too much info about instructions: so there are maxss and minss instructions available on x64, and they propagate the second operand when either operand is NaN or when the operands are equal (i.e. for signed 0s). So using >= isn't compatible with that: we have to propagate the input on NaN, and that means we must also propagate the input if equal...
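The maxss/minss semantics described above can be mimicked in Python to see why they line up with the PR's < and > checks (an illustrative model, assuming the second-operand-propagation behavior described in the comment):

```python
import math

def maxss(a, b):
    # model of x86 maxss: a > b ? a : b -- the second operand b is
    # returned whenever either input is NaN or the inputs compare equal
    return a if a > b else b

def minss(a, b):
    # model of x86 minss: a < b ? a : b -- same second-operand rule
    return a if a < b else b

def clip_pr(x, vmin, vmax):
    # the PR's branches map directly onto these instructions with x as
    # the second operand, so a NaN (or signed-zero) x propagates for free
    return minss(vmax, maxss(vmin, x))

print(clip_pr(0.5, 0.0, 1.0))                        # -> 0.5
print(clip_pr(float("nan"), 0.0, 1.0))               # -> nan
print(math.copysign(1.0, clip_pr(-0.0, 0.0, 1.0)))   # -> -1.0: equal case keeps x
```

A >= / <= variant would need the boundary value returned on equality but the input returned on NaN, which this single-instruction pattern cannot express.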

@seberg (Member) commented May 2, 2024

Thanks @pijyoi, also for fixing that striding bug; let's give this a shot. If someone has an idea of how to clean up the code a bit we can always follow up, plus I suspect that eventually this code will be changed a lot anyway (because it is based on using tags, and we rely on C++17 now, so it should be possible to do this more simply).

@seberg seberg merged commit aa0cc04 into numpy:main May 2, 2024
65 checks passed
@pijyoi pijyoi deleted the speedup-float-clip branch May 2, 2024 15:21
Labels: 01 - Enhancement · triaged (Issue/PR that was discussed in a triage meeting)
6 participants