
ENH: Speedup clip for floating point #26280

Merged · 2 commits merged into numpy:main on May 2, 2024
Conversation

@pijyoi (Contributor) commented Apr 15, 2024

np.clip with floating point values is slower than it should be because it has to check for and propagate NaNs.
This PR speeds up the operation by:

  1. checking that vmin and vmax are not NaN, only once.
  2. writing the clipping operation such that any NaNs in the input data propagate naturally, without an explicit check.
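Point 2 relies on IEEE-754 comparison semantics: any ordered comparison involving NaN is false. A minimal Python model of the trick (illustrative only, not the actual C loop in clip.cpp):

```python
import math

def clip_fast(x, vmin, vmax):
    # vmin/vmax have been pre-checked to be non-NaN, so only x can be NaN.
    # A comparison involving NaN is always false, so a NaN x skips both
    # branches and propagates to the output unchanged.
    if x < vmin:
        x = vmin
    if x > vmax:
        x = vmax
    return x

print(clip_fast(5.0, 0.0, 1.0))           # -> 1.0
print(clip_fast(float("nan"), 0.0, 1.0))  # -> nan, with no isnan() call
```

Because neither branch fires for NaN input, the per-element isnan check of the original loop becomes unnecessary.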

Sample output of asv on my laptop (WSL2 Ubuntu 22.04)

· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For numpy commit bb069125 <speedup-float-clip~1> (round 1/2):
[ 0.00%] ·· Building for virtualenv-py3.10-Cython-build-packaging.....................................................
[ 0.00%] ·· Benchmarking virtualenv-py3.10-Cython-build-packaging
[12.50%] ··· Running (bench_clip.ClipFloat.time_clip--)..
[25.00%] · For numpy commit 3e1b55dc <speedup-float-clip> (round 1/2):
[25.00%] ·· Building for virtualenv-py3.10-Cython-build-packaging.
[25.00%] ·· Benchmarking virtualenv-py3.10-Cython-build-packaging
[37.50%] ··· Running (bench_clip.ClipFloat.time_clip--)..
[50.00%] · For numpy commit 3e1b55dc <speedup-float-clip> (round 2/2):
[50.00%] ·· Benchmarking virtualenv-py3.10-Cython-build-packaging
[62.50%] ··· bench_clip.ClipFloat.time_clip                                                                                        ok
[62.50%] ··· ================== ============= ============
             --                            size
             ------------------ --------------------------
                   dtype             100         100000
             ================== ============= ============
               numpy.float32     2.22±0.01μs   16.3±0.2μs
               numpy.float64     1.97±0.09μs    32.3±2μs
              numpy.longdouble    3.05±0.2μs    384±2μs
             ================== ============= ============

[75.00%] ··· bench_clip.ClipInteger.time_clip                                                                                      ok
[75.00%] ··· ============= ============= ============
             --                       size
             ------------- --------------------------
                 dtype          100         100000
             ============= ============= ============
              numpy.int32   2.27±0.02μs   25.2±0.3μs
              numpy.int64   2.00±0.01μs    76.7±1μs
             ============= ============= ============

[75.00%] · For numpy commit bb069125 <speedup-float-clip~1> (round 2/2):
[75.00%] ·· Building for virtualenv-py3.10-Cython-build-packaging.
[75.00%] ·· Benchmarking virtualenv-py3.10-Cython-build-packaging
[87.50%] ··· bench_clip.ClipFloat.time_clip                                                                                        ok
[87.50%] ··· ================== ============= ============
             --                            size
             ------------------ --------------------------
                   dtype             100         100000
             ================== ============= ============
               numpy.float32     2.25±0.01μs   33.0±0.6μs
               numpy.float64     1.98±0.03μs    63.8±1μs
              numpy.longdouble   2.96±0.02μs    383±3μs
             ================== ============= ============

[100.00%] ··· bench_clip.ClipInteger.time_clip                                                                                      ok
[100.00%] ··· ============= ============= ============
              --                       size
              ------------- --------------------------
                  dtype          100         100000
              ============= ============= ============
               numpy.int32   2.28±0.01μs   24.9±0.1μs
               numpy.int64   2.01±0.03μs    75.4±2μs
              ============= ============= ============

| Change   | Before [bb069125] <speedup-float-clip~1>   | After [3e1b55dc] <speedup-float-clip>   |   Ratio | Benchmark (Parameter)                                           |
|----------|--------------------------------------------|-----------------------------------------|---------|-----------------------------------------------------------------|
| -        | 63.8±1μs                                   | 32.3±2μs                                |    0.51 | bench_clip.ClipFloat.time_clip(<class 'numpy.float64'>, 100000) |
| -        | 33.0±0.6μs                                 | 16.3±0.2μs                              |    0.49 | bench_clip.ClipFloat.time_clip(<class 'numpy.float32'>, 100000) |

I was not able to get asv to run on native Windows.

(numpy-dev) C:\work\repos\numpy>spin bench -c HEAD~1 -t bench_clip
$ cd benchmarks
$ asv continuous --factor 1.05 --bench bench_clip 5aa66561f991dd40914aebedacf5ee462b8dd560 7ef941d96a2814dc0c806457560d2a996b29515b
Error: [WinError 2] The system cannot find the file specified; aborting.

But the gains (from other testing) were even larger because the main branch code was running slower on native Windows than on WSL2 to begin with (due to poorer codegen for MSVC).
https://godbolt.org/z/TPoax4b9v

Comment on lines 229 to 230
if (x < min_val) x = min_val;
if (x > max_val) x = max_val;
Contributor:
Suggested change:

-if (x < min_val) x = min_val;
-if (x > max_val) x = max_val;
+if (x > max_val) x = max_val;
+else if (x < min_val) x = min_val;

(not sure whether this is really faster or more readable, but maybe worth a try)

@pijyoi (author) replied:

https://godbolt.org/z/MTxasfG33
The version I provided has better generated code. (3 instructions and branchless)

Contributor replied:

Nice tool! It is indeed better.

@eendebakpt (Contributor) commented:

Some small comments, but the implementation and performance gain look good!

@ngoldbaum ngoldbaum added the triage review Issue/PR to be discussed at the next triage meeting label Apr 16, 2024
@mattip (Member) commented Apr 17, 2024

We discussed this at a recent triage meeting. This seems to make sense. Would you get the same speed improvement by using the -O3 macro? If we stay with this specialization, you could also remove the floating point error checks from the other loop.

@pijyoi (author) commented Apr 17, 2024

Would you get the same speed improvement by using the -O3 macro?

The generated code is the same for -O2 and -O3.
-O2 : https://godbolt.org/z/53crsc1hG
-O3 : https://godbolt.org/z/Ys1eMx4Ej

If we stay with this specialization you could also remove the floating point error checks from the other loop.

I take this to mean that "long double" should also use this specialized path. I have added it and updated the asv output above accordingly. I also added int32 and int64 benchmarks for comparison's sake.
It turns out that "long double", which is fp128 on Linux, doesn't get any speedup on my machine.

However "complex floating" types are still using the original path and thus the floating point error checks cannot be removed.

@pijyoi (author) commented Apr 18, 2024

I ran asv on another machine (Intel), and longdouble does get an improvement.
(The machine where it didn't have an improvement was an AMD)

| Change | Before [c2a9a1c] <speedup-float-clip~1> | After [96ed0d2] | Ratio | Benchmark (Parameter) |
|--------|-----------------------------------------|-----------------|-------|-----------------------|
| -      | 308±0.7μs                               | 228±0.5μs       | 0.74  | bench_clip.ClipFloat.time_clip(<class 'numpy.longdouble'>, 100000) |
| -      | 47.5±2μs                                | 25.7±0.5μs      | 0.54  | bench_clip.ClipFloat.time_clip(<class 'numpy.float64'>, 100000) |
| -      | 22.6±0.2μs                              | 12.1±0.1μs      | 0.53  | bench_clip.ClipFloat.time_clip(<class 'numpy.float32'>, 100000) |

@j9ac9k commented Apr 18, 2024

I have an arm64 chip (M2 Max). I just ran the same asv benchmark as above and got the following results:

| Change | Before [c2a9a1c] <speedup-float-clip~1> | After [96ed0d2] | Ratio | Benchmark (Parameter) |
|--------|-----------------------------------------|-----------------|-------|-----------------------|
| -      | 1.13±0.03μs                             | 1.08±0μs        | 0.95  | bench_clip.ClipFloat.time_clip(<class 'numpy.float64'>, 100) |
| -      | 90.1±0.9μs                              | 60.3±0.3μs      | 0.67  | bench_clip.ClipFloat.time_clip(<class 'numpy.float32'>, 100000) |
| -      | 90.6±0.3μs                              | 60.3±0.1μs      | 0.67  | bench_clip.ClipFloat.time_clip(<class 'numpy.float64'>, 100000) |
| -      | 92.2±2μs                                | 60.7±0.4μs      | 0.66  | bench_clip.ClipFloat.time_clip(<class 'numpy.longdouble'>, 100000) |

@seberg (Member) commented Apr 18, 2024

The generated code is the same for -O2 and -O3.

The idea was to effectively compile the whole file with it, to see whether the compiler is then smart enough to do some of those optimizations itself. We also have NPY_GCC_OPT_3, which is a function attribute. Unfortunately, that really only works with GCC.

But maybe it isn't, and to really use it, it seems the ufunc build would need to be split into a default part and a part compiled with -O3 for loops where we consider it worthwhile...

@seberg (Member) commented Apr 18, 2024

Given your godbolt code, you would have to look at something like this:

void clip(char *x, size_t n, float min_val, float max_val, char *res, size_t stride)
{
    while (n--) {
        *(float *)res = _NPY_CLIP(*(float *)x, min_val, max_val);
        res +=stride;
        x +=stride;
    }
}

and I can't make out whether the compiler lifts the NaN check out of the loop for the scalars there. It seems to do some surprisingly heavy auto-vectorization. (I don't think the char * is needed, but it matches the current code more closely.)

@pijyoi (author) commented Apr 18, 2024

So we can compare the following 2 snippets at -O3:
https://godbolt.org/z/f7ofbWdqr

void clip_loop1(const float *input, float *output, size_t nelems, float min_val, float max_val)
{
    for (size_t k=0; k < nelems; k++) {
        output[k] = _NPY_CLIP(input[k], min_val, max_val);
    }
}

// prior to calling this, min_val and max_val have been verified to be non-nan
void clip_loop2(const float *input, float *output, size_t nelems, float min_val, float max_val)
{
    for (size_t k=0; k < nelems; k++) {
        float x = input[k];
        if (x < min_val) x = min_val;
        if (x > max_val) x = max_val;
        output[k] = x;
    }
}

We can then compare the SIMD-ified 4-at-a-time main loops.

Main vectorized loop of clip_loop1:
Large portions of the code are masking and selecting.

.L5:
        movups  xmm3, XMMWORD PTR [rcx+rax]
        movaps  xmm0, xmm6
        movaps  xmm2, xmm6
        movaps  xmm4, xmm3
        cmpltps xmm0, xmm3
        cmpordps        xmm4, xmm3
        pandn   xmm0, xmm4
        andps   xmm2, xmm0
        andnps  xmm0, xmm3
        orps    xmm0, xmm2
        movaps  xmm2, xmm0
        cmpltps xmm2, xmm5
        movdqa  xmm8, xmm2
        pandn   xmm2, xmm4
        pand    xmm8, xmm4
        movaps  xmm4, xmm3
        cmpunordps      xmm4, xmm3
        andps   xmm3, xmm4
        andnps  xmm4, xmm5
        orps    xmm4, xmm3
        movaps  xmm3, xmm5
        andps   xmm3, xmm2
        andnps  xmm2, xmm4
        orps    xmm2, xmm3
        movaps  xmm3, xmm0
        movaps  xmm0, xmm8
        andps   xmm3, xmm8
        andnps  xmm0, xmm2
        orps    xmm0, xmm3
        movups  XMMWORD PTR [rsi+rax], xmm0
        add     rax, 16
        cmp     rax, rdi
        jne     .L5

Main vectorized loop of clip_loop2:
In comparison, this code is simple and looks just like the non-vectorized version.

.L47:
        movups  xmm6, XMMWORD PTR [rcx+rax]
        movaps  xmm3, xmm5
        movaps  xmm2, xmm4
        maxps   xmm3, xmm6
        minps   xmm2, xmm3
        movups  XMMWORD PTR [rsi+rax], xmm2
        add     rax, 16
        cmp     rdi, rax
        jne     .L47

For MSVC, only clip_loop2 gets SIMD-dified.

@ngoldbaum ngoldbaum added triaged Issue/PR that was discussed in a triage meeting and removed triage review Issue/PR to be discussed at the next triage meeting labels Apr 18, 2024
@pijyoi (author) commented Apr 19, 2024

I tried annotating the 3 floating point functions like so. Based on inspection of the disassembly of the compiled object file, this does trigger the SIMD-dification. (For whatever reason, NPY_GCC_OPT_3 didn't work).

NPY_NO_EXPORT void
__attribute__((optimize("O3")))
FLOAT_clip(char **args, npy_intp const *dimensions, npy_intp const *steps,
           void *NPY_UNUSED(func))
{
    ...
}

As reported by asv,

  1. turning on -O3 did not improve the original timings of the main branch
  2. turning on -O3 did not further improve the timings of this PR

@seberg (Member) commented Apr 19, 2024

Thanks for checking; it seems the float special path is necessary for lifting the NaN handling (at least I presume that is the mechanism here). Also nice to know that -O3 doesn't actually help with the loop at all :).
One might still wonder whether we can integrate that without having to duplicate the code, but I am not sure, and I am OK with it as is.

@pijyoi (author) commented Apr 20, 2024

I got spin bench to run on native Windows (same AMD machine, so the timings can be compared directly against the results in the top-most post).
As can be seen, main branch clip timings are much worse on Windows (MSVC) than on Linux (GCC), by almost a factor of 4 for float32.
This PR makes the float32 and float64 timings equivalent on MSVC and GCC.

| Change | Before [c2a9a1c] <speedup-float-clip~1> | After [96ed0d2] | Ratio | Benchmark (Parameter) |
|--------|-----------------------------------------|-----------------|-------|-----------------------|
| -      | 16.4±1μs                                | 14.5±0.2μs      | 0.88  | bench_clip.ClipInteger.time_clip(<class 'numpy.int32'>, 100000) |
| -      | 125±3μs                                 | 29.7±0.4μs      | 0.24  | bench_clip.ClipFloat.time_clip(<class 'numpy.float64'>, 100000) |
| -      | 128±3μs                                 | 30.7±0.6μs      | 0.24  | bench_clip.ClipFloat.time_clip(<class 'numpy.longdouble'>, 100000) |
| -      | 124±6μs                                 | 18.3±0.8μs      | 0.15  | bench_clip.ClipFloat.time_clip(<class 'numpy.float32'>, 100000) |

@seberg (Member) commented Apr 20, 2024

clip timings are much worse on Windows

At some point MSVC had a bug where it failed to inline isnan; I guess it is still there. This PR pulls the NaN check out of the loop, so that failure to optimize wouldn't be as big of a hit.

@pijyoi (author) commented Apr 20, 2024

You would be referring to #20134. But in-lining isnan has been in place since VS 2019, which can be observed conveniently with godbolt.

@seberg (Member) commented Apr 24, 2024

FWIW, clipping floats is about the one case that is really useful, so a good speedup there seems worth duplicating the code in a simple way. Or does anyone disagree?

It would be nice if someone more familiar with these tags could chime in, just in case there is a nicer way to do the same thing; but if nobody does, we should maybe just move on soon.

The npy_intp is1 = steps[0] / sizeof(T), os1 = steps[3] / sizeof(T); line seems wrong, but maybe it is better to follow up on that, either way.

@pijyoi (author) commented Apr 27, 2024

I have somewhat reduced the amount of code duplication.

@pijyoi (author) commented Apr 28, 2024

The npy_intp is1 = steps[0] / sizeof(T), os1 = steps[3] / sizeof(T); line seems wrong, but maybe it is better to follow up on that, either way.

Based on the discussion in #26341, I have changed the pointers to char* and just increment the pointers by the byte step size.

I have also verified that keeping the "contiguous" code block is necessary to preserve the performance gain.
So the current logic is that better performance is obtained when the step size happens to equal sizeof(T), which should be the common case.
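The contiguous-vs-strided dispatch described above can be modeled roughly in Python (a hypothetical sketch of the control flow only; the actual code is the C++ loop in clip.cpp operating on char* pointers):

```python
import struct

def clip_loop(buf, nelems, step, vmin, vmax):
    """Model of a strided clip loop over a bytes buffer of float64 values.

    `step` is the byte stride between elements; the contiguous fast path
    applies when it equals the element size (8 bytes for float64).
    """
    itemsize = 8
    out = bytearray(buf)
    if step == itemsize:
        # contiguous fast path: elements are back to back, which the
        # compiler can vectorize well
        for k in range(nelems):
            (x,) = struct.unpack_from("d", buf, k * itemsize)
            if x < vmin: x = vmin
            if x > vmax: x = vmax
            struct.pack_into("d", out, k * itemsize, x)
    else:
        # generic strided path: advance the byte offset by `step`
        off = 0
        for _ in range(nelems):
            (x,) = struct.unpack_from("d", buf, off)
            if x < vmin: x = vmin
            if x > vmax: x = vmax
            struct.pack_into("d", out, off, x)
            off += step
    return bytes(out)
```

Both paths compute the same result; the split exists only so the common contiguous case gets the simpler, vectorizable loop body.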

@pijyoi (author) commented May 2, 2024

There's a subtlety that results in different output when considering negative zeros: a negative zero compares equal to a positive zero.

Assuming all operands are non-nan for the sake of this discussion,

The main branch implements:
(When x equals the boundary value, the boundary value is used)

x = x > vmin ? x : vmin;
x = x < vmax ? x : vmax;
>>> np.clip(np.array([-0.0, +0.0]), 0, 1)
array([0., 0.])

Whereas this PR implements:
(When x equals the boundary value, x is used)

x = x < vmin ? vmin : x;
x = x > vmax ? vmax : x;
>>> np.clip(np.array([-0.0, +0.0]), 0, 1)
array([-0.,  0.])

Note that the current documentation of np.clip (wrongly) claims to implement the behavior of this PR

Returns: An array with the elements of a, but where values < a_min are replaced with a_min, and those > a_max with a_max.

I have done some limited testing, and to emulate the behavior of the main branch would incur a loss in performance.
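The divergence above can be reproduced with plain Python floats, since -0.0 == 0.0 there as well (a small illustration of the two formulations, assuming non-NaN operands as in the discussion):

```python
import math

def clip_main(x, vmin, vmax):
    # main-branch form: on equality, the boundary value wins
    x = x if x > vmin else vmin
    x = x if x < vmax else vmax
    return x

def clip_pr(x, vmin, vmax):
    # PR form: on equality, the input value wins
    x = vmin if x < vmin else x
    x = vmax if x > vmax else x
    return x

# -0.0 compares equal to +0.0, so the two forms pick different zeros;
# copysign exposes the sign that == cannot see:
print(math.copysign(1.0, clip_main(-0.0, 0.0, 1.0)))  # -> 1.0  (+0.0 returned)
print(math.copysign(1.0, clip_pr(-0.0, 0.0, 1.0)))    # -> -1.0 (-0.0 returned)
```

For any input not exactly equal to a boundary value, the two forms return identical results.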

@seberg (Member) left a review comment:

There's a subtlety that results in different output when considering negative zeros

Thanks for thinking of that subtlety! Since not even C99 fmin/fmax guarantees handling of signed zeros (and we can't use those anyway because they don't propagate NaNs), I am happy to ignore it.
(min/max also don't handle it specially, I think. I.e. it might be nice if we had min(-0, 0) == -0 and max(-0, 0) == 0, but C99 doesn't make it easy and I am not sure hardware normally supports it quickly.)

I suspect you could restore the old behavior by changing the checks to <= and >= and that would only make a difference if you have a lot of boundary values?
But honestly, it doesn't matter.


In either case, I am not sure I like having a new bench file for this, but it seems OK. Some nitpicks: putting the stride branch earlier is nice if the code is more complex, but here it is so simple that I am not sure it is worthwhile, so just mentioning it.

Thanks @pijyoi! I am happy with putting this in modulo a few nitpicks like the static inline and putting braces.

        self.dataout = np.full_like(self.array, 128)

    def time_clip(self, dtype, size):
        np.clip(self.array, 32, 224, self.dataout)
@seberg (Member) replied:

I don't love having a file just for clip, but the bench_ufuncs.py is a weird monster and I am not sure clip can be integrated with the parametrized tests either. So OK with me.

@pijyoi (author) commented May 2, 2024

I suspect you could restore the old behavior by changing the checks to <= and >= and that would only make a difference if you have a lot of boundary values?

Now this is where things differ between compilers.

With GCC (and clang) (https://godbolt.org/z/xxe3MGj5a), more assembler instructions are generated. This hurts performance, but the generated code handles the negative zero correctly, i.e. np.clip(np.array([-0.0, 0.0]), 0, 1) -> array([0., 0.]).

With MSVC (and the Intel compiler) (https://godbolt.org/z/neb3T54cG), the same assembler instructions are generated for <= as for <, so performance stays the same. But the generated code doesn't handle the negative zero correctly, i.e. np.clip(np.array([-0.0, 0.0]), 0, 1) -> array([-0., 0.]). Or rather, the compiler treats negative zero the same as positive zero, and hence generates the same code.

But honestly, it doesn't matter.

I guess not, since

>>> np.sign(np.array([-0.0, 0.0]))
array([0., 0.])

@seberg (Member) commented May 2, 2024

np.sign(np.array([-0.0, 0.0]))

Well, you would have to use np.signbit, although I guess we could make np.sign return -0. also.
I would care about it if I thought it was viable, but as you found nicely, MSVC seems sloppy here, and the available hardware instructions mean you probably end up with bad performance trade-offs if you want to get it right...

Now this is where things differ between compilers ...

Hah, interesting. Too much info about instructions: so there are maxss and minss instructions available on x64, and they propagate the second operand when either operand is NaN or when the operands are equal (i.e. for signed 0s). So using >= isn't compatible with that: we have to propagate the input on NaN, and that means we must also propagate the input if equal...
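The maxss/minss semantics described above can be mimicked in Python to see why they line up with the PR's < and > checks (an illustrative model, assuming the second-operand-propagation behavior described in the comment):

```python
import math

def maxss(a, b):
    # model of x86 maxss: a > b ? a : b -- the second operand b is
    # returned whenever either input is NaN or the inputs compare equal
    return a if a > b else b

def minss(a, b):
    # model of x86 minss: a < b ? a : b -- same second-operand rule
    return a if a < b else b

def clip_pr(x, vmin, vmax):
    # the PR's branches map directly onto these instructions with x as
    # the second operand, so a NaN (or signed-zero) x propagates for free
    return minss(vmax, maxss(vmin, x))

print(clip_pr(0.5, 0.0, 1.0))                        # -> 0.5
print(clip_pr(float("nan"), 0.0, 1.0))               # -> nan
print(math.copysign(1.0, clip_pr(-0.0, 0.0, 1.0)))   # -> -1.0: equal case keeps x
```

A >= / <= variant would need the boundary value returned on equality but the input returned on NaN, which this single-instruction pattern cannot express.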

@seberg (Member) commented May 2, 2024

Thanks @pijyoi, also for fixing that striding bug; let's give this a shot. If someone has an idea of how to clean up the code a bit we can always follow up, plus I suspect that eventually this code will be changed a lot anyway (because it is based on using tags, and we rely on C++17 now, so it should be possible to do this more simply).

@seberg seberg merged commit aa0cc04 into numpy:main May 2, 2024
65 checks passed
@pijyoi pijyoi deleted the speedup-float-clip branch May 2, 2024 15:21
Labels: 01 - Enhancement · triaged (Issue/PR that was discussed in a triage meeting)
6 participants