Skip to content

Commit

Permalink
Resolve divide by zero in reciprocal numpy#18555
Browse files Browse the repository at this point in the history
clang has an optimization bug where a vector that is only partially loaded / stored will get narrowed to only the lanes used, which can be fine in some cases. However, in numpy's `reciprocal` function a partial load is explicitly extended to the full width of the register (filled with '1's) to avoid divide-by-zero. clang's optimization ignores the explicit filling with '1's.

The changes here insert a dummy `volatile` variable. This convinces clang not to narrow the load and ignore the explicit filling of '1's.

`volatile` can be expensive since it forces loads / stores to refresh contents whenever the variable is used. numpy has its own template / macro system that'll expand the loop function below for sqrt, absolute, square, and reciprocal. Additionally, the loop can be called on a full array if there's overlap between src and dst. Consequently, we try to limit the scope that we need to apply `volatile`. Intention is it should only be needed when compiling with clang, against Apple arm64, and only for the `reciprocal` function. Moreover, `volatile` is only needed when a vector is partially loaded.

Testing:
Beyond fixing the cases mentioned in the GitHub issue, the changes here also resolve several failures in numpy's test suite.

Before:
```
FAILED numpy/core/tests/test_scalarmath.py::TestBaseMath::test_blocked - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_ufunc.py::TestUfuncGenericLoops::test_unary_PyUFunc_O_O_method_full[reciprocal] - AssertionError: FloatingPointError not raised
FAILED numpy/core/tests/test_umath.py::TestPower::test_power_float - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
FAILED numpy/core/tests/test_umath.py::TestAVXUfuncs::test_avx_based_ufunc - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormDouble::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormSingle::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormInt64::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
8 failed, 14759 passed, 204 skipped, 1268 deselected, 34 xfailed in 69.90s (0:01:09)
```

After:
```
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
1 failed, 14766 passed, 204 skipped, 1268 deselected, 34 xfailed in 70.37s (0:01:10)
```
  • Loading branch information
Developer-Ecosystem-Engineering committed Sep 22, 2021
1 parent 24c4436 commit 7227b49
Showing 1 changed file with 34 additions and 0 deletions.
34 changes: 34 additions & 0 deletions numpy/core/src/umath/loops_unary_fp.dispatch.c.src
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,25 @@ NPY_FINLINE double c_square_f64(double a)
*/
#define CONTIG 0
#define NCONTIG 1

/*
* clang has a bug on at least v13 and prior. The bug is present at -O1 or
* greater. When partially loading a NEON register for a reciprocal operation,
* the remaining elements are set to 1 to avoid divide-by-zero. The partial
* load is paired with a partial store after the reciprocal operation. clang
* notices that the entire NEON register is not needed for the store and
* optimizes out the fill of 1 to the remaining elements. This causes a
* divide-by-zero error that we were trying to avoid by filling.
*
* Using a dummy variable marked 'volatile' convinces clang not to ignore
* the explicit fill of remaining elements.
*/
#if defined(__clang__) && defined(__APPLE__) && defined(__arm64__)
#define WORKAROUND_CLANG_ARM64_RECIPROCAL_BUG 1
#else
#define WORKAROUND_CLANG_ARM64_RECIPROCAL_BUG 0
#endif

/**begin repeat
* #TYPE = FLOAT, DOUBLE#
* #sfx = f32, f64#
Expand All @@ -87,6 +106,7 @@ NPY_FINLINE double c_square_f64(double a)
* #kind = sqrt, absolute, square, reciprocal#
* #intr = sqrt, abs, square, recip#
* #repl_0w1 = 0, 0, 0, 1#
* #RECIP_WORKAROUND = 0, 0, 0, WORKAROUND_CLANG_ARM64_RECIPROCAL_BUG#
*/
/**begin repeat2
* #STYPE = CONTIG, NCONTIG, CONTIG, NCONTIG#
Expand Down Expand Up @@ -126,6 +146,7 @@ static void simd_@TYPE@_@kind@_@STYPE@_@DTYPE@
#endif
/**end repeat3**/
}

for (; len > 0; len -= vstep, src += ssrc*vstep, dst += sdst*vstep) {
#if @STYPE@ == CONTIG
#if @repl_0w1@
Expand All @@ -140,6 +161,17 @@ static void simd_@TYPE@_@kind@_@STYPE@_@DTYPE@
npyv_@sfx@ v_src0 = npyv_loadn_tillz_@sfx@(src, ssrc, len);
#endif
#endif
#if @RECIP_WORKAROUND@
/*
* Workaround clang bug. We use a dummy variable marked 'volatile'
* to convince clang that the entire vector is needed. We only
* want to do this for the last iteration / partial load-store of
* the loop since 'volatile' forces a refresh of the contents.
*/
if(len < vstep){
volatile npyv_@sfx@ unused_but_workaround_bug = v_src0;
}
#endif // @RECIP_WORKAROUND@
npyv_@sfx@ v_unary0 = npyv_@intr@_@sfx@(v_src0);
#if @DTYPE@ == CONTIG
npyv_store_till_@sfx@(dst, len, v_unary0);
Expand All @@ -154,6 +186,8 @@ static void simd_@TYPE@_@kind@_@STYPE@_@DTYPE@
#endif // @VCHK@
/**end repeat**/

#undef WORKAROUND_CLANG_ARM64_RECIPROCAL_BUG

/********************************************************************************
** Defining ufunc inner functions
********************************************************************************/
Expand Down

0 comments on commit 7227b49

Please sign in to comment.