Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Resolve Divide by Zero on Apple silicon + test failures #19926

Merged
merged 3 commits into from
Sep 25, 2021

Conversation

Developer-Ecosystem-Engineering
Copy link
Contributor

Clang has an optimization bug where a vector that is only partially loaded / stored will get narrowed to only the lanes used, which can be fine in some cases. However, in numpy's reciprocal function a partial load is explicitly extended to the full width of the register (filled with '1's) to avoid divide-by-zero. clang's optimization ignores the explicit filling with '1's.

The changes here insert a dummy volatile variable. This convinces clang not to narrow the load and ignore the explicit filling of '1's.

volatile can be expensive since it forces loads / stores to refresh contents whenever the variable is used. numpy has its own template / macro system that'll expand the loop function below for sqrt, absolute, square, and reciprocal. Additionally, the loop can be called on a full array if there's overlap between src and dst. Consequently, we try to limit the scope that we need to apply volatile. Intention is it should only be needed when compiling with clang, against Apple arm64, and only for the reciprocal function. Moreover, volatile is only needed when a vector is partially loaded.

Testing:
Fixes #18555 and the changes here also resolve several failures in numpy's test suite.

Before:

FAILED numpy/core/tests/test_scalarmath.py::TestBaseMath::test_blocked - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_ufunc.py::TestUfuncGenericLoops::test_unary_PyUFunc_O_O_method_full[reciprocal] - AssertionError: FloatingPointError not raised
FAILED numpy/core/tests/test_umath.py::TestPower::test_power_float - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
FAILED numpy/core/tests/test_umath.py::TestAVXUfuncs::test_avx_based_ufunc - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormDouble::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormSingle::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormInt64::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
8 failed, 14759 passed, 204 skipped, 1268 deselected, 34 xfailed in 69.90s (0:01:09)

After:

FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
1 failed, 14766 passed, 204 skipped, 1268 deselected, 34 xfailed in 70.37s (0:01:10)

After addressing this failure, we've added to it (second commit requires the first)

Numpy's FP unary loops use a partial load / store on every iteration. The partial load / store helpers each insert a switch statement to know how many elements to handle. This causes a lot of unnecessary branches to be inserted in the loops. The partial load / store is only needed on the final iteration of the loop if it isn't a full vector.

The changes here breakout the final iteration to use the partial load / stores. The loop has been changed to use full load / stores. Additionally, this means we don't need to conditionalize the volatile workaround in the loop.

clang has an optimization bug where a vector that is only partially loaded / stored will get narrowed to only the lanes used, which can be fine in some cases. However, in numpy's `reciprocal` function a partial load is explicitly extended to the full width of the register (filled with '1's) to avoid divide-by-zero. clang's optimization ignores the explicit filling with '1's.

The changes here insert a dummy `volatile` variable. This convinces clang not to narrow the load and ignore the explicit filling of '1's.

`volatile` can be expensive since it forces loads / stores to refresh contents whenever the variable is used. numpy has its own template / macro system that'll expand the loop function below for sqrt, absolute, square, and reciprocal. Additionally, the loop can be called on a full array if there's overlap between src and dst. Consequently, we try to limit the scope that we need to apply `volatile`. Intention is it should only be needed when compiling with clang, against Apple arm64, and only for the `reciprocal` function. Moreover, `volatile` is only needed when a vector is partially loaded.

Testing:
Beyond fixing the cases mentioned in the GitHub issue, the changes here also resolve several failures in numpy's test suite.

Before:
```
FAILED numpy/core/tests/test_scalarmath.py::TestBaseMath::test_blocked - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_ufunc.py::TestUfuncGenericLoops::test_unary_PyUFunc_O_O_method_full[reciprocal] - AssertionError: FloatingPointError not raised
FAILED numpy/core/tests/test_umath.py::TestPower::test_power_float - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
FAILED numpy/core/tests/test_umath.py::TestAVXUfuncs::test_avx_based_ufunc - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormDouble::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormSingle::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormInt64::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
8 failed, 14759 passed, 204 skipped, 1268 deselected, 34 xfailed in 69.90s (0:01:09)
```

After:
```
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
1 failed, 14766 passed, 204 skipped, 1268 deselected, 34 xfailed in 70.37s (0:01:10)
```
Enhancement on top of workaround for clang bug in reciprocal (numpy#18555)
Numpy's FP unary loops use a partial load / store on every iteration. The partial load / store helpers each insert a switch statement to know how many elements to handle. This causes a lot of unnecessary branches to be inserted in the loops. The partial load / store is only needed on the final iteration of the loop if it isn't a full vector.

The changes here breakout the final iteration to use the partial load / stores. The loop has been changed to use full load / stores. Additionally, this means we don't need to conditionalize the volatile workaround in the loop.
@seberg seberg added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Sep 22, 2021
@charris charris added the 09 - Backport-Candidate PRs tagged should be backported label Sep 23, 2021
@Developer-Ecosystem-Engineering
Copy link
Contributor Author

Not seeing this Azure failure locally, will look into the differences.

It appears to be due to CI using Apple clang version 11.0.0, we will likely need to expand the scope of the workaround for CI to pass, -ftrapping-math is only available on newer clangs for x86_64

@rgommers rgommers added this to the 1.22.0 release milestone Sep 23, 2021
@seberg
Copy link
Member

seberg commented Sep 23, 2021

I am wondering how much effort we should spend on buggy clang (or MacOS math libraries) here, especially if it only affects older versions... Most likely there are far more FPE failures on MacOS anyway and the only reason CI passes is that we don't have systematic unit tests for most functions.

@rgommers
Copy link
Member

I am wondering how much effort we should spend on buggy clang (or MacOS math libraries) here, especially if it only affects older versions

@seberg this remainder issue is important, it's causing a lot of SciPy failures on macOS arm64 and it is the only such issue left. IIRC there were 121 SciPy failures on master last time I checked, and >110 were caused by this warning.

Most likely there are far more FPE failures on MacOS anyway and the only reason CI passes is that we don't have systematic unit tests for most functions.

Other than one SciPy kernel panic related to ARPACK, we're in pretty good shape for macOS arm64. The NumPy test suite passes in release mode.

@rgommers
Copy link
Member

If we can upgrade Clang to the minimum required version in CI that would be fine too. Conda Clang compilers are at 11.1 though, so not much higher than 11.0 as in CI here. If we can keep 11.1 as a base without a large amount of extra effort rather than bumping to 12.0, that would be nice.

@rgommers
Copy link
Member

this remainder issue is important, it's causing a lot of SciPy failures on macOS arm64 and it is the only such issue left. IIRC there were 121 SciPy failures on master last time I checked, and >110 were caused by this warning.

Actually that isn't completely correct. With SciPy master + numpy 1.21.1 there are now 101 failures. About 80 of those are BLAS/LAPACK related. With this PR that's down to 89 test failures, so it resolves about half of the non-linalg failures. What's left is similar in nature:

FAILED lib/python3.9/site-packages/scipy/signal/tests/test_ltisys.py::TestPlacePoles::test_complex - RuntimeWarning: divide by zero encountered in det
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_discrete_distns.py::TestNCH::test_nchypergeom_wallenius_naive - RuntimeWarning: invalid val...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_continuous_basic.py::test_cont_basic[500-200-beta-arg4] - RuntimeWarning: overflow encounte...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_distributions.py::TestFrozen::test_pickling - RuntimeWarning: overflow encountered in _beta...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_distributions.py::TestExpect::test_beta - RuntimeWarning: overflow encountered in _beta_ppf
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_continuous_basic.py::test_cont_basic[500-200-rdist-arg85] - RuntimeWarning: overflow encoun...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_continuous_basic.py::test_cont_basic[500-200-semicircular-arg89] - RuntimeWarning: overflow...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_stats.py::TestNumericalInverseHermite::test_basic[rdist-shapes86] - RuntimeWarning: overflo...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_stats.py::TestNumericalInverseHermite::test_basic[semicircular-shapes90] - RuntimeWarning: ...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_stats.py::TestNumericalInverseHermite::test_inaccurate_CDF - RuntimeWarning: overflow encou...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_continuous_basic.py::test_methods_with_lists[beta-args4-ppf] - RuntimeWarning: overflow enc...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_continuous_basic.py::test_methods_with_lists[beta-args4-isf] - RuntimeWarning: overflow enc...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_continuous_basic.py::test_methods_with_lists[rdist-args86-ppf] - RuntimeWarning: overflow e...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_continuous_basic.py::test_methods_with_lists[rdist-args86-isf] - RuntimeWarning: overflow e...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_continuous_basic.py::test_methods_with_lists[semicircular-args90-ppf] - RuntimeWarning: ove...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_continuous_basic.py::test_methods_with_lists[semicircular-args90-isf] - RuntimeWarning: ove...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_discrete_basic.py::test_discrete_basic[nbinom-arg13-False] - RuntimeWarning: overflow encou...
FAILED lib/python3.9/site-packages/scipy/stats/tests/test_discrete_basic.py::test_moments[nbinom-arg13] - RuntimeWarning: overflow encountered in _nb...
=

- -ftrapping-math is default enabled for Numpy, but support in clang is mainly for x86_64
- Apple Clang and Clang have different, but overlapping versions
- Non-Apple Clang versions come from looking at when they started supporting -ftrapping-math for x86_64

Testing was done against Apple Clang versions
- v11 / x86_64 - failed previously, now passes (azure failure)
- v12+ / x86_64 - passes before and after
- v13 / arm64 - failed before initial patch, passes after
@Developer-Ecosystem-Engineering
Copy link
Contributor Author

Latest commit should resolve Azure failures now as well, waiting for CI to finish to confirm.

@charris charris merged commit 0b01b48 into numpy:main Sep 25, 2021
@charris
Copy link
Member

charris commented Sep 25, 2021

Thanks @Developer-Ecosystem-Engineering .

charris pushed a commit to charris/numpy that referenced this pull request Sep 25, 2021
…9926)

* Resolve divide by zero in reciprocal numpy#18555

clang has an optimization bug where a vector that is only partially loaded / stored will get narrowed to only the lanes used, which can be fine in some cases. However, in numpy's `reciprocal` function a partial load is explicitly extended to the full width of the register (filled with '1's) to avoid divide-by-zero. clang's optimization ignores the explicit filling with '1's.

The changes here insert a dummy `volatile` variable. This convinces clang not to narrow the load and ignore the explicit filling of '1's.

`volatile` can be expensive since it forces loads / stores to refresh contents whenever the variable is used. numpy has its own template / macro system that'll expand the loop function below for sqrt, absolute, square, and reciprocal. Additionally, the loop can be called on a full array if there's overlap between src and dst. Consequently, we try to limit the scope that we need to apply `volatile`. Intention is it should only be needed when compiling with clang, against Apple arm64, and only for the `reciprocal` function. Moreover, `volatile` is only needed when a vector is partially loaded.

Testing:
Beyond fixing the cases mentioned in the GitHub issue, the changes here also resolve several failures in numpy's test suite.

Before:
```
FAILED numpy/core/tests/test_scalarmath.py::TestBaseMath::test_blocked - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_ufunc.py::TestUfuncGenericLoops::test_unary_PyUFunc_O_O_method_full[reciprocal] - AssertionError: FloatingPointError not raised
FAILED numpy/core/tests/test_umath.py::TestPower::test_power_float - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
FAILED numpy/core/tests/test_umath.py::TestAVXUfuncs::test_avx_based_ufunc - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormDouble::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormSingle::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormInt64::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
8 failed, 14759 passed, 204 skipped, 1268 deselected, 34 xfailed in 69.90s (0:01:09)
```

After:
```
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
1 failed, 14766 passed, 204 skipped, 1268 deselected, 34 xfailed in 70.37s (0:01:10)
```

* Enhancement on top of workaround for clang bug in reciprocal

Enhancement on top of workaround for clang bug in reciprocal (numpy#18555)
Numpy's FP unary loops use a partial load / store on every iteration. The partial load / store helpers each insert a switch statement to know how many elements to handle. This causes a lot of unnecessary branches to be inserted in the loops. The partial load / store is only needed on the final iteration of the loop if it isn't a full vector.

The changes here breakout the final iteration to use the partial load / stores. The loop has been changed to use full load / stores. Additionally, this means we don't need to conditionalize the volatile workaround in the loop.

* Address Azure CI failures with older versions of clang

- -ftrapping-math is default enabled for Numpy, but support in clang is mainly for x86_64
- Apple Clang and Clang have different, but overlapping versions
- Non-Apple Clang versions come from looking at when they started supporting -ftrapping-math for x86_64

Testing was done against Apple Clang versions
- v11 / x86_64 - failed previously, now passes (azure failure)
- v12+ / x86_64 - passes before and after
- v13 / arm64 - failed before initial patch, passes after
@isuruf
Copy link
Contributor

isuruf commented Sep 25, 2021

Conda Clang compilers are at 11.1 though, so not much higher than 11.0 as in CI here.

FYI, Conda clang 11.1 is equivalent to Apple clang 12. See the versioning at https://en.wikipedia.org/wiki/Xcode

@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Sep 25, 2021
charris added a commit that referenced this pull request Sep 26, 2021
BUG: Resolve Divide by Zero on Apple silicon + test failures (#19926)
howjmay pushed a commit to howjmay/numpy that referenced this pull request Sep 29, 2021
…9926)

* Resolve divide by zero in reciprocal numpy#18555

clang has an optimization bug where a vector that is only partially loaded / stored will get narrowed to only the lanes used, which can be fine in some cases. However, in numpy's `reciprocal` function a partial load is explicitly extended to the full width of the register (filled with '1's) to avoid divide-by-zero. clang's optimization ignores the explicit filling with '1's.

The changes here insert a dummy `volatile` variable. This convinces clang not to narrow the load and ignore the explicit filling of '1's.

`volatile` can be expensive since it forces loads / stores to refresh contents whenever the variable is used. numpy has its own template / macro system that'll expand the loop function below for sqrt, absolute, square, and reciprocal. Additionally, the loop can be called on a full array if there's overlap between src and dst. Consequently, we try to limit the scope that we need to apply `volatile`. Intention is it should only be needed when compiling with clang, against Apple arm64, and only for the `reciprocal` function. Moreover, `volatile` is only needed when a vector is partially loaded.

Testing:
Beyond fixing the cases mentioned in the GitHub issue, the changes here also resolve several failures in numpy's test suite.

Before:
```
FAILED numpy/core/tests/test_scalarmath.py::TestBaseMath::test_blocked - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_ufunc.py::TestUfuncGenericLoops::test_unary_PyUFunc_O_O_method_full[reciprocal] - AssertionError: FloatingPointError not raised
FAILED numpy/core/tests/test_umath.py::TestPower::test_power_float - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
FAILED numpy/core/tests/test_umath.py::TestAVXUfuncs::test_avx_based_ufunc - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormDouble::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormSingle::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
FAILED numpy/linalg/tests/test_linalg.py::TestNormInt64::test_axis - RuntimeWarning: divide by zero encountered in reciprocal
8 failed, 14759 passed, 204 skipped, 1268 deselected, 34 xfailed in 69.90s (0:01:09)
```

After:
```
FAILED numpy/core/tests/test_umath.py::TestSpecialFloats::test_tan - AssertionError: FloatingPointError not raised by tan
1 failed, 14766 passed, 204 skipped, 1268 deselected, 34 xfailed in 70.37s (0:01:10)
```

* Enhancement on top of workaround for clang bug in reciprocal

Enhancement on top of workaround for clang bug in reciprocal (numpy#18555)
Numpy's FP unary loops use a partial load / store on every iteration. The partial load / store helpers each insert a switch statement to know how many elements to handle. This causes a lot of unnecessary branches to be inserted in the loops. The partial load / store is only needed on the final iteration of the loop if it isn't a full vector.

The changes here breakout the final iteration to use the partial load / stores. The loop has been changed to use full load / stores. Additionally, this means we don't need to conditionalize the volatile workaround in the loop.

* Address Azure CI failures with older versions of clang

- -ftrapping-math is default enabled for Numpy, but support in clang is mainly for x86_64
- Apple Clang and Clang have different, but overlapping versions
- Non-Apple Clang versions come from looking at when they started supporting -ftrapping-math for x86_64

Testing was done against Apple Clang versions
- v11 / x86_64 - failed previously, now passes (azure failure)
- v12+ / x86_64 - passes before and after
- v13 / arm64 - failed before initial patch, passes after
h-vetinari added a commit to regro-cf-autotick-bot/numpy-feedstock that referenced this pull request Jan 1, 2022
- test_sincos_float32 taken care of in numpy/numpy#20274
- TestF77Mismatch removed upstream in numpy/numpy#20457
- test_unary_PyUFunc_O_O_method_full fixed in 1.22 (numpy/numpy#19926)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug component: SIMD Issues in SIMD (fast instruction sets) code or machinery
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RuntimeWarning: divide by zero encountered in reciprocal for np.ones(3) ** -1
5 participants