Performance issue on macOS arm64 (M1) when installing from wheels (2x libopenblas) #15050
Food for thought: limiting the number of threads to 4 for the scipy OpenBLAS while leaving the numpy OpenBLAS at 8 threads improves the speed a bit, but performance is still very bad compared to limiting both runtimes. Here is the variation of the benchmark code I used:

```python
from time import perf_counter
from pathlib import Path
from pprint import pprint

import numpy as np
import scipy
from scipy.sparse.linalg import eigsh
from threadpoolctl import ThreadpoolController

openblas_controller = ThreadpoolController().select(internal_api="openblas")
all_openblas_libs = openblas_controller.info()
print("All linked OpenBLAS libraries:")
pprint(all_openblas_libs)

scipy_path = str(Path(scipy.__file__).parent.absolute())
print(f"scipy installed under {scipy_path}")
scipy_shipped_openblas_libs = [
    lib for lib in all_openblas_libs if lib["filepath"].startswith(scipy_path)
]
print("All scipy shipped libraries:")
pprint(scipy_shipped_openblas_libs)

# Limiting all OpenBLAS libraries (from numpy and scipy) makes the eigsh
# call work fast:
# openblas_controller.limit(limits=4)

# Only limiting scipy's OpenBLAS to 4 threads is not enough to make eigsh
# run fast.
assert len(scipy_shipped_openblas_libs) == 1
scipy_openblas_path = scipy_shipped_openblas_libs[0]["filepath"]
scipy_openblas_controller = openblas_controller.select(filepath=scipy_openblas_path)
pprint(scipy_openblas_controller.info())
scipy_openblas_controller.limit(limits=4)

n_samples, n_features = 2000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_features))
K = X @ X.T

for i in range(10):
    print("running eigsh...")
    tic = perf_counter()
    s, _ = eigsh(K, 3, which="LA", tol=0)
    toc = perf_counter()
    print(f"computed {s} in {toc - tic:.3f} s")
```

and the results:
So the root cause is likely to be exactly the same as OpenMathLib/OpenBLAS#3187, but this time it's not between OpenMP's threadpool and OpenBLAS's threadpool but between two OpenBLAS threadpools, with sequential BLAS calls in scipy and in numpy (as called by scipy).
Ah that's awesome! Thanks for finding that @ogrisel! Looks like that sets the timeout to |
The code is a little obscure; it's unclear to me what that does exactly - if the timeout is really short, is there a risk that this will work with the tests, but fail if we pass in really large arrays? Also, is this something
Not at the moment, and I am not even sure there is a public function in the OpenBLAS API to do it via ctypes. But we could set
Note that setting this only in scipy is already very good. This can be simulated with the following import order:

```python
from time import perf_counter
import numpy as np  # numpy's OpenBLAS loads here, before the env var is set
import os
os.environ["OPENBLAS_THREAD_TIMEOUT"] = "1"  # only affects scipy's OpenBLAS, loaded next
from scipy.sparse.linalg import eigsh

n_samples, n_features = 2000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_features))
K = X @ X.T

for i in range(3):
    print("running eigsh...")
    tic = perf_counter()
    s, _ = eigsh(K, 3, which="LA", tol=0)
    toc = perf_counter()
    print(f"computed {s} in {toc - tic:.3f} s")
```

which yields:
```
>>> import scipy.linalg
>>> from threadpoolctl import ThreadpoolController
>>> ctl = ThreadpoolController()
>>> ob_ctl = ctl.lib_controllers[0]
>>> ob_ctl._dynlib.openblas_set_thread_timeout(1)
Traceback (most recent call last):
  File "<ipython-input-16-d2dac1caca9a>", line 1, in <module>
    ob_ctl._dynlib.openblas_set_thread_timeout(1)
  File "/Users/ogrisel/mambaforge/envs/scipydev/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/Users/ogrisel/mambaforge/envs/scipydev/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: dlsym(0x20208d5a0, openblas_set_thread_timeout): symbol not found
```

I tried on the OpenBLAS from numpy, that is 0.3.18. They probably do not want to expose such an API because, if the threadpool has already been initialized, that would force waiting for the termination of existing threads.
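As the traceback shows, a missing symbol surfaces as an `AttributeError` from ctypes' `dlsym` lookup. A small sketch (not from the thread, names of my own choosing) of probing a loaded shared library for a symbol before calling it, using only the stdlib:

```python
import ctypes


def has_symbol(dynlib, name):
    """Return True if the shared library exposes `name` (probed via dlsym)."""
    try:
        dynlib[name]  # ctypes __getitem__ performs a dlsym lookup
        return True
    except AttributeError:
        return False


# CDLL(None) gives access to symbols already loaded in the current process
# (POSIX only); probing it avoids hardcoding a library path for this demo.
libc = ctypes.CDLL(None)
print(has_symbol(libc, "printf"))                       # a symbol that exists
print(has_symbol(libc, "openblas_set_thread_timeout"))  # absent unless OpenBLAS exports it
```

This is how code that wants to use `openblas_set_thread_timeout` opportunistically could degrade gracefully on OpenBLAS builds that do not export it.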
That's very helpful, then we don't need a new NumPy release.
Makes sense.
You could also consider shipping an OpenBLAS built with
That'd be quite a bit more work though, and repeating effort as an extra build for each new OpenBLAS version. Do you see an upside of this?
To avoid having the scipy import have a side effect on other things running in the same Python program or in subprocesses. Although it's quite unlikely to cause big problems anyway, in my opinion.
For a reason I don't understand, I do not observe a similar slowdown on a non-Apple-M1 platform. For instance, with the same version of the numpy/scipy wheels on a
I also tried with

So maybe we do not fully understand the root cause of the issue...
Yes indeed, it should be some OS level thing I think, because this same release config has been working fine AFAIK for years on Intel Macs, as well as on Windows and Linux when installed from wheels. And even on Apple M1 it does work fine when mixing in the Homebrew
I've asked @Developer-Ecosystem-Engineering if they have any insights here. |
That's the question - if it makes sense to do that for all platforms, then we can just make the change in https://github.com/MacPython/openblas-libs/. But I'd be a little reluctant to change it on other platforms, unless we have a reason to. Even if it has no impact, it's again one of those "what if it did matter" changes when investigating the next problem.
One reason would be to work around the bad interaction with OpenMP runtimes in C/C++/Cython code that does:
as described in OpenMathLib/OpenBLAS#3187 and observed in scikit-learn, which uses the scipy OpenBLAS via scipy's Cython BLAS API: scikit-learn/scikit-learn#20642. However, to fully solve the problem observed in scikit-learn, we would also need to make sure that the OpenMP runtime linked in scikit-learn does not use spinning waits either. Note that this OpenBLAS/OpenMP problem impacts the Linux platform.
I share the feeling.
…nce issues

Addresses at least part of scipy/scipy#15050; TBD if that closes it or if we need to rebuild OpenBLAS instead later. For now (and for the 1.7.3 release) this is the safest and quickest improvement.
Opened MacPython/scipy-wheels#143 for the environment variable solution. I'd like to go with that for
So it seems like we should then ship that rebuilt OpenBLAS in both NumPy and SciPy. That'd be one way we could go. The other, larger change that could be made is to ensure there's only a single OpenBLAS loaded, by:
It'd be the more correct thing to do from a first principles perspective, but I can't say I'd be enthusiastic about that particular house of cards. PyPI's model is unsuitable for this kind of thing, and it encourages doing this also for even more complex stacks (like the geospatial one), which is a bad idea. So my vote would be for rebuilding OpenBLAS if that solves this particular scikit-learn issue.
Indeed, for a minor release a minimal fix only for the macOS M1 problem is best. For the more medium-term plan, maybe we could do this series of changes more progressively. Here are some suggestions:
For the even longer term we could try to plan changes to completely avoid redundancy of OpenBLAS and OpenMP wheels in the scipy stack. First for OpenBLAS itself:
About the switch from native OpenBLAS threads to OpenMP: this would be nice for scikit-learn, but libgomp is still not fork-safe. conda-forge typically replaces libgomp by llvm-openmp by default for this reason. Note that on Linux, gcc is still the recommended compiler for conda-forge, which means that the library is built against libgomp but the OpenMP runtime is then replaced afterwards. Also note that, for some reason, conda-forge decided to use OpenBLAS native threads by default on Linux, while it's also possible to make it use OpenMP. See the

More details on OpenMP in conda-forge in https://conda-forge.org/docs/maintainer/knowledge_base.html#openmp

Maybe the topic of using OpenMP in the scipy stack deserves a Scientific Python SPEC.
@rgommers about the fix for scipy 1.7.3 and macos/arm64, as explained in MacPython/scipy-wheels#143 (comment), the fix is currently not effective because the
This sounds less than ideal to me - we'd have to duplicate the Cython APIs, plus NumPy has far fewer BLAS/LAPACK needs than SciPy and is still on an older minimum required version than SciPy (that's why for example Accelerate is re-enabled in NumPy but cannot be in SciPy). My first reaction is that shipping a standalone wheel would be preferable to this solution.
Yes, I think it does. And it's broader than just packaging, there's an important UX question there as well that deserves thinking about - are we continuing to default to single-threaded everywhere, except for BLAS/LAPACK calls? |
Fine with me as well.
I cannot say for scipy but for scikit-learn we have already started to use OpenMP multithreaded Cython code by default for some components and are progressively generalizing this approach in the code base.
There are many places in SciPy where multithreaded code can boost performance greatly. Hence if there is a way to achieve it without hundreds of
Cython's https://github.com/scikit-learn/scikit-learn/search?q=_OPENMP_SUPPORTED&type=code
But that command built scipy from source, right? How did you install OpenBLAS in this case? You can introspect which OpenBLAS is linked into your scipy 1.7.2 with
Ok, so you built a scipy 1.7.2 from source that linked against an OpenBLAS lib installed separately with Homebrew. The performance problem is probably coming from a packaging problem that happened when generating the 1.7.3 wheels for macos/arm64, as explained in #15050 (comment).
It's interesting to note that the OpenBLAS from Homebrew, which uses OpenMP as its threading layer, does not suffer from the slowdown. I don't know why that is the case.
Sorry, I did not bother to replicate what I had on my configuration from what I had in #14688.
This was exactly why I was worried about degrading performance with the new version of the wheel in my comment in #14688.
My point is that if you want to upgrade to 1.7.3 while still linking to your OpenBLAS from Homebrew, you can still build from source:
You do not need to downgrade, just avoid installing the faulty wheel file.
Did you mean
Maybe even
I did not succeed in building from source, trying different options for
* following on from continued problems reported here: scipy/scipy#15050
* draft in code to check that MacOS wheels get the same `_distributor_init.py` as the one present in the wheels repo, rather than the generic nearly-empty copy actually present in the `1.7.3` release
* we've had two pull requests about this and it slipped through into a release despite 3 sets of eyes, so we need to be careful here; I'd like to see the test script added here failing for MacOS in CI like it does locally, before attempting a fix
* this is likely still WIP, since relative paths/directories can be confusing to predict in the wheels/multibuild control flow, which also flushes in and out of containers I think, etc. (this probably contributed to the original problem)
* the script did seem to work in a quick local check though, complaining about our new MacOS file inclusion, but happy with the long-standing Windows wheel inclusion, for dirs containing wheels for each OS type:

`python verify_init.py /Users/treddy/rough_work/scipy/release_173/release/installers/windows`
`python verify_init.py /Users/treddy/rough_work/scipy/release_173/release/installers/linux`
`python verify_init.py /Users/treddy/rough_work/scipy/release_173/release/installers/mac`

```
Traceback (most recent call last):
  File "verify_init.py", line 40, in <module>
    check_for_dist_init(input_dir=input_dir)
  File "verify_init.py", line 34, in check_for_dist_init
    raise ValueError(f"Contents of _distributor_init.py incorrect for {wheel_file}")
ValueError: Contents of _distributor_init.py incorrect for /Users/treddy/rough_work/scipy/release_173/release/installers/mac/scipy-1.7.3-cp39-cp39-macosx_10_9_x86_64.whl
```
Ok, my draft PR (MacPython/scipy-wheels#151) has a test that seems to correctly detect the issue for MacOS now:
I still need to see if it behaves "ok" for other platforms/matrix entries. Once it does, applying the "actual fix" to include the correct distribution file should allow that test to pass.
https://github.com/MacPython/scipy-wheels/blob/58e0b1dde52fa197bc2bc4b031eadfe5ddfa3532/patch_code.sh#L16 needs another
Indeed, that was the issue - plus a "get this done before going offline for several days", so I didn't get a chance to test the pre-release wheels. The fix here should be to upload new macOS
That's good as long as you don't need to update the source distribution.
Indeed we don't, it's a small fix in the
Hi again, the new wheels for

For consistency with our normal security/hash disclosure practices:
I can confirm using the new wheel
Confirmed!
I did manage to install scipy 1.7.3-1 using pip on my M1. OpenBLAS and python3 are installed using Homebrew. The slowdown seen in 1.7.3 with more threads is no longer there. There is a mild slowdown once the thread count goes above 5.
Thanks for the confirmations everyone. Let's call this good and close the issue. For the potential longer-term changes discussed around #15050 (comment), let's open a separate issue (I can do that now). It would still be good to obtain a more fundamental explanation of what is going wrong at the OS level (and hopefully it's something Apple can make more robust in the future), but for now we're good on avoiding the issue.
I have also observed that when I run the full scikit-learn test suite with pytest-xdist ( I suspect some over-subscription problems between |
That was my conclusion in #14425 (comment) as well. On CI the optimal setting seems to be
This is a follow-up to gh-14688. That issue was originally about a kernel panic (fixed in macOS 12.0.1), and after that the same reproducer showed severe performance issues. This issue is about those performance issues. Note that while the reproducer is the same, it's not clear whether the kernel panic and the performance issues share a root cause.
Issue reproducer
A reproducer (warning: do NOT run on macOS 11.x, it will crash the OS):
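The original reproducer code block was lost from this page; the following is a reconstruction based on the benchmark used in the comments above (the `eigsh` call on a 2000x2000 Gram matrix is the part that exercises the slow path):

```python
from time import perf_counter

import numpy as np
from scipy.sparse.linalg import eigsh

n_samples, n_features = 2000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_features))
K = X @ X.T  # dense 2000x2000 symmetric matrix (low rank)

# On an affected install, each iteration takes dozens of seconds instead of
# well under a second.
for i in range(3):
    tic = perf_counter()
    s, _ = eigsh(K, 3, which="LA", tol=0)
    toc = perf_counter()
    print(f"computed {s} in {toc - tic:.3f} s")
```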
Running `scipy.test()` or `scipy.linalg.test()` will also show a significant performance impact.

Performance impact
In situations where we hit the performance problem, the above code will show:
And if we don't hit that problem:
So a ~50x slowdown for this particular example.
There is in general an impact on functions that use BLAS/LAPACK. The impact on the total time taken by `scipy.test()` was about 30% (311 sec. with default settings, 234 sec. when using `OPENBLAS_NUM_THREADS=1`) - note that this was just a single test on one build config, results may vary: #14688 (comment). The single-threaded case has similar timings as when running the test suite on a `scipy` install that doesn't show the problem at all (~240 sec. seems expected on arm64 macOS, and it doesn't depend on the threading setting, because test arrays are always small). Important: ensure `pytest-xdist` is not installed when looking at time taken by the test suite (see gh-14425 for why).

When the problem occurs
The discussion in gh-14688 showed that this problem gets hit when two copies of `libopenblas` get loaded. The following configurations showed a problem so far:

- `numpy` and `scipy` from a wheel (e.g., `numpy` 1.21.4 from PyPI and the latest `1.8.0.dev0` wheel from https://anaconda.org/scipy-wheels-nightly/scipy/)
- `numpy` 1.21.4 from PyPI and installing `scipy` locally when built against conda-forge's `openblas`.

These configurations did not show a problem:

- `numpy` 1.21.4 from PyPI and installing `scipy` locally when built against Homebrew's `openblas`.
- Configurations in which only a single `libopenblas` is loaded.

It is unclear right now what the exact root cause is. The situation when using conda-forge's `openblas` is very similar to that using Homebrew's `openblas`, but only one of those triggers the issue. The most important situation is installing both NumPy and SciPy from wheels though, that's what the vast majority of `pip`/PyPI users will get.

A difference between conda-forge and Homebrew that may be relevant is that the former uses `@rpath` and the latter a hardcoded path to load `libopenblas`. That may not be the only difference, e.g. compilers used to build `libopenblas` and `scipy` were not the same. Also `libopenblas` can be built with either `pthreads` or `openmp` - numpy and scipy wheels use `pthreads`, while conda-forge and Homebrew both use `openmp`.

To check if two `libopenblas` libraries get loaded, use:

Context: why do 2 libopenblas copies get loaded
The reason is that the NumPy and SciPy wheels both vendor a copy of `libopenblas` within them, and extension modules that need `libopenblas` are depending directly on that vendored copy.

This is how we have been shipping wheels for years, and it works fine across Windows, Linux and macOS. It seems like a weird thing to do of course (if you know how package managers work but are new to PyPI/wheels) - it's a long story, but the tl;dr is that PyPI wasn't designed with non-Python dependencies in mind, so the usual approach is to bundle those all into a wheel (it tends to work, unless you have complex non-Python dependencies). It'd be very much nontrivial to do any kind of unbundling here, and doing so would break situations where `numpy` and `scipy` are not installed in the same way (e.g., the former from conda-forge/Homebrew, the latter from PyPI).

Possible root causes
The kernel panic had to do with spin locks apparently. It is not clear if the performance issues are also due to that, or have a completely different root cause. It does seem to be the case that two copies of the same shared library with the same version (all are `libopenblas.0.dylib`) cause a conflict at the OS level somehow. Anything beyond that is speculation at this point.

Can we work around the problem?
If we release wheels for macOS 12, many people are going to hit this problem. A 50x slowdown for some code using linalg functionality for the default install configuration of `pip install numpy scipy` does not seem acceptable - that will lead too many users on wild goose chases. On the other hand, it should be pointed out that if users build SciPy 1.7.2 from source on a native arm64 Python install, they will hit the same problem anyway. So not releasing any wheels isn't much better; at best it signals to users that they shouldn't use `arm64` just yet but stick with `x86_64` (but that does have some performance implications as well).

At this point it looks like controlling the number of threads that OpenBLAS uses is the way we can work around this problem (or let users do so). Ways to control threading:

- Use `threadpoolctl` (see the README at https://github.com/joblib/threadpoolctl for how)
- Set the `OPENBLAS_NUM_THREADS` environment variable
- Rebuild the `libopenblas` we bundle in the wheel to have a max number of threads of 1, 2, or 4.

SciPy doesn't have a `threadpoolctl` runtime dependency, and it doesn't seem desirable to add one just for this issue. Note though that gh-14441 aims to add it as an optional dependency to improve test suite parallelism, and longer term we perhaps do want that dependency. Also, scikit-learn has a hard dependency on it, so many users will already have it installed.

Rebuilding
`libopenblas` with a low max number of threads does not allow users who know what they are doing, or don't suffer from the problem, to optimize threading behavior for their own code. It was pointed out in #14688 (comment) that this is undesirable.

Setting an environment variable is also not a great thing to do (a library should normally never ever do this), but if it works to do so in `scipy/__init__.py` then that may be the most pragmatic solution right now. However, this must be done before `libopenblas` is first loaded or it won't take effect. So if users import numpy first, then setting an env var will already have no effect on that copy of `libopenblas`. It needs testing whether this then still works around the problem or not.

Note: I wanted to have everything in one place, but let's discuss the release strategy on the mailing list (link to thread), and the actual performance issue here.
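As an illustration of the `threadpoolctl` option listed above, a sketch of capping BLAS threads around a computation at runtime (no environment variable needed, but it does make the calling code depend on `threadpoolctl`):

```python
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.default_rng(0).normal(size=(500, 500))

# All BLAS runtimes loaded in the process (numpy's and scipy's vendored
# copies alike) are capped at 4 threads inside this context manager, and the
# previous limits are restored on exit.
with threadpool_limits(limits=4, user_api="blas"):
    b = a @ a

print(b.shape)
```

Unlike the `OPENBLAS_NUM_THREADS` approach, this works even after the libraries have been loaded, which is what makes it attractive for end users who import numpy first.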
Testing on other macOS arm64 build/install configurations
Request: if you have a build config on macOS arm64 that is not covered by the above summary yet, please run the following and reply on this issue with the results: