Performance issue on macOS arm64 (M1) when installing from wheels (2x libopenblas) #15050

rgommers · 2021-11-17T05:44:28Z

This is a follow up to gh-14688. That issue was originally about a kernel panic (fixed in macOS 12.0.1), and after that the same reproducer showed severe performance issues. This issue is about those performance issues. Note that while the reproducer is the same, it's not clear whether or not the kernel panic and the performance issues share a root cause or not.

Issue reproducer

A reproducer (warning: do NOT run on macOS 11.x, it will crash the OS):

from time import perf_counter
import numpy as np
from scipy.sparse.linalg import eigsh


n_samples, n_features = 2000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_features))
K = X @ X.T

for i in range(10):
    print("running eigsh...")
    tic = perf_counter()
    s, _ = eigsh(K, 3, which="LA", tol=0)
    toc = perf_counter()
    print(f"computed {s} in {toc - tic:.3f} s")

Running scipy.test() or scipy.linalg.test() will also show a significant performance impact.

Performance impact

In situations where we hit the performance problem, the above code will show:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.062 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.888 s
...

And if we don't hit that problem:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.018 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.018 s
...

So a ~50x slowdown for this particular example.

There is in general an impact on functions that use BLAS/LAPACK. The impact on the total time taken by scipy.test() was about 30% (311 sec. with default settings, 234 sec. when using OPENBLAS_NUM_THREADS=1) - note that this was just a single test on one build config, results may vary: #14688 (comment). The single-threaded case has similar timings as when running the test suite on a scipy install that doesn't show the problem at all (~240 sec. seems expected on arm64 macOS, and it doesn't depend on the threading setting (because test arrays are always small)). Important: ensure pytest-xdist is not installed when looking at time taken by the test suite (see gh-14425 for why).

When the problem occurs

The discussion in gh-14688 showed that this problem gets hit when two copies of libopenblas get loaded. The following configurations showed a problem so far:

Installing both numpy and scipy from a wheel (e.g., numpy 1.21.4 from PyPI and the latest 1.8.0.dev0 wheel from https://anaconda.org/scipy-wheels-nightly/scipy/)
Installing numpy 1.21.4 from PyPI and installing scipy locally when built against conda-forge's openblas.

These configurations did not show a problem:

Installing numpy 1.21.4 from PyPI and installing scipy locally when built against Homebrew's openblas.
Any situation where only a single libopenblas is loaded.

It is unclear right now what the exact root cause is. The situation when using conda-forge's openblas is very similar to that using Homebrew's openblas, but only one of those triggers the issue. The most important situation is installing both NumPy and SciPy from wheels though, that's what the vast majority of pip/PyPI users will get.

A difference between conda-forge and Homebrew that may be relevant is that the former uses @rpath and the latter a hardcoded path to load libopenblas:

% # conda-forge
% otool -L _fblas.cpython-39-darwin.so
_fblas.cpython-39-darwin.so:
	@rpath/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)

% # Homebrew
% otool -L  /opt/homebrew/lib/python3.9/site-packages/scipy/linalg/_fblas.cpython-*-darwin.so
/opt/homebrew/lib/python3.9/site-packages/scipy/linalg/_fblas.cpython-39-darwin.so:
	/opt/homebrew/opt/openblas/lib/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	/opt/homebrew/opt/gcc/lib/gcc/11/libgfortran.5.dylib (compatibility version 6.0.0, current version 6.0.0)
	/opt/homebrew/opt/gcc/lib/gcc/11/libgcc_s.1.1.dylib (compatibility version 1.0.0, current version 1.1.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)

That may not be the only difference, e.g. compilers used to build libopenblas and scipy were not the same. Also libopenblas can be built with either pthreads or openmp - numpy and scipy wheels use pthreads, while conda-forge and Homebrew both use openmp.

To check if two libopenblas libraries get loaded, use:

❯ python -m threadpoolctl -i scipy.linalg
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/ogrisel/mambaforge/envs/tmp/lib/python3.9/site-packages/numpy/.dylibs/libopenblas64_.0.dylib",
    "version": "0.3.18",
    "threading_layer": "pthreads",
    "architecture": "armv8",
    "num_threads": 8
  },
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/ogrisel/mambaforge/envs/tmp/lib/python3.9/site-packages/scipy/.dylibs/libopenblas.0.dylib",
    "version": "0.3.17",
    "threading_layer": "pthreads",
    "architecture": "armv8",
    "num_threads": 8
  }
]

Context: why do 2 libopenblas copies get loaded

The reason is that the NumPy and SciPy wheels both vendor a copy of libopenblas within them, and extension modules that need libopenblas are depending directly on that vendored copy:

% cd /path/to/site-packages/scipy/linalg
% otool -L _fblas.cpython-39-darwin.so 
_fblas.cpython-39-darwin.so:
	@loader_path/../.dylibs/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	@loader_path/../.dylibs/libgfortran.5.dylib (compatibility version 6.0.0, current version 6.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.60.1)

% cd ../../numpy/linalg
% otool -L _umath_linalg.cpython-39-darwin.so 
_umath_linalg.cpython-39-darwin.so:
	@loader_path/../.dylibs/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.60.1)

This is how we have been shipping wheels for years, and it works fine across Windows, Linux and macOS. It seems like a weird thing to do of course (if you know how package managers work but are new to PyPI/wheels) - it's a long story, but the tl;dr is that PyPI wasn't designed with non-Python dependencies in mind, so the usual approach is to bundle those all into a wheel (it tends to work, unless you have complex non-Python dependencies). It'd be very much nontrivial to do any kind of unbundling here, and doing so would break situations where numpy and scipy are not installed in the same way (e.g., the former from conda-forge/Homebrew, the latter from PyPI).

Possible root causes

The kernel panic had to do with spin locks apparently. It is not clear if the performance issues are also due to that, or have a completely different root cause. It does seem to be the case that two copies of the same shared library with the same version (all are libopenblas.0.dylib) cause a conflict at the OS level somehow. Anything beyond that is speculation at this point.

Can we work around the problem?

If we release wheels for macOS 12, many people are going to hit this problem. A 50x slowdown for some code using linalg functionality for the default install configuration of pip install numpy scipy does not seem acceptable - that will lead too many users on wild goose chases. On the other hand it should be pointed out that if users build SciPy 1.7.2 from source on a native arm64 Python install, they will anyway hit the same problem. So not releasing any wheels isn't much better; at best it signals to users that they shouldn't use arm64 just yet but stick with x86_64 (but that does have some performance implications as well).

At this point it looks like controlling the number of threads that OpenBLAS uses is the way we can work around this problem (or let users do so). Ways to control threading:

Use threadpoolctl (see the README at https://github.com/joblib/threadpoolctl for how)
Set an environment variable to control the behavior, e.g. OPENBLAS_NUM_THREADS
Rebuild the libopenblas we bundle in the wheel to have a max number of threads of 1, 2, or 4.

SciPy doesn't have a threadpoolctl runtime dependency, and it doesn't seem desirable to add one just for this issue. Note though that gh-14441 aims to add it as an optional dependency to improve test suite parallelism, and longer term we perhaps do want that dependency. Also, scikit-learn has a hard dependency on it, so many users will already have it installed.

Rebuilding libopenblas with a low max number of threads does not allow users who know what they are doing or don't suffer from the problem to optimize threading behavior for their own code. It was pointed out in #14688 (comment) that this is undesirable.

Setting an environment variable is also not a great thing to do (a library should normally never ever do this), but if it works to do so in scipy/__init__.py then that may be the most pragmatic solution right now. However, this must be done before libopenblas is first loaded or it won't take effect. So if users import numpy first, then setting an env var will already have no effect on that copy of libopenblas. It needs testing whether this then still works around the problem or not.

Note: I wanted to have everything in one place, but let's discuss the release strategy on the mailing list (link to thread), and the actual performance issue here.

Testing on other macOS arm64 build/install configurations

Request: if you have a build config on macOS arm64 that is not covered by the above summary yet, please run the following and reply on this issue with the results:

% python -m threadpoolctl -i scipy.linalg

% cd /PATH/TO/scipy/linalg
% otool -L _fblas.cpython-*-darwin.so

% cd /PATH/TO/numpy/linalg
% otool -L _umath_linalg.cpython-*-darwin.so

% # Run the reproducer (again, only on macOS 12 - you will trigger an OS
% # crash on macOS 11.x!) and report if the time per `eigsh` call is ~0.02 sec. or ~1 sec.

% pip list    # if using pip for everything
% conda list  # if using conda

The text was updated successfully, but these errors were encountered:

ogrisel · 2021-11-17T14:29:05Z

Food for thought: limiting the number of threads to 4 for the scipy OpenBLAS while leaving the numpy OpenBLAS use 8 threads improves the speed a bit but we still have very bad performance compared to limiting both runtimes.

Here is the variation of the benchmark code I used:

from time import perf_counter
import numpy as np
import scipy
from pathlib import Path
from scipy.sparse.linalg import eigsh
from pprint import pprint
from threadpoolctl import ThreadpoolController


openblas_controller = ThreadpoolController().select(internal_api="openblas")

all_openblas_libs = openblas_controller.info()
print("All linked OpenBLAS libraries:")
pprint(all_openblas_libs)

scipy_path = str(Path(scipy.__file__).parent.absolute())
print(f"scipy installed under {scipy_path}")
scipy_shipped_openblas_libs = [
    lib for lib in all_openblas_libs if lib["filepath"].startswith(scipy_path)
]
print("All scipy shipped libraries:")
pprint(scipy_shipped_openblas_libs)

# Limiting all OpenBLAS libraries (from numpy and scipy) make the eigsh
# call work fast
# openblas_controller.limit(limits=4)

# Only limit scipy's OpenBLAS to 4 threads is not enough to make eigsh
# run fast.
assert len(scipy_shipped_openblas_libs) == 1
scipy_openblas_path = scipy_shipped_openblas_libs[0]["filepath"]
scipy_openblas_controller = openblas_controller.select(filepath=scipy_openblas_path)
pprint(scipy_openblas_controller.info())
scipy_openblas_controller.limit(limits=4)

n_samples, n_features = 2000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_features))
K = X @ X.T

for i in range(10):
    print("running eigsh...")
    tic = perf_counter()
    s, _ = eigsh(K, 3, which="LA", tol=0)
    toc = perf_counter()
    print(f"computed {s} in {toc - tic:.3f} s")

and the results:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.280 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.511 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.393 s
...

ogrisel · 2021-11-17T14:33:45Z

Setting OPENBLAS_THREAD_TIMEOUT=1 without limiting the number of threads fixes the problem!

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.021 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.014 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.016 s
...

ogrisel · 2021-11-17T14:42:48Z

So the root cause is likely to be exactly the same as OpenMathLib/OpenBLAS#3187 but this time it's not between OpenMP's threadpool and OpenBLAS's threadpool but between 2 OpenBLAS threadpools with sequential blas calls in scipy and numpy called by scipy.

rgommers · 2021-11-17T14:47:34Z

Ah that's awesome! Thanks for finding that @ogrisel!

Looks like that sets the timeout to 1 << 4 = 16 clock cycles if I read this correctly:

rgommers · 2021-11-17T14:51:31Z

The code is a little obscure, it's unclear to me what that does exactly - if the timeout is really short, is there a risk that this will work with the tests, but fail if we pass in really large arrays?

Also, is this something threadpoolctl can control? Otherwise we're back to the problem of needing to set an env var before the first import of both numpy and scipy.

ogrisel · 2021-11-17T15:46:41Z

Also, is this something threadpoolctl can control?

Not at the moment and I am not even sure there is a public function in the OpenBLAS API do do it via ctypes.

But we could set os.environ["OPENBLAS_THREAD_TIMEOUT"] = "1" in the _distributor_init.py of the numpy and scipy wheels, at least for the macos/arm64 platform for now and maybe for other platforms later.

ogrisel · 2021-11-17T15:50:02Z

Note that setting this only in scipy is already very good. This can be simulated with the following import order:

from time import perf_counter
import numpy as np
import os
os.environ["OPENBLAS_THREAD_TIMEOUT"] = "1"
from scipy.sparse.linalg import eigsh


n_samples, n_features = 2000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_features))
K = X @ X.T

for i in range(3):
    print("running eigsh...")
    tic = perf_counter()
    s, _ = eigsh(K, 3, which="LA", tol=0)
    toc = perf_counter()
    print(f"computed {s} in {toc - tic:.3f} s")

which yields:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.115 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.016 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.015 s

ogrisel · 2021-11-17T15:54:00Z

openblas_set_thread_timeout does not seem to exist:

>>> import scipy.linalg
>>> from threadpoolctl import ThreadpoolController
>>> ctl = ThreadpoolController()
>>> ob_ctl = ctl.lib_controllers[0]
>>> ob_ctl._dynlib.openblas_set_thread_timeout(1)
Traceback (most recent call last):
  File "<ipython-input-16-d2dac1caca9a>", line 1, in <module>
    ob_ctl._dynlib.openblas_set_thread_timeout(1)
  File "/Users/ogrisel/mambaforge/envs/scipydev/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/Users/ogrisel/mambaforge/envs/scipydev/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: dlsym(0x20208d5a0, openblas_set_thread_timeout): symbol not found

I tried on the openblas from numpy, that is 0.3.18.

They probably do not want to expose such an API because if the threadpool has already been initialized that would force wait for the termination of existing threads.

rgommers · 2021-11-17T16:16:23Z

Also, is this something threadpoolctl can control?

Not at the moment and I am not even sure there is a public function in the OpenBLAS API do do it via ctypes.

But we could set os.environ["OPENBLAS_THREAD_TIMEOUT"] = "1" in the _distributor.py of the numpy and scipy wheels, at least for the macos/arm64 platform for now and maybe for other platforms later.

That's very helpful, then we don't need a new NumPy release.

They probably do not want to expose such an API because if the threadpool has already been initialized that would force wait for the termination of existing threads.

Makes sense.

jeremiedbb · 2021-11-17T19:14:55Z

But we could set os.environ["OPENBLAS_THREAD_TIMEOUT"] = "1" in the _distributor.py of the numpy and scipy wheels, at least for the macos/arm64 platform for now and maybe for other platforms later.

You could also consider shipping an OpenBLAS built with THREAD_TIMEOUT=1 instead of relying on the _distributor_init

rgommers · 2021-11-17T19:34:35Z

You could also consider shipping an OpenBLAS built with THREAD_TIMEOUT=1 instead of relying on the _distributor_init

That'd be quite a bit more work though, and repeating effort as an extra build for each new OpenBLAS version. Do you see an upside of this?

ogrisel · 2021-11-18T10:25:42Z

Do you see an upside of this?

To avoid having the scipy import have a side effect on other things running in the same Python program or subprocesses.

Although it's quite unlikely to cause big problems anyways in my opinion.

ogrisel · 2021-11-18T10:33:29Z

For a reason I don't understand I do not observe the a similar slowdown on a non Apple M1 platform. For instance with the same version of the numpy/scipy wheels on a linux/x86_64 machine with a large number of cores, I always get good speed when running the above snippet: approximately the same speed with and without the OPENBLAS_THREAD_TIMEOUT=1 env variable. Here is the machine runtime:

$  python -m threadpoolctl -i scipy.linalg
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/storage/store/work/ogrisel/mambaforge/envs/scipydev/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-2f7c42d4.3.18.so",
    "version": "0.3.18",
    "threading_layer": "pthreads",
    "architecture": "Haswell",
    "num_threads": 64
  },
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/storage/store/work/ogrisel/mambaforge/envs/scipydev/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so",
    "version": "0.3.17",
    "threading_layer": "pthreads",
    "architecture": "Haswell",
    "num_threads": 64
  }
]

I also tried with taskset -c 0-7 python ... to use only 8 threads as on the Apple M1 and I get the same result (no problem on this platform).

So maybe we do not fully understand the root cause of the issue...

rgommers · 2021-11-18T10:43:02Z

For a reason I don't understand I do not observe the a similar slowdown on a non Apple M1 platform.

Yes indeed, it should be some OS level thing I think, because this same release config has been working fine AFAIK for years on Intel Macs, as well as on Windows and Linux when installed from wheels. And even on Apple M1 it does work fine when mixing in the Homebrew openblas, as noted in the issue description - so it looks like some pretty specific set of factors causing this problem.

So maybe we do not fully understand the root cause of the issue...

I've asked @Developer-Ecosystem-Engineering if they have any insights here.

rgommers · 2021-11-18T10:45:50Z

Do you see an upside of this?

To avoid having the scipy import have a side effect on other things running in the same Python program or subprocesses.

Although it's quite unlikely to cause big problems anyways in my opinion.

That's the question - if it makes sense to do that for all platforms, then we can just make the change in https://github.com/MacPython/openblas-libs/. But I'd be a little reluctant to change it on other platforms, unless we have a reason to. Even if it has no impact, it's again one of those "what if it did matter" changes when investigating the next problem.

ogrisel · 2021-11-18T10:57:36Z

That's the question - if it makes sense to do that for all platforms, then we can just make the change in https://github.com/MacPython/openblas-libs/. But I'd be a little reluctant to change it on other platforms, unless we have a reason to.

One reason would be to workaround the bad interaction with OpenMP runtimes in C/C++/Cython code that does:

loop until convergence:
   call nested OpenMP parallel loop
   call BLAS operation parallelized with the native OpenBLAS threading layer

as described in OpenMathLib/OpenBLAS#3187 and observed in scikit-learn that uses the scipy OpenBLAS via the Cython BLAS API of scipy: scikit-learn/scikit-learn#20642.

However to fully solve the problem observed in scikit-learn we would also need to make sure that the OpenMP runtime linked in scikit-learn does not use spinning waits either.

Note that this OpenBLAS/OpenMP problem impacts the linux platform.

Even if it has no impact, it's again one of those "what if it did matter" changes when investigating the next problem.

I share the feeling.

…nce issues Addresses at least part of scipy/scipy#15050, TBD if that closes it or if we need to rebuild OpenBLAS instead later. For now (and for the 1.7.3 release) this is the safest and quickest improvement.

rgommers · 2021-11-21T09:09:46Z

Opened MacPython/scipy-wheels#143 for the environment variable solution. I'd like to go with that for 1.7.3. To be discussed I'd say if we keep that or if we want to rebuild OpenBLAS for 1.8.0.

rgommers · 2021-11-21T09:16:50Z

One reason would be to workaround the bad interaction with OpenMP runtimes in C/C++/Cython code that does:

So it seems like we should then ship that rebuilt OpenBLAS in both NumPy and SciPy. That'd be one way we could go.

The other, larger change that could be make is to ensure there's only a single OpenBLAS loaded by:

building OpenBLAS with openmp rather than pthreads
shipping a separate OpenMP wheel (libgomp)
shipping a separate OpenBLAS wheel depending on OpenMP
making both NumPy and SciPy depend on OpenBLAS, with some well-thought out versioning strategy.

It'd be the more correct thing to do from a first principles perspective, but I can't say I'd be enthusiastic about that particular house of cards. PyPI's model is unsuitable for this kind of thing, and it encourages doing this also for even more complex stacks (like the geospatial one), which is a bad idea.

So my vote would be for rebuilding OpenBLAS if that solves this particular scikit-learn issue.

ogrisel · 2021-11-22T09:22:41Z

Opened MacPython/scipy-wheels#143 for the environment variable solution. I'd like to go with that for 1.7.3. To be discussed I'd say if we keep that or if we want to rebuild OpenBLAS for 1.8.0.

Indeed for a minor release, a minimal fix only for the macos m1 problem is best.

For the more medium term plan, maybe we could do this series of change more progressively. Here are some suggestions:

Build OpenBLAS with THREAD_TIMEOUT=1 for the wheels will probably help for the short to medium term on making it possible to workaround problems such as Poor performance of sklearn.cluster.KMeans for numpy >= 1.19.0 scikit-learn/scikit-learn#20642 when there are several threading runtimes linked to the same Python process.

For the even longer term we could try plan changes to completely avoid redundancy of OpenBLAS and OpenMP wheels in the scipy stack. First for OpenBLAS itself:

Expose the Cython BLAS API in numpy.linalg the same way scipy.linalg does it;
Make scipy use the BLAS implementation of numpy and stop shipping it (while continuing exposing it via the Cython BLAS API replicated in scipy for backward compat reasons). This means that pypi will stop shipping two copies of OpenBLAS for the scipy stack which should help with maintenance in the long run (and reduce a bit the scipy wheel sizes which is always nice).
This way numpy would be the official BLAS implementation provider for the full scipy stack.

About the switch from native OpenBLAS threads to OpenMP, this would be nice for scikit-learn but libgomp is still not fork safe. conda-forge typically replaces libgomp by llvm-openmp by default for this reason. Note that on linux, gcc is still the recommended compiler for conda-forge, which means that the library is built against libgomp but then the openmp runtime is replaced afterwards.

Also note, for some reason conda-forge decided to use OpenBLAS native threads by default on Linux while it's also possible to make it use OpenMP. See the 2020-07-17 entry of the conda-forge announcements. Maybe @isuruf can explain why not use LLVM OpenMP (edit: added missing LLVM) by default on Linux?

More details on OpenMP in conda-forge in https://conda-forge.org/docs/maintainer/knowledge_base.html#openmp

Maybe the topic of using OpenMP in the scipy stack deserves a Scientific Python SPEC.

ogrisel · 2021-11-22T09:52:55Z

@rgommers about the fix for scipy 1.7.3 and macos/arm64, as explained in MacPython/scipy-wheels#143 (comment), the fix is currently not effective because the _distributor_init.py file is currently only shipped in the windows wheels, not the macos wheels.

rgommers · 2021-11-22T12:23:01Z

This way numpy would be the official BLAS implementation provider for the full scipy stack.

This sounds less than ideal to me - we'd have to duplicate the Cython APIs, plus NumPy has far fewer BLAS/LAPACK needs than SciPy and is still on an older minimum required version than SciPy (that's why for example Accelerate is re-enabled in NumPy but cannot be in SciPy).

My first reaction is that shipping a standalone wheel would be preferable to this solution.

Maybe the topic of using OpenMP in the scipy stack deserves a Scientific Python SPEC.

Yes, I think it does. And it's broader than just packaging, there's an important UX question there as well that deserves thinking about - are we continuing to default to single-threaded everywhere, except for BLAS/LAPACK calls?

ogrisel · 2021-11-22T13:14:18Z

My first reaction is that shipping a standalone wheel would be preferable to this solution.

Fine with me as well.

are we continuing to default to single-threaded everywhere, except for BLAS/LAPACK calls?

I cannot say for scipy but for scikit-learn we have already started to use OpenMP multithreaded Cython code by default for some components and are progressively generalizing this approach in the code base.

ilayn · 2021-11-22T13:27:11Z

There are many places in SciPy where a multithreaded code can boost the performance greatly. Hence if there is a way to achieve it without hundreds of if OpenMP then do prange else range type of instructions then +1 from me too. But the points raised in #10239 should probably be addressed first.

ogrisel · 2021-11-22T13:51:29Z

Cython's prange works fine (automatic sequential fallback) if openmp is not configured at build time. No need to change the source code, just the build flags. Here is how we do it in scikit-learn to fallback in sequential mode if the C/C++ compilers do not support OpenMP:

https://github.com/scikit-learn/scikit-learn/search?q=_OPENMP_SUPPORTED&type=code

ogrisel · 2021-11-25T16:31:20Z

But that command built scipy from source, right? How did you install OpenBLAS in this case? You can introspect which OpenBLAS is linked into your scipy 1.7.2 with python -m threadpoolctl -i scipy.linalg.

ogrisel · 2021-11-25T16:35:41Z

Ok so you built a scipy 1.7.2 from source that linked against an OpenBLAS lib installed separately with homebrew.

The performance problem is probably coming from the fact that their was a packaging problem that append when generating the 1.7.3 wheels for macos/arm64 as explained in #15050 (comment).

ogrisel · 2021-11-25T16:37:15Z

It's interesting to notice that OpenBLAS from homebrew that is using OpenMP as its threading layer does not suffer from the slowdown. I don't know why it's the case.

sunilshah · 2021-11-25T16:37:30Z

But that command built scipy from source, right? How did you install OpenBLAS in this case? You can introspect which OpenBLAS is linked into your scipy 1.7.2 with python -m threadpoolctl -i scipy.linalg.

Sorry, I did not bother to replicate what I had on my configuration from what I had in #14688.

sunilshah · 2021-11-25T16:41:03Z

It's interesting to notice that OpenBLAS from homebrew that is using OpenMP as its threading layer does not suffer from the slowdown. I don't know why it's the case.

This was exactly why I was worried about degrading performance with the new version of the wheel in my comment in #14688.

ogrisel · 2021-11-25T16:54:19Z

My point is that if you want to upgrade to 1.7.3 while still linking to your OpenBLAS from homebrew, you can still build from source:

pip3 install --no-use-pep517 --no-use-wheel  scipy==1.7.3

you do not need to downgrade, just avoid installing the faulty wheel file.

sunilshah · 2021-11-25T17:00:14Z

My point is that if you want to upgrade to 1.7.3 while still linking to your OpenBLAS from homebrew, you can still build from source:
pip3 install --no-use-pep517 --no-use-wheel  scipy==1.7.3
you do not need to downgrade, just avoid installing the faulty wheel file.

Did you mean --no-binary rather than --no-use-wheel ?

ogrisel · 2021-11-25T17:00:59Z

Maybe even --no-binary :all: nowadays, I am not sure :)

sunilshah · 2021-11-25T17:53:27Z

Maybe even --no-binary :all: nowadays, I am not sure :)

Did not succeed to build from source trying different options for --no-binary . Reverted back to 1.7.2 for now.

tylerjereddy · 2021-11-27T14:55:08Z

the _distributor_init.py did not make it into the final 1.7.3 wheel for macos/arm64:

@ogrisel @rgommers hmm really? I backported the approved change that was supposed to do that. If this is true I'd be inclined to add a CI test that checks for the proper inclusion in wheels repo then.

* following on from continued problems reported here: scipy/scipy#15050 * draft in code to check that MacOS wheels get the same `_distributor_init.py` as the one present in the wheels repo, rather than the generic nearly-empty copy actually present in the `1.7.3` release * we've had two pull requests about this and it slipped through into a release despite 3 sets of eyes so need to be careful here; I'd like to the test script added here failing for MacOS in CI like it does locally, before attempting a fix * this is likely still wip, since relative paths/directories can be confusing to predict in the wheels/multibuild control flow, which also flushes in and out of containers I think, etc. (this probably contributed to the original problem) * the script did seem to work in a quick local check though, complaining about our new MacOS file inclusion, but happy with the long-standing windows wheel inclusion for dirs containing wheels for each OS type: `python verify_init.py /Users/treddy/rough_work/scipy/release_173/release/installers/windows` `python verify_init.py /Users/treddy/rough_work/scipy/release_173/release/installers/linux` `python verify_init.py /Users/treddy/rough_work/scipy/release_173/release/installers/mac` ``` Traceback (most recent call last): File "verify_init.py", line 40, in <module> check_for_dist_init(input_dir=input_dir) File "verify_init.py", line 34, in check_for_dist_init raise ValueError(f"Contents of _distributor_init.py incorrect for {wheel_file}") ValueError: Contents of _distributor_init.py incorrect for /Users/treddy/rough_work/scipy/release_173/release/installers/mac/scipy-1.7.3-cp39-cp39-macosx_10_9_x86_64.whl ```

tylerjereddy · 2021-11-28T02:34:01Z

Ok, my draft PR (MacPython/scipy-wheels#151) has a test that seems to correctly detect the issue for MacOS now:

ValueError: Contents of _distributor_init.py incorrect for wheelhouse/scipy-1.8.0.dev0+2099.3f05efb-cp310-cp310-macosx_12_0_arm64.whl

I still need to see if it behaves "ok" for other platforms/matrix entries. Once it does, applying the "actual fix" to include the correct distribution file should allow that test to pass.

isuruf · 2021-11-28T03:29:40Z

https://github.com/MacPython/scipy-wheels/blob/58e0b1dde52fa197bc2bc4b031eadfe5ddfa3532/patch_code.sh#L16 needs another scipy there. cp _distributor_init.py scipy/scipy/_distributor_init.py

rgommers · 2021-11-28T11:45:15Z

needs another scipy there. cp _distributor_init.py scipy/scipy/_distributor_init.py

Indeed, that was the issue - plus a "get this done before going offline for several days", so I didn't get a chance to test the pre-release wheels.

The fix here should be to upload new macOS arm64 wheels for 1.7.3 with a name that adds a build number of 1. We don't need a new release.

charris · 2021-11-28T14:41:06Z

with a name that adds a build number of 1

That's good as long as you don't need to update the source distribution.

rgommers · 2021-11-28T17:46:42Z

That's good as long as you don't need to update the source distribution.

Indeed we don't, it's a small fix in the scipy-wheels repo only.

* following on from continued problems reported here: scipy/scipy#15050 * draft in code to check that MacOS wheels get the same `_distributor_init.py` as the one present in the wheels repo, rather than the generic nearly-empty copy actually present in the `1.7.3` release * we've had two pull requests about this and it slipped through into a release despite 3 sets of eyes so need to be careful here; I'd like to the test script added here failing for MacOS in CI like it does locally, before attempting a fix * this is likely still wip, since relative paths/directories can be confusing to predict in the wheels/multibuild control flow, which also flushes in and out of containers I think, etc. (this probably contributed to the original problem) * the script did seem to work in a quick local check though, complaining about our new MacOS file inclusion, but happy with the long-standing windows wheel inclusion for dirs containing wheels for each OS type: `python verify_init.py /Users/treddy/rough_work/scipy/release_173/release/installers/windows` `python verify_init.py /Users/treddy/rough_work/scipy/release_173/release/installers/linux` `python verify_init.py /Users/treddy/rough_work/scipy/release_173/release/installers/mac` ``` Traceback (most recent call last): File "verify_init.py", line 40, in <module> check_for_dist_init(input_dir=input_dir) File "verify_init.py", line 34, in check_for_dist_init raise ValueError(f"Contents of _distributor_init.py incorrect for {wheel_file}") ValueError: Contents of _distributor_init.py incorrect for /Users/treddy/rough_work/scipy/release_173/release/installers/mac/scipy-1.7.3-cp39-cp39-macosx_10_9_x86_64.whl ```

tylerjereddy · 2021-11-29T02:38:38Z

Hi again, the new wheels for 1.7.3 MacOS arm64 have been uploaded--it would be great to get confirmation that things are looking better performance-wise with these new binaries, which should now be automatically preferred by pip.

For consistency with our normal security/hash disclosure practices:

scipy-1.7.3-1-cp310-cp310-macosx_12_0_arm64.whl
- sha256sum: c9e04d7e9b03a8a6ac2045f7c5ef741be86727d8f49c45db45f244bdd2bcff17
- md5sum: 10de5900c83d976142f44b50e42315bf
scipy-1.7.3-1-cp38-cp38-macosx_12_0_arm64.whl
- sha256sum: b0e0aeb061a1d7dcd2ed59ea57ee56c9b23dd60100825f98238c06ee5cc4467e
- md5sum: 5019b973dbe57508920a5db3dff44d9d
scipy-1.7.3-1-cp39-cp39-macosx_12_0_arm64.whl
- sha256sum: b78a35c5c74d336f42f44106174b9851c783184a85a3fe3e68857259b37b9ffb
- md5sum: b6c67a4c972bdb712032bb961c702f16

psobolewskiPhD · 2021-11-29T06:50:55Z

Hi again, the new wheels for 1.7.3 MacOS arm64 have been uploaded--it would be great to get confirmation that things are looking better performance-wise with these new binaries, which should now be automatically preferred by pip.

I can confirm using the new wheel
Downloading scipy-1.7.3-1-cp39-cp39-macosx_12_0_arm64.whl (27.0 MB)
In a fresh conda env on M1 macOS 12 fixes the performance issue:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.088 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.014 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.015 s

ogrisel · 2021-11-29T10:43:37Z

Confirmed!

sunilshah · 2021-11-30T01:53:56Z

I did manage to install scipy 1.7.3-1 using pip on my M1. OpenBLAS and python3 are installed using Homebrew.

Slowdown seen in 1.7.3 with more threads is no longer there. There is a mild slowdown after the threadcount goes above 5.

Downloading scipy-1.7.3-1-cp39-cp39-macosx_12_0_arm64.whl (27.0 MB)

rgommers · 2021-11-30T05:08:17Z

Thanks for the confirmations everyone. Let's call this good and close the issue. For the potential longer-term changes discussed around #15050 (comment), let's open a separate issue (I can do that now).

It would still be good to obtain a more fundamental explanation of what is going wrong at the OS level (and hopefully it's something Apple can make more robust in the future), but for now we're good on avoiding the issue.

ogrisel · 2021-12-02T22:42:30Z

I have also observed that when I run the full scikit-learn test suite with pytest-xdist (pytest -n 8 --pyargs sklearn) when numpy and scipy are installed from the wheels, the total runtime is significantly slower than if numpy and scipy are installed from conda-forge.

I suspect some over-subscription problems between pytest-xdist and the 2 OpenBLAS runtimes of numpy and scipy but I am not yet investigated. Reducing the number of threads with OPENBLAS_NUM_THREADS seems to help.

rgommers · 2021-12-03T15:15:13Z

I suspect some over-subscription problems between pytest-xdist and the 2 OpenBLAS runtimes of numpy and scipy but I am not yet investigated. Reducing the number of threads with OPENBLAS_NUM_THREADS seems to help.

That was my conclusion in #14425 (comment) as well. On CI the optimal setting seems to be OPENBLAS_NUM_THREADS='2'.

rgommers added defect A clear bug or issue that prevents SciPy from being installed or used as expected Official binaries Items related to the official SciPy binaries, including wheels and vendored libs labels Nov 17, 2021

rgommers added this to the 1.7.3 milestone Nov 17, 2021

rgommers mentioned this issue Nov 17, 2021

BUG: ARPACK's eigsh & OpenBLAS from Apple Silicon M1 (arm64) nightly build triggers macOS kernel panic / reboot #14688

Closed

rgommers mentioned this issue Nov 21, 2021

MAINT: set OPENBLAS_THREAD_TIMEOUT on arm64 macOS to avoid performance issues MacPython/scipy-wheels#143

Merged

tylerjereddy mentioned this issue Nov 28, 2021

MAINT, CI: check _distributor_init.py contents MacPython/scipy-wheels#151

Merged

tylerjereddy mentioned this issue Nov 28, 2021

MAINT, BUG: backport _distributor_init.py inclusion fix MacPython/scipy-wheels#152

Merged

rgommers closed this as completed Nov 30, 2021

rgommers mentioned this issue Nov 30, 2021

Consider unifying the two OpenBLAS libraries in NumPy and SciPy wheels Create OpenBLAS wheel #15129

Open

sunilshah mentioned this issue Dec 23, 2021

Tables crashes python with a run time error PyTables/PyTables#903

Closed

Performance issue on macOS arm64 (M1) when installing from wheels (2x libopenblas) #15050

Performance issue on macOS arm64 (M1) when installing from wheels (2x libopenblas) #15050

Comments

rgommers commented Nov 17, 2021 • edited

Issue reproducer

Performance impact

When the problem occurs

Context: why do 2 libopenblas copies get loaded

Possible root causes

Can we work around the problem?

Testing on other macOS arm64 build/install configurations

ogrisel commented Nov 17, 2021 • edited

ogrisel commented Nov 17, 2021 • edited

ogrisel commented Nov 17, 2021

rgommers commented Nov 17, 2021

rgommers commented Nov 17, 2021

ogrisel commented Nov 17, 2021 • edited

ogrisel commented Nov 17, 2021

ogrisel commented Nov 17, 2021 • edited

rgommers commented Nov 17, 2021

jeremiedbb commented Nov 17, 2021

rgommers commented Nov 17, 2021

ogrisel commented Nov 18, 2021 • edited

ogrisel commented Nov 18, 2021 • edited

rgommers commented Nov 18, 2021

rgommers commented Nov 18, 2021

ogrisel commented Nov 18, 2021 • edited

rgommers commented Nov 21, 2021

rgommers commented Nov 21, 2021

ogrisel commented Nov 22, 2021 • edited

ogrisel commented Nov 22, 2021

rgommers commented Nov 22, 2021

ogrisel commented Nov 22, 2021

ilayn commented Nov 22, 2021

ogrisel commented Nov 22, 2021 • edited

ogrisel commented Nov 25, 2021 • edited

ogrisel commented Nov 25, 2021

ogrisel commented Nov 25, 2021

sunilshah commented Nov 25, 2021

sunilshah commented Nov 25, 2021

ogrisel commented Nov 25, 2021 • edited

sunilshah commented Nov 25, 2021

ogrisel commented Nov 25, 2021 • edited

sunilshah commented Nov 25, 2021

tylerjereddy commented Nov 27, 2021

tylerjereddy commented Nov 28, 2021

isuruf commented Nov 28, 2021

rgommers commented Nov 28, 2021 • edited

charris commented Nov 28, 2021

rgommers commented Nov 28, 2021

tylerjereddy commented Nov 29, 2021

psobolewskiPhD commented Nov 29, 2021

ogrisel commented Nov 29, 2021

sunilshah commented Nov 30, 2021

rgommers commented Nov 30, 2021

ogrisel commented Dec 2, 2021

rgommers commented Dec 3, 2021

rgommers commented Nov 17, 2021 •

edited

ogrisel commented Nov 17, 2021 •

edited

ogrisel commented Nov 17, 2021 •

edited

ogrisel commented Nov 17, 2021 •

edited

ogrisel commented Nov 17, 2021 •

edited

ogrisel commented Nov 18, 2021 •

edited

ogrisel commented Nov 18, 2021 •

edited

ogrisel commented Nov 18, 2021 •

edited

ogrisel commented Nov 22, 2021 •

edited

ogrisel commented Nov 22, 2021 •

edited

ogrisel commented Nov 25, 2021 •

edited

ogrisel commented Nov 25, 2021 •

edited

ogrisel commented Nov 25, 2021 •

edited

rgommers commented Nov 28, 2021 •

edited