
Performance issue on macOS arm64 (M1) when installing from wheels (2x libopenblas) #15050

Closed
rgommers opened this issue Nov 17, 2021 · 55 comments

@rgommers
Member

rgommers commented Nov 17, 2021

This is a follow-up to gh-14688. That issue was originally about a kernel panic (fixed in macOS 12.0.1); after that fix, the same reproducer still showed severe performance issues. This issue is about those performance issues. Note that while the reproducer is the same, it is not clear whether the kernel panic and the performance issues share a root cause.

Issue reproducer

A reproducer (warning: do NOT run on macOS 11.x, it will crash the OS):

from time import perf_counter
import numpy as np
from scipy.sparse.linalg import eigsh


n_samples, n_features = 2000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_features))
K = X @ X.T

for i in range(10):
    print("running eigsh...")
    tic = perf_counter()
    s, _ = eigsh(K, 3, which="LA", tol=0)
    toc = perf_counter()
    print(f"computed {s} in {toc - tic:.3f} s")

Running scipy.test() or scipy.linalg.test() will also show a significant performance impact.

Performance impact

In situations where we hit the performance problem, the above code will show:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.062 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.888 s
...

And if we don't hit that problem:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.018 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.018 s
...

So a ~50x slowdown for this particular example.

There is in general an impact on functions that use BLAS/LAPACK. The impact on the total time taken by scipy.test() was about 30% (311 sec. with default settings, 234 sec. when using OPENBLAS_NUM_THREADS=1) - note that this was just a single test on one build config, so results may vary: #14688 (comment). The single-threaded case has timings similar to running the test suite on a scipy install that doesn't show the problem at all (~240 sec. seems to be the expected time on arm64 macOS, independent of the threading setting because the test arrays are always small). Important: ensure pytest-xdist is not installed when timing the test suite (see gh-14425 for why).
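
For reference, a minimal way to compare these two timings yourself (a sketch, assuming pytest is available, pytest-xdist is not installed, and the environment variable is set before numpy/scipy are first imported):

import os
# Must happen before the first import of numpy/scipy, otherwise the vendored
# libopenblas copies are already loaded and ignore the setting.
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # remove this line for the default multithreaded run

from time import perf_counter
import scipy

tic = perf_counter()
scipy.test()  # or scipy.linalg.test() for a quicker check
print(f"test suite took {perf_counter() - tic:.1f} s")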

When the problem occurs

The discussion in gh-14688 showed that this problem gets hit when two copies of libopenblas get loaded. The following configurations have shown the problem so far:

  • Installing both numpy and scipy from a wheel (e.g., numpy 1.21.4 from PyPI and the latest 1.8.0.dev0 wheel from https://anaconda.org/scipy-wheels-nightly/scipy/)
  • Installing numpy 1.21.4 from PyPI and installing scipy locally when built against conda-forge's openblas.

These configurations did not show a problem:

  • Installing numpy 1.21.4 from PyPI and installing scipy locally when built against Homebrew's openblas.
  • Any situation where only a single libopenblas is loaded.

It is unclear right now what the exact root cause is. The situation when using conda-forge's openblas is very similar to that with Homebrew's openblas, yet only one of them triggers the issue. The most important situation is installing both NumPy and SciPy from wheels though, since that's what the vast majority of pip/PyPI users will get.

A difference between conda-forge and Homebrew that may be relevant is that the former uses @rpath and the latter a hardcoded path to load libopenblas:

% # conda-forge
% otool -L _fblas.cpython-39-darwin.so
_fblas.cpython-39-darwin.so:
	@rpath/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)

% # Homebrew
% otool -L  /opt/homebrew/lib/python3.9/site-packages/scipy/linalg/_fblas.cpython-*-darwin.so
/opt/homebrew/lib/python3.9/site-packages/scipy/linalg/_fblas.cpython-39-darwin.so:
	/opt/homebrew/opt/openblas/lib/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	/opt/homebrew/opt/gcc/lib/gcc/11/libgfortran.5.dylib (compatibility version 6.0.0, current version 6.0.0)
	/opt/homebrew/opt/gcc/lib/gcc/11/libgcc_s.1.1.dylib (compatibility version 1.0.0, current version 1.1.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)

That may not be the only relevant difference; e.g., the compilers used to build libopenblas and scipy were not the same. Also, libopenblas can be built with either pthreads or OpenMP - the numpy and scipy wheels use pthreads, while conda-forge and Homebrew both use OpenMP.

To check if two libopenblas libraries get loaded, use:

❯ python -m threadpoolctl -i scipy.linalg
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/ogrisel/mambaforge/envs/tmp/lib/python3.9/site-packages/numpy/.dylibs/libopenblas64_.0.dylib",
    "version": "0.3.18",
    "threading_layer": "pthreads",
    "architecture": "armv8",
    "num_threads": 8
  },
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/ogrisel/mambaforge/envs/tmp/lib/python3.9/site-packages/scipy/.dylibs/libopenblas.0.dylib",
    "version": "0.3.17",
    "threading_layer": "pthreads",
    "architecture": "armv8",
    "num_threads": 8
  }
]
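
The same check can also be done programmatically; here is a small sketch using threadpoolctl's threadpool_info() function:

import numpy  # loads numpy's vendored libopenblas
import scipy.linalg  # loads scipy's vendored libopenblas
from threadpoolctl import threadpool_info

openblas_libs = [lib for lib in threadpool_info() if lib["internal_api"] == "openblas"]
print(f"number of OpenBLAS libraries loaded: {len(openblas_libs)}")
for lib in openblas_libs:
    print(lib["filepath"], lib["version"], lib["threading_layer"])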

Context: why do 2 libopenblas copies get loaded

The reason is that the NumPy and SciPy wheels both vendor a copy of libopenblas within them, and extension modules that need libopenblas are depending directly on that vendored copy:

% cd /path/to/site-packages/scipy/linalg
% otool -L _fblas.cpython-39-darwin.so 
_fblas.cpython-39-darwin.so:
	@loader_path/../.dylibs/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	@loader_path/../.dylibs/libgfortran.5.dylib (compatibility version 6.0.0, current version 6.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.60.1)

% cd ../../numpy/linalg
% otool -L _umath_linalg.cpython-39-darwin.so 
_umath_linalg.cpython-39-darwin.so:
	@loader_path/../.dylibs/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.60.1)

This is how we have been shipping wheels for years, and it works fine across Windows, Linux and macOS. It seems like a weird thing to do of course (if you know how package managers work but are new to PyPI/wheels) - it's a long story, but the tl;dr is that PyPI wasn't designed with non-Python dependencies in mind, so the usual approach is to bundle those all into a wheel (it tends to work, unless you have complex non-Python dependencies). It'd be very much nontrivial to do any kind of unbundling here, and doing so would break situations where numpy and scipy are not installed in the same way (e.g., the former from conda-forge/Homebrew, the latter from PyPI).

Possible root causes

The kernel panic had to do with spin locks apparently. It is not clear if the performance issues are also due to that, or have a completely different root cause. It does seem to be the case that two copies of the same shared library with the same version (all are libopenblas.0.dylib) cause a conflict at the OS level somehow. Anything beyond that is speculation at this point.

Can we work around the problem?

If we release wheels for macOS 12, many people are going to hit this problem. A 50x slowdown for some code using linalg functionality, for the default install configuration of pip install numpy scipy, does not seem acceptable - it will send too many users on wild goose chases. On the other hand, it should be pointed out that if users build SciPy 1.7.2 from source on a native arm64 Python install, they will hit the same problem anyway. So not releasing any wheels isn't much better; at best it signals to users that they shouldn't use arm64 just yet but stick with x86_64 (which has some performance implications as well).

At this point it looks like controlling the number of threads that OpenBLAS uses is the way we can work around this problem (or let users do so). Ways to control threading:

  • Use threadpoolctl (see the README at https://github.com/joblib/threadpoolctl for how)
  • Set an environment variable to control the behavior, e.g. OPENBLAS_NUM_THREADS
  • Rebuild the libopenblas we bundle in the wheel to have a max number of threads of 1, 2, or 4.

SciPy doesn't have a threadpoolctl runtime dependency, and it doesn't seem desirable to add one just for this issue. Note though that gh-14441 aims to add it as an optional dependency to improve test suite parallelism, and longer term we perhaps do want that dependency. Also, scikit-learn has a hard dependency on it, so many users will already have it installed.
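
For reference, limiting the loaded BLAS libraries with threadpoolctl would look roughly like this (a sketch based on the reproducer above; the limit applies to both vendored libopenblas copies):

import numpy as np
from scipy.sparse.linalg import eigsh
from threadpoolctl import threadpool_limits

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
K = X @ X.T

# Limit every loaded BLAS (the numpy and the scipy copies of libopenblas)
# to a single thread for the duration of the block.
with threadpool_limits(limits=1, user_api="blas"):
    s, _ = eigsh(K, 3, which="LA", tol=0)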

Rebuilding libopenblas with a low maximum number of threads would prevent users who know what they are doing, or who don't suffer from the problem, from optimizing threading behavior for their own code. It was pointed out in #14688 (comment) that this is undesirable.

Setting an environment variable is also not a great thing to do (a library should normally never do this), but if it works to do so in scipy/__init__.py then that may be the most pragmatic solution right now. However, this must be done before libopenblas is first loaded or it won't take effect. So if users import numpy first, setting an env var will have no effect on that copy of libopenblas. It needs testing whether this then still works around the problem.

Note: I wanted to have everything in one place, but let's discuss the release strategy on the mailing list (link to thread), and the actual performance issue here.

Testing on other macOS arm64 build/install configurations

Request: if you have a build config on macOS arm64 that is not covered by the above summary yet, please run the following and reply on this issue with the results:

% python -m threadpoolctl -i scipy.linalg

% cd /PATH/TO/scipy/linalg
% otool -L _fblas.cpython-*-darwin.so

% cd /PATH/TO/numpy/linalg
% otool -L _umath_linalg.cpython-*-darwin.so

% # Run the reproducer (again, only on macOS 12 - you will trigger an OS
% # crash on macOS 11.x!) and report if the time per `eigsh` call is ~0.02 sec. or ~1 sec.

% pip list    # if using pip for everything
% conda list  # if using conda
@rgommers rgommers added defect A clear bug or issue that prevents SciPy from being installed or used as expected Official binaries Items related to the official SciPy binaries, including wheels and vendored libs labels Nov 17, 2021
@rgommers rgommers added this to the 1.7.3 milestone Nov 17, 2021
@ogrisel
Contributor

ogrisel commented Nov 17, 2021

Food for thought: limiting the number of threads to 4 for the scipy OpenBLAS while letting the numpy OpenBLAS use 8 threads improves the speed a bit, but performance is still very bad compared to limiting both runtimes.

Here is the variation of the benchmark code I used:

from time import perf_counter
import numpy as np
import scipy
from pathlib import Path
from scipy.sparse.linalg import eigsh
from pprint import pprint
from threadpoolctl import ThreadpoolController


openblas_controller = ThreadpoolController().select(internal_api="openblas")

all_openblas_libs = openblas_controller.info()
print("All linked OpenBLAS libraries:")
pprint(all_openblas_libs)

scipy_path = str(Path(scipy.__file__).parent.absolute())
print(f"scipy installed under {scipy_path}")
scipy_shipped_openblas_libs = [
    lib for lib in all_openblas_libs if lib["filepath"].startswith(scipy_path)
]
print("All scipy shipped libraries:")
pprint(scipy_shipped_openblas_libs)

# Limiting all OpenBLAS libraries (from numpy and scipy) makes the eigsh
# call fast.
# openblas_controller.limit(limits=4)

# Limiting only scipy's OpenBLAS to 4 threads is not enough to make eigsh
# run fast.
assert len(scipy_shipped_openblas_libs) == 1
scipy_openblas_path = scipy_shipped_openblas_libs[0]["filepath"]
scipy_openblas_controller = openblas_controller.select(filepath=scipy_openblas_path)
pprint(scipy_openblas_controller.info())
scipy_openblas_controller.limit(limits=4)

n_samples, n_features = 2000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_features))
K = X @ X.T

for i in range(10):
    print("running eigsh...")
    tic = perf_counter()
    s, _ = eigsh(K, 3, which="LA", tol=0)
    toc = perf_counter()
    print(f"computed {s} in {toc - tic:.3f} s")

and the results:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.280 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.511 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.393 s
...

@ogrisel
Contributor

ogrisel commented Nov 17, 2021

Setting OPENBLAS_THREAD_TIMEOUT=1 without limiting the number of threads fixes the problem!

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.021 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.014 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.016 s
...

@ogrisel
Contributor

ogrisel commented Nov 17, 2021

So the root cause is likely exactly the same as in OpenMathLib/OpenBLAS#3187, but this time it's not between OpenMP's threadpool and OpenBLAS's threadpool, but between two OpenBLAS threadpools, with sequential BLAS calls in scipy and in numpy (the latter called from scipy).

@rgommers
Member Author

The code is a little obscure, and it's unclear to me what that does exactly - if the timeout is really short, is there a risk that this will work for the tests but fail if we pass in really large arrays?

Also, is this something threadpoolctl can control? Otherwise we're back to the problem of needing to set an env var before the first import of both numpy and scipy.

@ogrisel
Contributor

ogrisel commented Nov 17, 2021

Also, is this something threadpoolctl can control?

Not at the moment, and I am not even sure there is a public function in the OpenBLAS API to do it via ctypes.

But we could set os.environ["OPENBLAS_THREAD_TIMEOUT"] = "1" in the _distributor_init.py of the numpy and scipy wheels, at least for the macos/arm64 platform for now and maybe for other platforms later.
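
For illustration only, a hypothetical sketch of what such a platform-guarded setting in a wheel-only _distributor_init.py could look like (the actual file shipped in the wheels may differ):

# Hypothetical sketch, not the actual _distributor_init.py shipped in the wheels.
import os
import platform

if platform.system() == "Darwin" and platform.machine() == "arm64":
    # Must run before any extension module loads libopenblas, and should not
    # override a value the user has already chosen.
    os.environ.setdefault("OPENBLAS_THREAD_TIMEOUT", "1")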

@ogrisel
Contributor

ogrisel commented Nov 17, 2021

Note that setting this only in scipy is already very good. This can be simulated with the following import order:

from time import perf_counter
import numpy as np
import os
os.environ["OPENBLAS_THREAD_TIMEOUT"] = "1"
from scipy.sparse.linalg import eigsh


n_samples, n_features = 2000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_features))
K = X @ X.T

for i in range(3):
    print("running eigsh...")
    tic = perf_counter()
    s, _ = eigsh(K, 3, which="LA", tol=0)
    toc = perf_counter()
    print(f"computed {s} in {toc - tic:.3f} s")

which yields:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.115 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.016 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.015 s

@ogrisel
Contributor

ogrisel commented Nov 17, 2021

openblas_set_thread_timeout does not seem to exist:

>>> import scipy.linalg
>>> from threadpoolctl import ThreadpoolController
>>> ctl = ThreadpoolController()
>>> ob_ctl = ctl.lib_controllers[0]
>>> ob_ctl._dynlib.openblas_set_thread_timeout(1)
Traceback (most recent call last):
  File "<ipython-input-16-d2dac1caca9a>", line 1, in <module>
    ob_ctl._dynlib.openblas_set_thread_timeout(1)
  File "/Users/ogrisel/mambaforge/envs/scipydev/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/Users/ogrisel/mambaforge/envs/scipydev/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: dlsym(0x20208d5a0, openblas_set_thread_timeout): symbol not found

I tried with the openblas from numpy, which is 0.3.18.

They probably do not want to expose such an API because, if the threadpool has already been initialized, it would force a wait for the termination of the existing threads.

@rgommers
Member Author

Also, is this something threadpoolctl can control?

Not at the moment, and I am not even sure there is a public function in the OpenBLAS API to do it via ctypes.

But we could set os.environ["OPENBLAS_THREAD_TIMEOUT"] = "1" in the _distributor_init.py of the numpy and scipy wheels, at least for the macos/arm64 platform for now and maybe for other platforms later.

That's very helpful, then we don't need a new NumPy release.

They probably do not want to expose such an API because if the threadpool has already been initialized that would force wait for the termination of existing threads.

Makes sense.

@jeremiedbb
Contributor

But we could set os.environ["OPENBLAS_THREAD_TIMEOUT"] = "1" in the _distributor_init.py of the numpy and scipy wheels, at least for the macos/arm64 platform for now and maybe for other platforms later.

You could also consider shipping an OpenBLAS built with THREAD_TIMEOUT=1 instead of relying on the _distributor_init

@rgommers
Member Author

You could also consider shipping an OpenBLAS built with THREAD_TIMEOUT=1 instead of relying on the _distributor_init

That'd be quite a bit more work though, and it would mean repeating that effort with an extra build for each new OpenBLAS version. Do you see an upside of this?

@ogrisel
Contributor

ogrisel commented Nov 18, 2021

Do you see an upside of this?

To avoid having the scipy import have a side effect on other things running in the same Python program or subprocesses.

Although it's quite unlikely to cause big problems anyways in my opinion.

@ogrisel
Contributor

ogrisel commented Nov 18, 2021

For a reason I don't understand, I do not observe a similar slowdown on a non-Apple-M1 platform. For instance, with the same versions of the numpy/scipy wheels on a linux/x86_64 machine with a large number of cores, I always get good speed when running the above snippet: approximately the same speed with and without the OPENBLAS_THREAD_TIMEOUT=1 env variable. Here is the machine's runtime configuration:

$  python -m threadpoolctl -i scipy.linalg
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/storage/store/work/ogrisel/mambaforge/envs/scipydev/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-2f7c42d4.3.18.so",
    "version": "0.3.18",
    "threading_layer": "pthreads",
    "architecture": "Haswell",
    "num_threads": 64
  },
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/storage/store/work/ogrisel/mambaforge/envs/scipydev/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so",
    "version": "0.3.17",
    "threading_layer": "pthreads",
    "architecture": "Haswell",
    "num_threads": 64
  }
]

I also tried with taskset -c 0-7 python ... to use only 8 threads as on the Apple M1 and I get the same result (no problem on this platform).

So maybe we do not fully understand the root cause of the issue...

@rgommers
Member Author

For a reason I don't understand, I do not observe a similar slowdown on a non-Apple-M1 platform.

Yes indeed, it should be some OS-level thing I think, because this same release config has been working fine AFAIK for years on Intel Macs, as well as on Windows and Linux when installed from wheels. And even on Apple M1 it does work fine when mixing in the Homebrew openblas, as noted in the issue description - so it looks like a pretty specific set of factors is causing this problem.

So maybe we do not fully understand the root cause of the issue...

I've asked @Developer-Ecosystem-Engineering if they have any insights here.

@rgommers
Member Author

Do you see an upside of this?

To avoid having the scipy import have a side effect on other things running in the same Python program or subprocesses.

Although it's quite unlikely to cause big problems anyways in my opinion.

That's the question - if it makes sense to do that for all platforms, then we can just make the change in https://github.com/MacPython/openblas-libs/. But I'd be a little reluctant to change it on other platforms, unless we have a reason to. Even if it has no impact, it's again one of those "what if it did matter" changes when investigating the next problem.

@ogrisel
Contributor

ogrisel commented Nov 18, 2021

That's the question - if it makes sense to do that for all platforms, then we can just make the change in https://github.com/MacPython/openblas-libs/. But I'd be a little reluctant to change it on other platforms, unless we have a reason to.

One reason would be to work around the bad interaction with OpenMP runtimes in C/C++/Cython code that does:

loop until convergence:
   call nested OpenMP parallel loop
   call BLAS operation parallelized with the native OpenBLAS threading layer

as described in OpenMathLib/OpenBLAS#3187 and observed in scikit-learn, which uses the scipy OpenBLAS via scipy's Cython BLAS API: scikit-learn/scikit-learn#20642.

However, to fully solve the problem observed in scikit-learn, we would also need to make sure that the OpenMP runtime linked into scikit-learn does not use spin waits either.

Note that this OpenBLAS/OpenMP problem impacts the linux platform.

Even if it has no impact, it's again one of those "what if it did matter" changes when investigating the next problem.

I share the feeling.

rgommers added a commit to rgommers/scipy-wheels that referenced this issue Nov 21, 2021
…nce issues

Addresses at least part of scipy/scipy#15050,
TBD if that closes it or if we need to rebuild OpenBLAS instead later.
For now (and for the 1.7.3 release) this is the safest and quickest
improvement.
@rgommers
Member Author

Opened MacPython/scipy-wheels#143 for the environment variable solution. I'd like to go with that for 1.7.3. To be discussed I'd say if we keep that or if we want to rebuild OpenBLAS for 1.8.0.

@rgommers
Member Author

One reason would be to work around the bad interaction with OpenMP runtimes in C/C++/Cython code that does:

So it seems like we should then ship that rebuilt OpenBLAS in both NumPy and SciPy. That'd be one way we could go.

The other, larger change that could be made is to ensure there's only a single OpenBLAS loaded, by:

  • building OpenBLAS with openmp rather than pthreads
  • shipping a separate OpenMP wheel (libgomp)
  • shipping a separate OpenBLAS wheel depending on OpenMP
  • making both NumPy and SciPy depend on OpenBLAS, with some well-thought out versioning strategy.

It'd be the more correct thing to do from a first principles perspective, but I can't say I'd be enthusiastic about that particular house of cards. PyPI's model is unsuitable for this kind of thing, and it encourages doing this also for even more complex stacks (like the geospatial one), which is a bad idea.

So my vote would be for rebuilding OpenBLAS if that solves this particular scikit-learn issue.

@ogrisel
Contributor

ogrisel commented Nov 22, 2021

Opened MacPython/scipy-wheels#143 for the environment variable solution. I'd like to go with that for 1.7.3. To be discussed I'd say if we keep that or if we want to rebuild OpenBLAS for 1.8.0.

Indeed for a minor release, a minimal fix only for the macos m1 problem is best.

For the medium-term plan, maybe we could make this series of changes more progressively. Here are some suggestions:

For the even longer term, we could try to plan changes to completely avoid redundant OpenBLAS and OpenMP copies in the scipy stack. First, for OpenBLAS itself:

  • Expose the Cython BLAS API in numpy.linalg the same way scipy.linalg does it (see the small illustration of the existing scipy wrappers after this list);
  • Make scipy use the BLAS implementation of numpy and stop shipping its own (while continuing to expose it via the Cython BLAS API replicated in scipy for backward-compat reasons). This means that PyPI would stop shipping two copies of OpenBLAS for the scipy stack, which should help with maintenance in the long run (and reduce the scipy wheel sizes a bit, which is always nice).
  • This way numpy would be the official BLAS implementation provider for the full scipy stack.
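
For context, scipy already exposes Python-level BLAS wrappers in scipy.linalg.blas (the Cython-level equivalents live in scipy.linalg.cython_blas); a small illustration of the existing wrappers, not of the proposal itself:

import numpy as np
from scipy.linalg.blas import dgemm  # Python-level wrapper around scipy's bundled BLAS

a = np.random.rand(3, 4)
b = np.random.rand(4, 5)
c = dgemm(alpha=1.0, a=a, b=b)  # calls whichever libopenblas scipy is linked against
print(c.shape)  # (3, 5)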

About the switch from native OpenBLAS threads to OpenMP: this would be nice for scikit-learn, but libgomp is still not fork-safe. conda-forge typically replaces libgomp with llvm-openmp by default for this reason. Note that on Linux, gcc is still the recommended compiler for conda-forge, which means the library is built against libgomp but the OpenMP runtime is then replaced afterwards.

Also note that, for some reason, conda-forge decided to use OpenBLAS's native threads by default on Linux, while it's also possible to make it use OpenMP. See the 2020-07-17 entry of the conda-forge announcements. Maybe @isuruf can explain why LLVM OpenMP is not used by default on Linux?

More details on OpenMP in conda-forge in https://conda-forge.org/docs/maintainer/knowledge_base.html#openmp

Maybe the topic of using OpenMP in the scipy stack deserves a Scientific Python SPEC.

@ogrisel
Contributor

ogrisel commented Nov 22, 2021

@rgommers about the fix for scipy 1.7.3 and macos/arm64: as explained in MacPython/scipy-wheels#143 (comment), the fix is currently not effective because the _distributor_init.py file is only shipped in the Windows wheels, not the macOS wheels.

@rgommers
Member Author

This way numpy would be the official BLAS implementation provider for the full scipy stack.

This sounds less than ideal to me - we'd have to duplicate the Cython APIs, plus NumPy has far fewer BLAS/LAPACK needs than SciPy and is still on an older minimum required version than SciPy (that's why for example Accelerate is re-enabled in NumPy but cannot be in SciPy).

My first reaction is that shipping a standalone wheel would be preferable to this solution.

Maybe the topic of using OpenMP in the scipy stack deserves a Scientific Python SPEC.

Yes, I think it does. And it's broader than just packaging; there's an important UX question there as well that deserves thinking about - are we continuing to default to single-threaded everywhere, except for BLAS/LAPACK calls?

@ogrisel
Contributor

ogrisel commented Nov 22, 2021

My first reaction is that shipping a standalone wheel would be preferable to this solution.

Fine with me as well.

are we continuing to default to single-threaded everywhere, except for BLAS/LAPACK calls?

I cannot speak for scipy, but in scikit-learn we have already started to use OpenMP-multithreaded Cython code by default for some components, and we are progressively generalizing this approach in the code base.

@ilayn
Member

ilayn commented Nov 22, 2021

There are many places in SciPy where multithreaded code can boost performance greatly. Hence, if there is a way to achieve it without hundreds of "if OpenMP then prange else range" type instructions, then +1 from me too. But the points raised in #10239 should probably be addressed first.

@ogrisel
Contributor

ogrisel commented Nov 22, 2021

Cython's prange works fine (automatic sequential fallback) if OpenMP is not configured at build time. No need to change the source code, just the build flags. Here is how we do it in scikit-learn to fall back to sequential mode if the C/C++ compiler does not support OpenMP:

https://github.com/scikit-learn/scikit-learn/search?q=_OPENMP_SUPPORTED&type=code

@ogrisel
Contributor

ogrisel commented Nov 25, 2021

But that command built scipy from source, right? How did you install OpenBLAS in this case? You can introspect which OpenBLAS is linked into your scipy 1.7.2 with python -m threadpoolctl -i scipy.linalg.

@ogrisel
Contributor

ogrisel commented Nov 25, 2021

Ok so you built a scipy 1.7.2 from source that linked against an OpenBLAS lib installed separately with homebrew.

The performance problem probably comes from a packaging problem that happened when generating the 1.7.3 wheels for macos/arm64, as explained in #15050 (comment).

@ogrisel
Contributor

ogrisel commented Nov 25, 2021

It's interesting to note that the OpenBLAS from Homebrew, which uses OpenMP as its threading layer, does not suffer from the slowdown. I don't know why that is the case.

@sunilshah

But that command built scipy from source, right? How did you install OpenBLAS in this case? You can introspect which OpenBLAS is linked into your scipy 1.7.2 with python -m threadpoolctl -i scipy.linalg.

Sorry, I did not bother to replicate what I had on my configuration from what I had in #14688.

@sunilshah

It's interesting to note that the OpenBLAS from Homebrew, which uses OpenMP as its threading layer, does not suffer from the slowdown. I don't know why that is the case.

This was exactly why I was worried about degrading performance with the new version of the wheel in my comment in #14688.

@ogrisel
Contributor

ogrisel commented Nov 25, 2021

My point is that if you want to upgrade to 1.7.3 while still linking to your OpenBLAS from homebrew, you can still build from source:

pip3 install --no-use-pep517 --no-use-wheel  scipy==1.7.3

you do not need to downgrade, just avoid installing the faulty wheel file.

@sunilshah

My point is that if you want to upgrade to 1.7.3 while still linking to your OpenBLAS from homebrew, you can still build from source:

pip3 install --no-use-pep517 --no-use-wheel  scipy==1.7.3

you do not need to downgrade, just avoid installing the faulty wheel file.

Did you mean --no-binary rather than --no-use-wheel ?

@ogrisel
Contributor

ogrisel commented Nov 25, 2021

Maybe even --no-binary :all: nowadays, I am not sure :)

@sunilshah

Maybe even --no-binary :all: nowadays, I am not sure :)

I did not succeed in building from source when trying different options for --no-binary. I have reverted to 1.7.2 for now.

@tylerjereddy
Contributor

the _distributor_init.py did not make it into the final 1.7.3 wheel for macos/arm64:

@ogrisel @rgommers hmm, really? I backported the approved change that was supposed to do that. If this is true, then I'd be inclined to add a CI test to the wheels repo that checks for the proper inclusion.

tylerjereddy added a commit to tylerjereddy/scipy-wheels that referenced this issue Nov 28, 2021
* following on from continued problems reported here:
scipy/scipy#15050

* draft in code to check that MacOS wheels get
the same `_distributor_init.py` as the one present
in the wheels repo, rather than the generic nearly-empty
copy actually present in the `1.7.3` release

* we've had two pull requests about this and it slipped through into
a release despite 3 sets of eyes, so we need to be careful here;
I'd like to see the test script added here failing for MacOS in CI,
like it does locally, before attempting a fix

* this is likely still wip, since relative paths/directories
can be confusing to predict in the wheels/multibuild control
flow, which also flushes in and out of containers I think, etc.
(this probably contributed to the original problem)

* the script did seem to work in a quick local check though, complaining
about our new MacOS file inclusion, but happy with the long-standing
windows wheel inclusion for dirs containing wheels for each OS type:

`python verify_init.py /Users/treddy/rough_work/scipy/release_173/release/installers/windows`
`python verify_init.py /Users/treddy/rough_work/scipy/release_173/release/installers/linux`
`python verify_init.py /Users/treddy/rough_work/scipy/release_173/release/installers/mac`

```
Traceback (most recent call last):
  File "verify_init.py", line 40, in <module>
    check_for_dist_init(input_dir=input_dir)
  File "verify_init.py", line 34, in check_for_dist_init
    raise ValueError(f"Contents of _distributor_init.py incorrect for
{wheel_file}")
ValueError: Contents of _distributor_init.py incorrect for
/Users/treddy/rough_work/scipy/release_173/release/installers/mac/scipy-1.7.3-cp39-cp39-macosx_10_9_x86_64.whl
```
@tylerjereddy
Contributor

Ok, my draft PR (MacPython/scipy-wheels#151) has a test that seems to correctly detect the issue for MacOS now:

ValueError: Contents of _distributor_init.py incorrect for wheelhouse/scipy-1.8.0.dev0+2099.3f05efb-cp310-cp310-macosx_12_0_arm64.whl

I still need to see if it behaves "ok" for other platforms/matrix entries. Once it does, applying the "actual fix" to include the correct distribution file should allow that test to pass.

@isuruf
Contributor

isuruf commented Nov 28, 2021

https://github.com/MacPython/scipy-wheels/blob/58e0b1dde52fa197bc2bc4b031eadfe5ddfa3532/patch_code.sh#L16 needs another scipy there. cp _distributor_init.py scipy/scipy/_distributor_init.py

@rgommers
Member Author

rgommers commented Nov 28, 2021

needs another scipy there. cp _distributor_init.py scipy/scipy/_distributor_init.py

Indeed, that was the issue - plus a "get this done before going offline for several days", so I didn't get a chance to test the pre-release wheels.

The fix here should be to upload new macOS arm64 wheels for 1.7.3 with a name that adds a build number of 1. We don't need a new release.

@charris
Member

charris commented Nov 28, 2021

with a name that adds a build number of 1

That's good as long as you don't need to update the source distribution.

@rgommers
Member Author

That's good as long as you don't need to update the source distribution.

Indeed we don't, it's a small fix in the scipy-wheels repo only.

@tylerjereddy
Contributor

Hi again, the new wheels for 1.7.3 MacOS arm64 have been uploaded--it would be great to get confirmation that things are looking better performance-wise with these new binaries, which should now be automatically preferred by pip.

For consistency with our normal security/hash disclosure practices:

  • scipy-1.7.3-1-cp310-cp310-macosx_12_0_arm64.whl
    • sha256sum: c9e04d7e9b03a8a6ac2045f7c5ef741be86727d8f49c45db45f244bdd2bcff17
    • md5sum: 10de5900c83d976142f44b50e42315bf
  • scipy-1.7.3-1-cp38-cp38-macosx_12_0_arm64.whl
    • sha256sum: b0e0aeb061a1d7dcd2ed59ea57ee56c9b23dd60100825f98238c06ee5cc4467e
    • md5sum: 5019b973dbe57508920a5db3dff44d9d
  • scipy-1.7.3-1-cp39-cp39-macosx_12_0_arm64.whl
    • sha256sum: b78a35c5c74d336f42f44106174b9851c783184a85a3fe3e68857259b37b9ffb
    • md5sum: b6c67a4c972bdb712032bb961c702f16

@psobolewskiPhD

Hi again, the new wheels for 1.7.3 MacOS arm64 have been uploaded--it would be great to get confirmation that things are looking better performance-wise with these new binaries, which should now be automatically preferred by pip.

I can confirm that installing the new wheel

Downloading scipy-1.7.3-1-cp39-cp39-macosx_12_0_arm64.whl (27.0 MB)

in a fresh conda env on M1 macOS 12 fixes the performance issue:

running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.088 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.014 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.015 s

@ogrisel
Contributor

ogrisel commented Nov 29, 2021

Confirmed!

@sunilshah

I did manage to install scipy 1.7.3-1 using pip on my M1. OpenBLAS and python3 are installed using Homebrew.

The slowdown seen in 1.7.3 with more threads is no longer there. There is a mild slowdown once the thread count goes above 5.

Downloading scipy-1.7.3-1-cp39-cp39-macosx_12_0_arm64.whl (27.0 MB)

@rgommers
Member Author

Thanks for the confirmations everyone. Let's call this good and close the issue. For the potential longer-term changes discussed around #15050 (comment), let's open a separate issue (I can do that now).

It would still be good to obtain a more fundamental explanation of what is going wrong at the OS level (and hopefully it's something Apple can make more robust in the future), but for now we're good on avoiding the issue.

@ogrisel
Contributor

ogrisel commented Dec 2, 2021

I have also observed that when I run the full scikit-learn test suite with pytest-xdist (pytest -n 8 --pyargs sklearn) and numpy and scipy are installed from the wheels, the total runtime is significantly slower than when numpy and scipy are installed from conda-forge.

I suspect some over-subscription problems between pytest-xdist and the 2 OpenBLAS runtimes of numpy and scipy, but I have not investigated yet. Reducing the number of threads with OPENBLAS_NUM_THREADS seems to help.

@rgommers
Member Author

rgommers commented Dec 3, 2021

I suspect some over-subscription problems between pytest-xdist and the 2 OpenBLAS runtimes of numpy and scipy, but I have not investigated yet. Reducing the number of threads with OPENBLAS_NUM_THREADS seems to help.

That was my conclusion in #14425 (comment) as well. On CI the optimal setting seems to be OPENBLAS_NUM_THREADS='2'.
