Running GaussianProcessClassifier on M-chip Macs takes an extremely long time #28715
Comments
I tested it on a Linux laptop and it was as fast as expected: <1 sec per seed.

@zchlewis how many cores do you have on the two machines you tried? Could you try the following:

```python
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    get_models(GaussianProcessClassifier(1.0 * RBF(1.0)), reps=64, n_samples=200)
```

It looks like very bad multi-core usage, or maybe oversubscription. @glemaitre or @ogrisel, can you reproduce on your M1?
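For anyone who wants to run the experiment without the rest of the thread's code, here is a self-contained sketch of it. `get_models` is not shown in this transcript, so fitting a single classifier on a synthetic 200-sample dataset stands in for it; the dataset and the `threadpool_limits` usage are assumptions consistent with the snippet above.

```python
import numpy as np
from threadpoolctl import threadpool_limits
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Synthetic stand-in for the dataset used in the thread (200 samples).
X, y = make_classification(n_samples=200, random_state=0)

# Cap all native thread pools (OpenBLAS, OpenMP, ...) at a single thread
# for the duration of the fit, to test the oversubscription hypothesis.
with threadpool_limits(limits=1):
    clf = GaussianProcessClassifier(1.0 * RBF(1.0)).fit(X, y)
```

If the single-threaded fit is fast while the unrestricted one stalls, that points at the BLAS threading layer rather than at the estimator itself.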
@jeremiedbb I can replicate the issue on an M2 Pro; I'm fairly certain it's an oversubscription issue.
Apple M2 Pro, running:

System:
python: 3.11.3 (main, May 1 2023, 15:33:25) [Clang 14.0.3 (clang-1403.0.22.14.1)]
executable: /Users/<user>/.pyenv/versions/3.11.3/envs/PPU/bin/python3.11
machine: macOS-13.6.6-arm64-arm-64bit
Python dependencies:
sklearn: 1.4.1.post1
pip: 24.0
setuptools: 69.2.0
numpy: 1.26.4
scipy: 1.12.0
Cython: None
pandas: 2.2.1
matplotlib: 3.8.3
joblib: 1.3.2
threadpoolctl: 3.3.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 12
prefix: libopenblas
filepath: /Users/<user>/.pyenv/versions/3.11.3/envs/PPU/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
version: 0.3.23.dev
threading_layer: pthreads
architecture: armv8
user_api: blas
internal_api: openblas
num_threads: 12
prefix: libopenblas
filepath: /Users/<user>/.pyenv/versions/3.11.3/envs/PPU/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
version: 0.3.21.dev
threading_layer: pthreads
architecture: armv8
user_api: openmp
internal_api: openmp
num_threads: 12
prefix: libomp
filepath: /Users/<user>/.pyenv/versions/3.11.3/envs/PPU/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
version: None
I cannot reproduce the problem on my macOS laptop when using Accelerate as the BLAS implementation linked to the numpy and scipy installed from conda-forge. But I can reproduce the stalling when switching my conda env to openblas, and I get similar timings when limiting the number of threads used by openblas.
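To check which BLAS your own numpy/scipy are linked against (Accelerate vs. OpenBLAS), `threadpoolctl` can enumerate the loaded native thread pools. This is a small sketch using only `threadpoolctl.threadpool_info`, which is the same mechanism that produced the `threadpoolctl info:` dump above.

```python
from threadpoolctl import threadpool_info
import numpy  # noqa: F401  -- importing numpy loads its BLAS so it gets registered

# Each entry describes one native library that exposes a thread pool
# (e.g. OpenBLAS for numpy, OpenBLAS for scipy, libomp for scikit-learn).
for lib in threadpool_info():
    print(lib["user_api"], lib["internal_api"], lib.get("num_threads"))
```

An `openblas` entry with `num_threads` equal to the full core count is the configuration in which the stalling was reproduced.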
From a quick look at the code, I did not spot any parallel section that could explain the oversubscription. n_jobs is set to None by default, which should keep joblib calls sequential.

@RUrlus or @zchlewis, would you be interested in helping us investigate the cause of the oversubscription by crafting a minimal reproducer that would ideally use only numpy & scipy?
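A starting point for such a numpy/scipy-only reproducer: the GPC fit (via the Laplace approximation) is dominated by repeated Cholesky factorizations and triangular solves on the n×n kernel matrix, so a tight loop of those calls on a 200×200 SPD matrix should exercise the same BLAS code paths. This is an assumed sketch, not the reproducer eventually used in the investigation.

```python
import time
import numpy as np
from scipy.linalg import cholesky, cho_solve

rng = np.random.default_rng(0)
n = 200
# SPD matrix shaped like the GPC kernel matrix (K plus a jitter diagonal).
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)
y = rng.standard_normal(n)

t0 = time.perf_counter()
for _ in range(500):  # GPC's inner loop performs many such factorize/solve steps
    L = cholesky(K, lower=True)
    alpha = cho_solve((L, True), y)
print(f"{time.perf_counter() - t0:.3f}s")
```

If this loop is orders of magnitude slower with OpenBLAS at 12 threads than at 1 thread on an M-chip Mac, the problem is isolated below scikit-learn.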
@ogrisel thanks for investigating. I'll have a go at making a NumPy/SciPy MRE.
From what I could find, it's not an oversubscription issue: there's no nested parallelism involved here. It's not linked to OpenMP either, which is not involved in this estimator. I see 2 possible causes:
I haven't been able to figure out which one it is yet (or even whether the first one is possible). It requires more work.
@jeremiedbb The reason I suspected oversubscription is that with threadpool_limits set to 7 threads I see 100% CPU utilisation on a 12-core M2 Pro, whereas this does not occur with up to 6 threads. I'm working on tracing all the threads linked to the process, but that's not trivial without an instrumented build, and other things took priority.
Describe the bug
I trained the Gaussian Process classifier on a 200-point dataset, but after 1.5 hours it still had not produced a result. This is not a problem on an Intel-CPU Mac, but when the same code is moved to an M-chip Mac, the problem appears.
Steps/Code to Reproduce
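The original reproducer code was not captured in this transcript. The following is a minimal sketch consistent with the description (a 200-sample dataset and a GaussianProcessClassifier with an RBF kernel); the synthetic dataset and kernel hyperparameters are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Synthetic stand-in for the reporter's 200-point dataset.
X, y = make_classification(n_samples=200, random_state=0)

clf = GaussianProcessClassifier(kernel=1.0 * RBF(1.0), random_state=0)
# On an M-chip Mac with numpy/scipy linked against OpenBLAS, this fit
# reportedly stalls for hours; on Intel or with Accelerate it is fast.
clf.fit(X, y)
```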
Expected Results
get the trained model
Actual Results
The runtime was too long and no result was produced, so I stopped it with a KeyboardInterrupt.
Versions