
GaussianProcessClassifier takes an extremely long time to run on M-chip Macs #28715

Open
zchlewis opened this issue Mar 28, 2024 · 7 comments
Labels
Bug, Needs Investigation, Performance

Comments

@zchlewis

zchlewis commented Mar 28, 2024

Describe the bug

I trained a GaussianProcessClassifier on a dataset of 200 points, but after 1.5 hours it still had not finished. This is not a problem on an Intel-CPU Mac; the issue only appears when the same code is run on an Apple Silicon (M-chip) Mac.

Steps/Code to Reproduce

import numpy as np
import pandas as pd

from scipy.special import logsumexp
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from copy import deepcopy


def generate_data(n, seed, shape='circular', noise=0.5):
    
    np.random.seed(seed)
    var = noise

    assert n % 2 == 0
    
    if shape == 'circular':
        # sample polar coordinates
        angles = np.random.uniform(low=0, high=2*np.pi, size=n)
        radii = ys = np.random.binomial(n=1, p=0.5, size=n)
        # transform to cartesian coordinates and add noise
        x1 = np.sin(angles)*radii + np.random.normal(scale=var, size=n)
        x2 = np.cos(angles)*radii + np.random.normal(scale=var, size=n)
        
    elif shape == 'binormal':
        ys = np.random.binomial(n=1, p=0.5, size=n)
        mu_1 = 0.5 - ys
        mu_2 = ys - 0.5
        x1 = np.random.normal(loc=mu_1, scale=var, size=n)
        x2 = np.random.normal(loc=mu_2, scale=var, size=n)
    
    elif shape == 'moon':
        # moon-shaped data comes from sklearn's make_moons in get_datasets below
        pass

    xs = np.array([x1, x2]).T
    return xs, ys

def get_datasets(seed, n_samples=100, n_test_samples=200):
    moon_set = (
        make_moons(n_samples=n_samples, noise=0.3, random_state=seed),
        make_moons(n_samples=n_test_samples, noise=0.3, random_state=seed+1000)
    )
    circular_set = (
        generate_data(n=n_samples, shape='circular', seed=seed, noise=0.3),
        generate_data(n=n_test_samples, shape='circular', seed=seed+1000, noise=0.3),
    )
    binormal_set = (
        generate_data(n=n_samples, shape='binormal', seed=seed, noise=0.6),
        generate_data(n=n_test_samples, shape='binormal', seed=seed+1000, noise=0.6),
    )
    return moon_set, circular_set, binormal_set

def get_models(clf, reps, n_samples=100):
    results = [[], [], []]
    for seed in range(reps):
        for ds_cnt, ((X_train, y_train), (X_test, y_test)) in enumerate(get_datasets(seed, n_samples=n_samples)):
            new_clf = deepcopy(clf)
            new_clf.fit(X_train, y_train)
            results[ds_cnt].append(new_clf)
    return results

get_models(GaussianProcessClassifier(1.0 * RBF(1.0)), reps=64, n_samples=200)

Expected Results

Get the trained models back within a reasonable time.

Actual Results

The runtime was far too long and no result was produced, so I stopped it with a KeyboardInterrupt.

Versions

System:
    python: 3.10.4 (main, Mar 27 2024, 14:28:43) [Clang 15.0.0 (clang-1500.3.9.4)]
executable: /Users/hb70ur/.pyenv/versions/3.10.4/envs/UE/bin/python
   machine: macOS-14.4-arm64-arm-64bit

Python dependencies:
      sklearn: 1.1.2
          pip: 24.0
   setuptools: 58.1.0
        numpy: 1.26.4
        scipy: 1.12.0
       Cython: None
       pandas: 2.2.1
   matplotlib: 3.8.3
       joblib: 1.3.2
threadpoolctl: 3.4.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: libomp
       filepath: /Users/hb70ur/.pyenv/versions/3.10.4/envs/UE/lib/python3.10/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /Users/hb70ur/.pyenv/versions/3.10.4/envs/UE/lib/python3.10/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /Users/hb70ur/.pyenv/versions/3.10.4/envs/UE/lib/python3.10/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8
@zchlewis added the Bug and Needs Triage labels on Mar 28, 2024
@jeremiedbb
Member

I tested it on a Linux laptop and it was as fast as expected: <1 s per seed.
I also tested on a Windows machine with more cores (16 threads) and it was a lot slower: ~8 s per seed.
But limiting the number of threads with threadpool_limits made it as fast as on the Linux laptop.

@zchlewis how many cores do you have on the two machines you tried? Could you try the following:

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    get_models(GaussianProcessClassifier(1.0 * RBF(1.0)), reps=64, n_samples=200)

It looks like very poor use of multiple cores, or possibly oversubscription.
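(An alternative sketch, in case the threadpoolctl context manager is inconvenient: the native thread pools can also be capped through environment variables set before numpy/scipy are first imported. The value 1 below is only an example, and get_models is the function from the reproducer above.)

import os

# Must be set before numpy/scipy load their BLAS / OpenMP runtimes.
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS worker threads
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP runtime (libomp)

from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# get_models as defined in the reproducer above.
get_models(GaussianProcessClassifier(1.0 * RBF(1.0)), reps=64, n_samples=200)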

@glemaitre or @ogrisel, can you reproduce this on your M1?

@jeremiedbb added the Performance label and removed the Needs Triage label on Mar 28, 2024
@RUrlus
Contributor

RUrlus commented Mar 28, 2024

@jeremiedbb I can replicate the issue on an M2 Pro; I'm fairly certain it's an oversubscription issue.
Using your code above I find the following timings (see the timing-loop sketch after the list):

  • 1 thread: wall time 11.1 s
  • 2 threads: wall time 9.86 s
  • 4 threads: wall time 9.55 s
  • 6 threads: wall time 17.4 s
  • ≥ 7 threads: stalls at 100% CPU utilisation
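(A sketch of a timing loop for collecting such per-thread-count numbers, reusing get_models and the estimator from the reproducer above; the exact harness used for the figures above is not shown in this thread.)

from time import perf_counter
from threadpoolctl import threadpool_limits

for n_threads in (1, 2, 4, 6):
    with threadpool_limits(limits=n_threads):
        start = perf_counter()
        get_models(GaussianProcessClassifier(1.0 * RBF(1.0)), reps=64, n_samples=200)
        print(f"{n_threads} threads: wall time {perf_counter() - start:.1f} s")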

Apple M2 Pro running:

System:
    python: 3.11.3 (main, May  1 2023, 15:33:25) [Clang 14.0.3 (clang-1403.0.22.14.1)]
executable: /Users/<user>/.pyenv/versions/3.11.3/envs/PPU/bin/python3.11
   machine: macOS-13.6.6-arm64-arm-64bit

Python dependencies:
      sklearn: 1.4.1.post1
          pip: 24.0
   setuptools: 69.2.0
        numpy: 1.26.4
        scipy: 1.12.0
       Cython: None
       pandas: 2.2.1
   matplotlib: 3.8.3
       joblib: 1.3.2
threadpoolctl: 3.3.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /Users/<user>/.pyenv/versions/3.11.3/envs/PPU/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /Users/<user>/.pyenv/versions/3.11.3/envs/PPU/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: libomp
       filepath: /Users/<user>/.pyenv/versions/3.11.3/envs/PPU/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

@ogrisel
Member

ogrisel commented Mar 28, 2024

I cannot reproduce the problem on my macOS laptop when using Accelerate for the BLAS implementation linked to numpy and scipy installed from conda-forge (mamba install "libblas=*=*accelerate"): it runs in 8s (without any attempt to control the number of threads used by Accelerate).

But I can reproduce the stalling when switching my conda env to openblas and I get similar timings when limiting the number of threads used by openblas.
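(A quick sketch for checking which BLAS implementation numpy and scipy are actually linked against; output format varies by version, and Accelerate may not appear in threadpoolctl's report since it exposes no thread-control API.)

from pprint import pprint

import numpy as np
import scipy
from threadpoolctl import threadpool_info

# Build/link information for the BLAS/LAPACK each package was compiled against.
np.show_config()
scipy.show_config()

# Thread pools threadpoolctl can detect and control (OpenBLAS, MKL, OpenMP, ...).
pprint(threadpool_info())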

@ogrisel
Member

ogrisel commented Mar 28, 2024

From a quick look at the code, I did not spot any parallel section that could explain the oversubscription. n_jobs is None by default, which should keep the joblib calls sequential in OneVsRestClassifier and OneVsOneClassifier, and I am not even sure those are actually used here (only when n_classes > 2).

@RUrlus or @zchlewis, would you be interested in helping us investigate the cause of the oversubscription by crafting a minimal reproducer that ideally uses only numpy and scipy?
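(For anyone picking this up: GaussianProcessClassifier's binary Laplace approximation spends most of its time building the kernel matrix and doing Cholesky factorizations and triangular solves on n_samples × n_samples arrays, so a candidate numpy/scipy-only reproducer might look like the sketch below. This is only a starting point and has not been verified to trigger the stall.)

import numpy as np
from scipy.linalg import cholesky, cho_solve
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = rng.integers(0, 2, size=n).astype(float)

# Repeated small LAPACK/BLAS calls, loosely mimicking the Newton iterations
# and hyperparameter-optimization steps inside GaussianProcessClassifier.fit.
for _ in range(10_000):
    K = np.exp(-0.5 * cdist(X, X, "sqeuclidean"))  # RBF kernel matrix
    B = np.eye(n) + 0.25 * K                       # illustrative I + W^1/2 K W^1/2
    L = cholesky(B, lower=True)                    # Cholesky factorization
    _ = cho_solve((L, True), y)                    # triangular solves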

@RUrlus
Contributor

RUrlus commented Mar 29, 2024

@ogrisel thanks for investigating. I'll have a go at making a NumPy/SciPy MRE.

@jeremiedbb
Member

From what I could find, it's not an oversubscription issue: there's no nested parallelism involved here. It's not linked to OpenMP either, which is not involved in this estimator.

I see 2 possible causes:

  • a situation similar to Slowdown when using openblas-pthreads alongside openmp based parallel code OpenMathLib/OpenBLAS#3187, except that here it would be a bad interaction between the thread pools of the two OpenBLAS libraries.

    If that were the case, I'd expect installing numpy and scipy from conda-forge with a shared OpenBLAS to solve it. The fact that there was no difference between allowing all threads and limiting to 1 could support that, but on the other hand it seemed to always be slow, and I can't understand why.

  • OpenBLAS uses as many threads as there are CPU cores, which is too many for small operations.

I haven't been able to figure out which one it is yet (or even whether the first one is possible); it needs more work.
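(One way to probe the second hypothesis, as a sketch and not part of the investigation above: time a single small factorization under different BLAS thread caps and check whether more threads makes it slower.)

from time import perf_counter

import numpy as np
from threadpoolctl import threadpool_limits

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 200))
B = A @ A.T + 200 * np.eye(200)  # small symmetric positive-definite matrix

for n_threads in (1, 2, 4, 8, 12):
    with threadpool_limits(limits=n_threads, user_api="blas"):
        start = perf_counter()
        for _ in range(2_000):
            np.linalg.cholesky(B)
        print(f"{n_threads:>2} BLAS threads: {perf_counter() - start:.2f} s")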

@jeremiedbb added the Needs Investigation label on Apr 16, 2024
@RUrlus
Contributor

RUrlus commented Apr 17, 2024

@jeremiedbb The reason I suspect oversubscription is that with threadpool_limits set to 7 threads I see 100% CPU utilisation on a 12-core M2 Pro, whereas this does not happen with up to 6 threads.

I'm working on tracing all the threads linked to the process, but that's not trivial without an instrumented build, and other things took priority.
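(A lighter-weight alternative, sketched here with psutil, an extra dependency not used elsewhere in this thread: poll the process's native thread count from a background thread while the reproducer runs, to see how many OS threads, including BLAS/OpenMP workers, are actually live.)

import threading
import time

import psutil  # extra dependency, used only for illustration

done = threading.Event()

def report_thread_count(interval=1.0):
    proc = psutil.Process()  # current process
    while not done.is_set():
        # num_threads() counts native OS threads, including BLAS/OpenMP workers.
        print(f"live OS threads: {proc.num_threads()}")
        time.sleep(interval)

threading.Thread(target=report_thread_count, daemon=True).start()

# get_models / GaussianProcessClassifier / RBF as in the reproducer above.
get_models(GaussianProcessClassifier(1.0 * RBF(1.0)), reps=1, n_samples=200)
done.set()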
