
GaussianProcessClassifier takes an extremely long time to run on M-chip Macs #28715

Open
zchlewis opened this issue Mar 28, 2024 · 7 comments
Labels
Bug, Needs Investigation, Performance

Comments

@zchlewis

zchlewis commented Mar 28, 2024

Describe the bug

I trained a GaussianProcessClassifier on a dataset of 200 points, but after 1.5 hours it still had not finished. This is not a problem on an Intel-CPU Mac; the issue only appears when the same code is run on an Apple Silicon (M-chip) Mac.

Steps/Code to Reproduce

import numpy as np
import pandas as pd

from scipy.special import logsumexp
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from copy import deepcopy


def generate_data(n, seed, shape='circular', noise=0.5):
    
    np.random.seed(seed)
    var = noise

    assert n % 2 == 0
    
    if shape == 'circular':
        # sample polar coordinates
        angles = np.random.uniform(low=0, high=2*np.pi, size=n)
        radii = ys = np.random.binomial(n=1, p=0.5, size=n)
        # transform to cartesian coordinates and add noise
        x1 = np.sin(angles)*radii + np.random.normal(scale=var, size=n)
        x2 = np.cos(angles)*radii + np.random.normal(scale=var, size=n)
        
    elif shape == 'binormal':
        ys = np.random.binomial(n=1, p=0.5, size=n)
        mu_1 = 0.5 - ys
        mu_2 = ys - 0.5
        x1 = np.random.normal(loc=mu_1, scale=var, size=n)
        x2 = np.random.normal(loc=mu_2, scale=var, size=n)
    
    elif shape == 'moon':
        # moon-shaped data comes from sklearn's make_moons in get_datasets below
        pass

    xs = np.array([x1, x2]).T
    return xs, ys

def get_datasets(seed, n_samples=100, n_test_samples=200):
    moon_set = (
        make_moons(n_samples=n_samples, noise=0.3, random_state=seed),
        make_moons(n_samples=n_test_samples, noise=0.3, random_state=seed+1000)
    )
    circular_set = (
        generate_data(n=n_samples, shape='circular', seed=seed, noise=0.3),
        generate_data(n=n_test_samples, shape='circular', seed=seed+1000, noise=0.3),
    )
    binormal_set = (
        generate_data(n=n_samples, shape='binormal', seed=seed, noise=0.6),
        generate_data(n=n_test_samples, shape='binormal', seed=seed+1000, noise=0.6),
    )
    return moon_set, circular_set, binormal_set

def get_models(clf, reps, n_samples=100):
    results = [[], [], []]
    for seed in range(reps):
        for ds_cnt, ((X_train, y_train), (X_test, y_test)) in enumerate(get_datasets(seed, n_samples=n_samples)):
            new_clf = deepcopy(clf)
            new_clf.fit(X_train, y_train)
            results[ds_cnt].append(new_clf)
    return results

get_models(GaussianProcessClassifier(1.0 * RBF(1.0)), reps=64, n_samples=200)

Expected Results

Get the trained models back within a reasonable time.

Actual Results

The runtime was far too long and no result was produced, so I stopped it with a KeyboardInterrupt.

Versions

System:
    python: 3.10.4 (main, Mar 27 2024, 14:28:43) [Clang 15.0.0 (clang-1500.3.9.4)]
executable: /Users/hb70ur/.pyenv/versions/3.10.4/envs/UE/bin/python
   machine: macOS-14.4-arm64-arm-64bit

Python dependencies:
      sklearn: 1.1.2
          pip: 24.0
   setuptools: 58.1.0
        numpy: 1.26.4
        scipy: 1.12.0
       Cython: None
       pandas: 2.2.1
   matplotlib: 3.8.3
       joblib: 1.3.2
threadpoolctl: 3.4.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: libomp
       filepath: /Users/hb70ur/.pyenv/versions/3.10.4/envs/UE/lib/python3.10/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /Users/hb70ur/.pyenv/versions/3.10.4/envs/UE/lib/python3.10/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /Users/hb70ur/.pyenv/versions/3.10.4/envs/UE/lib/python3.10/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8
@zchlewis added the Bug and Needs Triage labels on Mar 28, 2024
@jeremiedbb
Member

I tested it on a Linux laptop and it was as fast as expected: <1 s per seed.
I also tested on a Windows machine with more cores (16 threads) and it was a lot slower: ~8 s per seed.
But limiting the number of threads with threadpool_limits made it as fast as on the Linux laptop.

@zchlewis how many cores do you have on the two machines you tried? Could you try the following:

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    get_models(GaussianProcessClassifier(1.0 * RBF(1.0)), reps=64, n_samples=200)

It looks like very poor use of multiple cores, or possibly oversubscription.
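(An alternative sketch, in case the threadpoolctl context manager is inconvenient: the native thread pools can also be capped through environment variables set before numpy/scipy are first imported. The value 1 below is only an example, and get_models is the function from the reproducer above.)

import os

# Must be set before numpy/scipy load their BLAS / OpenMP runtimes.
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS worker threads
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP runtime (libomp)

from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# get_models as defined in the reproducer above.
get_models(GaussianProcessClassifier(1.0 * RBF(1.0)), reps=64, n_samples=200)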

@glemaitre or @ogrisel, can you reproduce this on your M1?

@jeremiedbb added the Performance label and removed the Needs Triage label on Mar 28, 2024
@RUrlus
Contributor

RUrlus commented Mar 28, 2024

@jeremiedbb I can replicate the issue on an M2 Pro; I'm fairly certain it's an oversubscription issue.
Using your code above I find the following timings (see the timing-loop sketch after the list):

  • 1 thread: wall time 11.1 s
  • 2 threads: wall time 9.86 s
  • 4 threads: wall time 9.55 s
  • 6 threads: wall time 17.4 s
  • ≥ 7 threads: stalls at 100% CPU utilisation
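(A sketch of a timing loop for collecting such per-thread-count numbers, reusing get_models and the estimator from the reproducer above; the exact harness used for the figures above is not shown in this thread.)

from time import perf_counter
from threadpoolctl import threadpool_limits

for n_threads in (1, 2, 4, 6):
    with threadpool_limits(limits=n_threads):
        start = perf_counter()
        get_models(GaussianProcessClassifier(1.0 * RBF(1.0)), reps=64, n_samples=200)
        print(f"{n_threads} threads: wall time {perf_counter() - start:.1f} s")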

Apple M2 Pro running:

System:
    python: 3.11.3 (main, May  1 2023, 15:33:25) [Clang 14.0.3 (clang-1403.0.22.14.1)]
executable: /Users/<user>/.pyenv/versions/3.11.3/envs/PPU/bin/python3.11
   machine: macOS-13.6.6-arm64-arm-64bit

Python dependencies:
      sklearn: 1.4.1.post1
          pip: 24.0
   setuptools: 69.2.0
        numpy: 1.26.4
        scipy: 1.12.0
       Cython: None
       pandas: 2.2.1
   matplotlib: 3.8.3
       joblib: 1.3.2
threadpoolctl: 3.3.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /Users/<user>/.pyenv/versions/3.11.3/envs/PPU/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /Users/<user>/.pyenv/versions/3.11.3/envs/PPU/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: libomp
       filepath: /Users/<user>/.pyenv/versions/3.11.3/envs/PPU/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

@ogrisel
Member

ogrisel commented Mar 28, 2024

I cannot reproduce the problem on my macOS laptop when using Accelerate for the BLAS implementation linked to numpy and scipy installed from conda-forge (mamba install "libblas=*=*accelerate"): it runs in 8s (without any attempt to control the number of threads used by Accelerate).

But I can reproduce the stalling when switching my conda env to openblas and I get similar timings when limiting the number of threads used by openblas.
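(A quick sketch for checking which BLAS implementation numpy and scipy are actually linked against; output format varies by version, and Accelerate may not appear in threadpoolctl's report since it exposes no thread-control API.)

from pprint import pprint

import numpy as np
import scipy
from threadpoolctl import threadpool_info

# Build/link information for the BLAS/LAPACK each package was compiled against.
np.show_config()
scipy.show_config()

# Thread pools threadpoolctl can detect and control (OpenBLAS, MKL, OpenMP, ...).
pprint(threadpool_info())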

@ogrisel
Member

ogrisel commented Mar 28, 2024

From a quick look at the code, I did not spot any parallel section that could explain the oversubscription. n_jobs is None by default, which should keep the joblib calls sequential in OneVsRestClassifier and OneVsOneClassifier, and I am not even sure those are actually used here (only when n_classes > 2).

@RUrlus or @zchlewis, would you be interested in helping us investigate the cause of the oversubscription by crafting a minimal reproducer that ideally uses only numpy and scipy?
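(For anyone picking this up: GaussianProcessClassifier's binary Laplace approximation spends most of its time building the kernel matrix and doing Cholesky factorizations and triangular solves on n_samples × n_samples arrays, so a candidate numpy/scipy-only reproducer might look like the sketch below. This is only a starting point and has not been verified to trigger the stall.)

import numpy as np
from scipy.linalg import cholesky, cho_solve
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = rng.integers(0, 2, size=n).astype(float)

# Repeated small LAPACK/BLAS calls, loosely mimicking the Newton iterations
# and hyperparameter-optimization steps inside GaussianProcessClassifier.fit.
for _ in range(10_000):
    K = np.exp(-0.5 * cdist(X, X, "sqeuclidean"))  # RBF kernel matrix
    B = np.eye(n) + 0.25 * K                       # illustrative I + W^1/2 K W^1/2
    L = cholesky(B, lower=True)                    # Cholesky factorization
    _ = cho_solve((L, True), y)                    # triangular solves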

@RUrlus
Contributor

RUrlus commented Mar 29, 2024

@ogrisel thanks for investigating. I'll have a go at making a NumPy/SciPy MRE.

@jeremiedbb
Member

From what I could find, it's not an oversubscription issue: there's no nested parallelism involved here. It's not linked to OpenMP either, which is not involved in this estimator.

I see 2 possible causes:

  • a situation similar to Slowdown when using openblas-pthreads alongside openmp based parallel code OpenMathLib/OpenBLAS#3187, except that here it would be a bad interaction between the thread pools of the two OpenBLAS libraries.

    If that were the case, I'd expect installing numpy and scipy from conda-forge with a shared OpenBLAS to solve it. The fact that there was no difference between allowing all threads and limiting to 1 could support that, but on the other hand it seemed to always be slow, and I can't understand why.

  • OpenBLAS uses as many threads as there are CPU cores, which is too many for small operations.

I haven't been able to figure out which one it is yet (or even whether the first one is possible); it needs more work.
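(One way to probe the second hypothesis, as a sketch and not part of the investigation above: time a single small factorization under different BLAS thread caps and check whether more threads makes it slower.)

from time import perf_counter

import numpy as np
from threadpoolctl import threadpool_limits

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 200))
B = A @ A.T + 200 * np.eye(200)  # small symmetric positive-definite matrix

for n_threads in (1, 2, 4, 8, 12):
    with threadpool_limits(limits=n_threads, user_api="blas"):
        start = perf_counter()
        for _ in range(2_000):
            np.linalg.cholesky(B)
        print(f"{n_threads:>2} BLAS threads: {perf_counter() - start:.2f} s")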

@jeremiedbb added the Needs Investigation label on Apr 16, 2024
@RUrlus
Contributor

RUrlus commented Apr 17, 2024

@jeremiedbb The reason I suspect oversubscription is that with threadpool_limits set to 7 threads I see 100% CPU utilisation on a 12-core M2 Pro, whereas this does not happen with up to 6 threads.

I'm working on tracing all the threads linked to the process, but that's not trivial without an instrumented build, and other things took priority.
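(A lighter-weight alternative, sketched here with psutil, an extra dependency not used elsewhere in this thread: poll the process's native thread count from a background thread while the reproducer runs, to see how many OS threads, including BLAS/OpenMP workers, are actually live.)

import threading
import time

import psutil  # extra dependency, used only for illustration

done = threading.Event()

def report_thread_count(interval=1.0):
    proc = psutil.Process()  # current process
    while not done.is_set():
        # num_threads() counts native OS threads, including BLAS/OpenMP workers.
        print(f"live OS threads: {proc.num_threads()}")
        time.sleep(interval)

threading.Thread(target=report_thread_count, daemon=True).start()

# get_models / GaussianProcessClassifier / RBF as in the reproducer above.
get_models(GaussianProcessClassifier(1.0 * RBF(1.0)), reps=1, n_samples=200)
done.set()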
