Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DecisionTreeClassifier became slower in v1.1 when fitting encoded variables #23397

Closed
ArturoAmorQ opened this issue May 17, 2022 · 9 comments
Closed
Assignees
Labels
Blocker Bug High Priority High priority issues and pull requests Regression
Milestone

Comments

@ArturoAmorQ
Copy link
Member

ArturoAmorQ commented May 17, 2022

Describe the bug

The evaluation of a pipeline that encodes categorical data with v1.1 takes around 8 times longer than using v1.0.2

Steps/Code to Reproduce

import numpy as np
import pandas as pd
from time import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector

rng = np.random.RandomState(0)
n_samples, n_features = 50_000, 2
X = pd.DataFrame(rng.randn(n_samples, n_features))
X[2] = np.random.choice(
    ["male", "female", "other"], size=n_samples, p=[0.49, 0.49, 0.02]
)
X[3] = np.random.choice(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"],
    size=n_samples,
)
y = np.random.choice(
    [0, 1, 2], size=n_samples, p=[0.01, 0.49, 0.5]
)

preprocessor = make_column_transformer(
    (OrdinalEncoder(), make_column_selector(dtype_include=object)),
    remainder="passthrough"
)
X_transformed = preprocessor.fit_transform(X)

t0 = time()
DecisionTreeClassifier().fit(X_transformed, y)
duration = time() - t0
duration

Expected Results

~450ms

Actual Results

3s

Versions

System:
    python: 3.9.5 | packaged by conda-forge | (default, Jun 19 2021, 00:32:32)  [GCC 9.3.0]
executable: /home/arturoamor/miniforge3/envs/scikit-learn-course/bin/python
   machine: Linux-5.14.0-1036-oem-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.1.0
          pip: 21.1.3
   setuptools: 49.6.0.post20210108
        numpy: 1.21.0
        scipy: 1.7.0
       Cython: None
       pandas: 1.3.0
   matplotlib: 3.4.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /home/arturoamor/miniforge3/envs/scikit-learn-course/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
         prefix: libgomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8

       filepath: /home/arturoamor/miniforge3/envs/scikit-learn-course/lib/libopenblasp-r0.3.15.so
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: 0.3.15
    num_threads: 8
threading_layer: pthreads
@ArturoAmorQ ArturoAmorQ added Bug Needs Triage Issue requires triage labels May 17, 2022
@glemaitre glemaitre removed the Needs Triage Issue requires triage label May 17, 2022
@glemaitre glemaitre added High Priority High priority issues and pull requests and removed Blocker Performance Needs Triage Issue requires triage labels May 17, 2022
@lesteve
Copy link
Member

lesteve commented May 17, 2022

A git bisect points to #22868 (use simultaneous sort in tree splitter).

@lesteve
Copy link
Member

lesteve commented May 17, 2022

To reproduce:

# /tmp/test.py
import numpy as np
from time import time
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n_samples, n_features = 50_000, 10

rng = np.random.RandomState(0)
X = rng.choice([0, 1, 2], size=(n_samples, n_features))
y = rng.choice([0, 1], size=n_samples)

tree = DecisionTreeClassifier()
t0 = time()
scores = tree.fit(X, y)
duration = time() - t0
print(f"{duration=:.2f}s")
# 4cf932d98 is the simultaenous sort commit aka https://github.com/scikit-learn/scikit-learn/pull/22868
git checkout 4cf932d98; make in > /dev/null 2>&1; ipython /tmp/test.py 
git checkout 4cf932d98~1; make in > /dev/null 2>&1; ipython /tmp/test.py 

On my machine the fit time goes from 0.1s to ~8s:

HEAD is now at 4cf932d98 ENH Use simultaenous sort in tree splitter (#22868)
duration=8.37s
Previous HEAD position was 4cf932d98 ENH Use simultaenous sort in tree splitter (#22868)
HEAD is now at bb5ab53f9 API Config: change default display to "diagram" (#22856)
duration=0.10s

Wild guess: maybe the new sort is slow when there are many ties (as is the case in ordinal-encoded categories)?

@glemaitre
Copy link
Member

Reading our documentation, it seems that simultaneous_sort is a recursive quicksort while sort was an introsort.

@ogrisel ogrisel added this to the 1.1.1 milestone May 17, 2022
@ogrisel
Copy link
Member

ogrisel commented May 17, 2022

It's weird that the benchmark in the original PR (#22868) did not catch this. Maybe this is because the regression is using low cardinality data while high cardinality data was use in the PR benchmark?

@glemaitre
Copy link
Member

glemaitre commented May 17, 2022

I assume so. Maybe this is the case where the introsort will switch from quicksort to heapsort internally while we currently get stuck with the quicksort. But I don't know anything about sorting :)

Maybe the median-of-3-killer thing explained in the analysis: https://en.wikipedia.org/wiki/Introsort

@lesteve
Copy link
Member

lesteve commented May 17, 2022

Wild-guessing a bit more (not a sorting algorithm expert either), and looking at https://github.com/scikit-learn/scikit-learn/pull/22868/files#diff-e2cca285e1e883ab1d427120dfa974c1ba83eb6e2f5d5f416bbd99717ca5f5fcL490-L491 which says

# Introsort with median of 3 pivot selection and 3-way partition function
# (robust to repeated elements, e.g. lots of zero features).

Maybe compared to the previous implementation our simultaneous_sort is missing the 3-way partition function , I am guessing this is explained in more details here

@ogrisel
Copy link
Member

ogrisel commented May 17, 2022

Indeed I remember that a long long time ago @glouppe and @larsmans did some benchmarks for choice of the sorting algorithm. I think this was introduced in this commit: 31491f9

Edit: here is the PR: #2747

@ogrisel
Copy link
Member

ogrisel commented May 17, 2022

@lesteve here are the timings of your script on my laptop:

  • main: 4.02s
  • 1.0.2: 0.09s

So this is indeed a pathological regression for such columns. I think we can try to rollback to the previous implementation for 1.1.1 and maybe later try to see if it's possible to find a "median of 3 pivot selection" variant in the C++ standard library.

@glemaitre
Copy link
Member

Fixed in #23410

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Bug High Priority High priority issues and pull requests Regression
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants