`DecisionTreeClassifier` became slower in v1.1 when fitting encoded variables #23397

ArturoAmorQ · 2022-05-17T13:29:39Z

Describe the bug

The evaluation of a pipeline that encodes categorical data with v1.1 takes around 8 times longer than using v1.0.2

Steps/Code to Reproduce

import numpy as np
import pandas as pd
from time import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector

rng = np.random.RandomState(0)
n_samples, n_features = 50_000, 2
X = pd.DataFrame(rng.randn(n_samples, n_features))
X[2] = np.random.choice(
    ["male", "female", "other"], size=n_samples, p=[0.49, 0.49, 0.02]
)
X[3] = np.random.choice(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"],
    size=n_samples,
)
y = np.random.choice(
    [0, 1, 2], size=n_samples, p=[0.01, 0.49, 0.5]
)

preprocessor = make_column_transformer(
    (OrdinalEncoder(), make_column_selector(dtype_include=object)),
    remainder="passthrough"
)
X_transformed = preprocessor.fit_transform(X)

t0 = time()
DecisionTreeClassifier().fit(X_transformed, y)
duration = time() - t0
duration

Expected Results

~450ms

Actual Results

3s

Versions

System:
    python: 3.9.5 | packaged by conda-forge | (default, Jun 19 2021, 00:32:32)  [GCC 9.3.0]
executable: /home/arturoamor/miniforge3/envs/scikit-learn-course/bin/python
   machine: Linux-5.14.0-1036-oem-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.1.0
          pip: 21.1.3
   setuptools: 49.6.0.post20210108
        numpy: 1.21.0
        scipy: 1.7.0
       Cython: None
       pandas: 1.3.0
   matplotlib: 3.4.2
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /home/arturoamor/miniforge3/envs/scikit-learn-course/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
         prefix: libgomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8

       filepath: /home/arturoamor/miniforge3/envs/scikit-learn-course/lib/libopenblasp-r0.3.15.so
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: 0.3.15
    num_threads: 8
threading_layer: pthreads

lesteve · 2022-05-17T15:00:54Z

A git bisect points to #22868 (use simultaneous sort in tree splitter).

lesteve · 2022-05-17T15:26:30Z

To reproduce:

# /tmp/test.py
import numpy as np
from time import time
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n_samples, n_features = 50_000, 10

rng = np.random.RandomState(0)
X = rng.choice([0, 1, 2], size=(n_samples, n_features))
y = rng.choice([0, 1], size=n_samples)

tree = DecisionTreeClassifier()
t0 = time()
scores = tree.fit(X, y)
duration = time() - t0
print(f"{duration=:.2f}s")

# 4cf932d98 is the simultaenous sort commit aka https://github.com/scikit-learn/scikit-learn/pull/22868
git checkout 4cf932d98; make in > /dev/null 2>&1; ipython /tmp/test.py 
git checkout 4cf932d98~1; make in > /dev/null 2>&1; ipython /tmp/test.py

On my machine the fit time goes from 0.1s to ~8s:

HEAD is now at 4cf932d98 ENH Use simultaenous sort in tree splitter (#22868)
duration=8.37s
Previous HEAD position was 4cf932d98 ENH Use simultaenous sort in tree splitter (#22868)
HEAD is now at bb5ab53f9 API Config: change default display to "diagram" (#22856)
duration=0.10s

Wild guess: maybe the new sort is slow when there are many ties (as is the case in ordinal-encoded categories)?

glemaitre · 2022-05-17T15:32:00Z

Reading our documentation, it seems that simultaneous_sort is a recursive quicksort while sort was an introsort.

ogrisel · 2022-05-17T15:38:48Z

It's weird that the benchmark in the original PR (#22868) did not catch this. Maybe this is because the regression is using low cardinality data while high cardinality data was use in the PR benchmark?

glemaitre · 2022-05-17T15:42:03Z

I assume so. Maybe this is the case where the introsort will switch from quicksort to heapsort internally while we currently get stuck with the quicksort. But I don't know anything about sorting :)

Maybe the median-of-3-killer thing explained in the analysis: https://en.wikipedia.org/wiki/Introsort

lesteve · 2022-05-17T15:51:34Z

Wild-guessing a bit more (not a sorting algorithm expert either), and looking at https://github.com/scikit-learn/scikit-learn/pull/22868/files#diff-e2cca285e1e883ab1d427120dfa974c1ba83eb6e2f5d5f416bbd99717ca5f5fcL490-L491 which says

# Introsort with median of 3 pivot selection and 3-way partition function
# (robust to repeated elements, e.g. lots of zero features).

Maybe compared to the previous implementation our simultaneous_sort is missing the 3-way partition function , I am guessing this is explained in more details here

ogrisel · 2022-05-17T16:03:47Z

Indeed I remember that a long long time ago @glouppe and @larsmans did some benchmarks for choice of the sorting algorithm. I think this was introduced in this commit: 31491f9

Edit: here is the PR: #2747

ogrisel · 2022-05-17T16:10:10Z

@lesteve here are the timings of your script on my laptop:

main: 4.02s
1.0.2: 0.09s

So this is indeed a pathological regression for such columns. I think we can try to rollback to the previous implementation for 1.1.1 and maybe later try to see if it's possible to find a "median of 3 pivot selection" variant in the C++ standard library.

glemaitre · 2022-05-19T08:29:11Z

Fixed in #23410

ArturoAmorQ added Bug Needs Triage Issue requires triage labels May 17, 2022

glemaitre removed the Needs Triage Issue requires triage label May 17, 2022

jeremiedbb added Blocker Performance Regression Needs Triage Issue requires triage labels May 17, 2022

glemaitre added High Priority High priority issues and pull requests and removed Blocker Performance Needs Triage Issue requires triage labels May 17, 2022

ogrisel added this to the 1.1.1 milestone May 17, 2022

ogrisel added the Blocker label May 17, 2022

thomasjpfan self-assigned this May 17, 2022

thomasjpfan mentioned this issue May 17, 2022

FIX Fixes performance regression in trees #23404

Closed

glemaitre closed this as completed May 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`DecisionTreeClassifier` became slower in v1.1 when fitting encoded variables #23397

`DecisionTreeClassifier` became slower in v1.1 when fitting encoded variables #23397

ArturoAmorQ commented May 17, 2022 •

edited by glemaitre

lesteve commented May 17, 2022 •

edited

lesteve commented May 17, 2022 •

edited

glemaitre commented May 17, 2022

ogrisel commented May 17, 2022

glemaitre commented May 17, 2022 •

edited

lesteve commented May 17, 2022

ogrisel commented May 17, 2022 •

edited

ogrisel commented May 17, 2022

glemaitre commented May 19, 2022

DecisionTreeClassifier became slower in v1.1 when fitting encoded variables #23397

DecisionTreeClassifier became slower in v1.1 when fitting encoded variables #23397

Comments

ArturoAmorQ commented May 17, 2022 • edited by glemaitre

Describe the bug

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

lesteve commented May 17, 2022 • edited

lesteve commented May 17, 2022 • edited

glemaitre commented May 17, 2022

ogrisel commented May 17, 2022

glemaitre commented May 17, 2022 • edited

lesteve commented May 17, 2022

ogrisel commented May 17, 2022 • edited

ogrisel commented May 17, 2022

glemaitre commented May 19, 2022

`DecisionTreeClassifier` became slower in v1.1 when fitting encoded variables #23397

`DecisionTreeClassifier` became slower in v1.1 when fitting encoded variables #23397

ArturoAmorQ commented May 17, 2022 •

edited by glemaitre

lesteve commented May 17, 2022 •

edited

lesteve commented May 17, 2022 •

edited

glemaitre commented May 17, 2022 •

edited

ogrisel commented May 17, 2022 •

edited