DecisionTreeClassifier became slower in v1.1 when fitting encoded variables #23397
A git bisect points to #22868 (use simultaneous sort in tree splitter).
To reproduce:

```python
# /tmp/test.py
import numpy as np
from time import time
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n_samples, n_features = 50_000, 10
X = rng.choice([0, 1, 2], size=(n_samples, n_features))
y = rng.choice([0, 1], size=n_samples)

tree = DecisionTreeClassifier()
t0 = time()
tree.fit(X, y)
duration = time() - t0
print(f"{duration=:.2f}s")
```

```shell
# 4cf932d98 is the simultaneous sort commit, i.e. https://github.com/scikit-learn/scikit-learn/pull/22868
git checkout 4cf932d98; make in > /dev/null 2>&1; ipython /tmp/test.py
git checkout 4cf932d98~1; make in > /dev/null 2>&1; ipython /tmp/test.py
```

On my machine the fit time goes from 0.1s to ~8s.
Wild guess: maybe the new sort is slow when there are many ties (as is the case in ordinal-encoded categories)?
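One way to see why ties could matter (a toy illustration in pure Python, not scikit-learn's actual Cython splitter code): a quicksort with a plain two-way partition does quadratically many comparisons on low-cardinality data, because every element in an all-equal run ends up on the same side of the pivot, while a three-way partition that groups keys equal to the pivot stays near O(n log n).

```python
# Toy comparison counter: 2-way vs 3-way quicksort on ordinal-encoded data.
# Illustrative only; function names and counting are made up for this sketch.
import random


def quicksort_2way(a):
    """Iterative quicksort with a Lomuto (2-way) partition; returns the
    number of inner-loop key comparisons performed."""
    comparisons = 0
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot, i = a[hi], lo
        for j in range(lo, hi):
            comparisons += 1
            if a[j] <= pivot:  # ties all fall on one side -> lopsided splits
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return comparisons


def quicksort_3way(a):
    """Iterative quicksort with a Dijkstra (3-way) partition: keys equal to
    the pivot are grouped in the middle and never revisited."""
    comparisons = 0
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot, lt, i, gt = a[lo], lo, lo + 1, hi
        while i <= gt:
            comparisons += 1
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1  # equal to pivot: stays in the middle block
        stack.append((lo, lt - 1))
        stack.append((gt + 1, hi))
    return comparisons


rng = random.Random(0)
ordinal = [rng.choice([0, 1, 2]) for _ in range(2_000)]  # many ties
print("2-way comparisons:", quicksort_2way(ordinal.copy()))
print("3-way comparisons:", quicksort_3way(ordinal.copy()))
```

On data with only 3 distinct values the 2-way variant does orders of magnitude more comparisons than the 3-way one, which is consistent with the fit-time blowup on ordinal-encoded columns.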
Reading our documentation, it seems that …
It's weird that the benchmark in the original PR (#22868) did not catch this. Maybe that is because the regression shows up on low-cardinality data, while high-cardinality data was used in the PR benchmark?
I assume so. Maybe this is a case where the introsort would switch from quicksort to heapsort internally, while we currently get stuck in the quicksort. But I don't know anything about sorting :) Maybe it's the median-of-3-killer thing explained in the analysis here: https://en.wikipedia.org/wiki/Introsort
Wild-guessing a bit more (not a sorting algorithm expert either), and looking at https://github.com/scikit-learn/scikit-learn/pull/22868/files#diff-e2cca285e1e883ab1d427120dfa974c1ba83eb6e2f5d5f416bbd99717ca5f5fcL490-L491 which says
Maybe compared to the previous implementation our …
@lesteve here are the timings of your script on my laptop:
So this is indeed a pathological regression for such columns. I think we can try to roll back to the previous implementation for 1.1.1, and maybe later see if it's possible to find a "median of 3 pivot selection" variant in the C++ standard library.
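For readers unfamiliar with the term: "median of 3 pivot selection" just means sampling the first, middle, and last keys of the range and using their median as the quicksort pivot, which avoids the worst-case pivots a fixed choice hits on sorted or constant runs. A minimal sketch (the function name is hypothetical, not an API from any library mentioned here):

```python
def median_of_3(a, lo, hi):
    """Return the index (lo, mid, or hi) holding the median of the three
    sampled keys a[lo], a[mid], a[hi] -- the classic quicksort pivot rule."""
    mid = (lo + hi) // 2
    x, y, z = a[lo], a[mid], a[hi]
    if x <= y <= z or z <= y <= x:
        return mid
    if y <= x <= z or z <= x <= y:
        return lo
    return hi
```

A quicksort would swap the chosen index to the pivot position before partitioning. Note that median-of-3 by itself fixes bad pivot choice but not the many-ties problem; handling ties well additionally needs a partition scheme that groups keys equal to the pivot.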
Fixed in #23410
Describe the bug
The evaluation of a pipeline that encodes categorical data takes around 8 times longer with v1.1 than with v1.0.2.
Steps/Code to Reproduce
Expected Results
~450ms
Actual Results
3s
Versions