
FIX Fixes performance regression in trees #23404

Closed · wants to merge 4 commits

Conversation

@thomasjpfan (Member) commented May 17, 2022

Reference Issues/PRs

Fixes #23397

What does this implement/fix? Explain your changes.

This PR adds the heapsort part of introsort back into simultaneous_sort as a flag.

Using the benchmark for low cardinality, I get 3.24 s on main, 0.11 s with this PR, and 0.07 s on 1.0.X.

On the high cardinality benchmark below, this PR performs about the same as main and remains faster than 1.0.1.

High cardinality benchmark script:
from time import perf_counter
from statistics import mean, stdev

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from collections import defaultdict


N_SAMPLES = [1_000, 5_000, 10_000, 20_000]
N_REPEATS = 5

results = defaultdict(list)

for n_samples in N_SAMPLES:
    for n_repeat in range(N_REPEATS):
        X, y = make_classification(
            random_state=n_repeat, n_samples=n_samples, n_features=100
        )
        tree = DecisionTreeClassifier(random_state=n_repeat)
        start = perf_counter()
        tree.fit(X, y)
        duration = perf_counter() - start
        results[n_samples].append(duration)
    results_mean, results_stdev = mean(results[n_samples]), stdev(results[n_samples])
    print(f"n_samples={n_samples} with {results_mean:.3f} +/- {results_stdev:.3f}")

This PR

n_samples=1000 with 0.043 +/- 0.006
n_samples=5000 with 0.410 +/- 0.116
n_samples=10000 with 1.085 +/- 0.078
n_samples=20000 with 3.276 +/- 0.484

main

n_samples=1000 with 0.044 +/- 0.006
n_samples=5000 with 0.398 +/- 0.108
n_samples=10000 with 1.048 +/- 0.077
n_samples=20000 with 3.179 +/- 0.466

1.0.1

n_samples=1000 with 0.049 +/- 0.007
n_samples=5000 with 0.472 +/- 0.128
n_samples=10000 with 1.240 +/- 0.086
n_samples=20000 with 3.810 +/- 0.560

@ogrisel (Member) commented May 18, 2022

I also tried the low cardinality benchmark:

  • on this branch:
In [2]: %timeit tree.fit(X, y)
175 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • on 1.0.2:
In [2]: %timeit tree.fit(X, y)
97.2 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So this is almost a 2x performance regression on low cardinality data.

I am not sure the small perf gain we get on the high cardinality case is worth it.

EDIT: The 2x slowdown factor persists when I switch n_samples from 50k to 200k in the low cardinality benchmark.

@ogrisel (Member) commented May 18, 2022

I have tried to re-implement the old way of doing the partitioning in quicksort as part of thomasjpfan#109. However, this degrades the performance even more.

The only thing that I have not tried to reimplement is the manual tail-call elimination optimisation of the second recursive call (see the sketch below).

:mod:`sklearn.tree`
...................

- |Fix| Fixes performance regression :class:`tree.DecisionTreeClassifier`,

Inline review comment with a suggested change:

- |Fix| Fixes performance regression with low cardinality features for
  :class:`tree.DecisionTreeClassifier`,

@ogrisel (Member) commented May 18, 2022

I explored the hypothesis of using tail call elimination in thomasjpfan#110 and this is not the cause of the slowdown either.

I also tried to use log instead of log2 to configure the switch to heapsort and that did not seem to be significant either...

@ogrisel (Member) left a comment

I think we should try some profiling to try to identify the discrepancy.

We could also try to reimplement this on top of a fork of the code from before the switch to typed memory views. Not sure if it's related or not.

In the meantime, here are some nitpicks.

    if use_introsort == 1:
        _simultaneous_sort(values, indices, size, 2 * <int>log2(size), 1)
    else:
        _simultaneous_sort(values, indices, size, -1, 0)

Inline review comment:
The return type is int but we never return anything (0 is implicit in this case I guess).

Let's switch to void?

                           size - pivot_idx - 1)
        _simultaneous_sort(values + pivot_idx + 1,
                           indices + pivot_idx + 1,
                           size - pivot_idx - 1, max_depth - 1, use_introsort)
    return 0

Inline review comment:
Same comment for the private helper function: the return type should be void.


cdef inline void heapsort(
    floating* values,
    ITYPE_t* samples,

Inline review comment:
This should be renamed to indices to use a consistent notation with _simultaneous_sort.

@glemaitre (Member) commented:

I used the current sorting code with pointers (instead of the typed memory views) and I get the following benchmark results:

with 1.0.2:

In [3]: %timeit tree.fit(X, y)
69.7 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

with the current PR:

In [3]: %timeit tree.fit(X, y)
114 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using pointers instead of typed memoryviews:

In [4]: %timeit tree.fit(X, y)
114 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So it is not coming from the memoryviews.

@ogrisel (Member) left a comment

I have no idea what can explain the remaining perf difference. I tried to do some profiling with py-spy record --format=speedscope -o out.speedscope --native -- python bench_script.py and indeed the difference seems to happen below the sorting calls in BestSplitter, but since all the code is inlined I cannot see below that level.

Maybe linux perf could help. Alternatively, one could instrument the code with counters to record the number of times each of the outer quicksort and inner heapsort functions are called in both branches.

However, I don't have the time to do it now, so I am fine with merging this PR because it already fixes most of the regression.

cdef int simultaneous_sort(

cdef inline void sift_down(
    floating* values,
    ITYPE_t* samples,

Inline review comment:
samples => indices to be renamed here as well.

@lesteve (Member) commented May 18, 2022

FWIW reverting #22868 goes back to 1.0.2 performance for me, so it does indicate that #22868 is the only thing causing the performance regression (i.e. that there are no other changes at play):

# reverts https://github.com/scikit-learn/scikit-learn/pull/22868
git revert 4cf932d98
make in  # shortcut for "make inplace": rebuild the compiled extensions in place
# run your benchmarks here

@ogrisel (Member) commented May 18, 2022

@lesteve can you please do a side-PR that does this on top of the current main? As far as I understand there is a bit of tweaking to do to use the memory views for Xf and samples.

@lesteve (Member) commented May 18, 2022

> @lesteve can you please do a side-PR that does this on top of the current main? As far as I understand there is a bit of tweaking to do to use the memory views for Xf and samples.

I opened #23410. And indeed there were some conflicts to fix; I was probably navigating the history too much, so I had reverted from somewhere other than main, and that revert did not have any conflicts when I posted my previous message ...

I get the same performance as 1.0.2 in #23410.

@ogrisel (Member) commented May 19, 2022

Closing in favor of #23410, which still requires a custom backport.

@ogrisel closed this May 19, 2022

Successfully merging this pull request may close these issues:

DecisionTreeClassifier became slower in v1.1 when fitting encoded variables
4 participants