Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX fix performance regression in trees with low-cardinality features #23410

Merged
merged 4 commits into from May 19, 2022

Conversation

lesteve
Copy link
Member

@lesteve lesteve commented May 18, 2022

A more conservative alternative to #23404. This reverts #22868 and fixes the conflicts.

With the following benchmark script, I get a similar performance in this PR and in 1.0.2:

n_samples=50000, n_features=10: 0.087 +/- 0.002
# /tmp/test.py
import numpy as np
from time import perf_counter

from statistics import mean, stdev
from collections import defaultdict

from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n_samples, n_features = 50_000, 10

tree = DecisionTreeClassifier()

N_REPEATS = 5
results = defaultdict(list)


def make_data(random_state):
    rng = np.random.RandomState(random_state)
    X = rng.choice([0, 1, 2], size=(n_samples, n_features))
    y = rng.choice([0, 1], size=n_samples)
    return X, y


for n_repeat in range(N_REPEATS):
    X, y = make_data(n_repeat)
    tree = DecisionTreeClassifier(random_state=n_repeat)
    start = perf_counter()
    tree.fit(X, y)
    duration = perf_counter() - start
    results[n_samples].append(duration)

results_mean, results_stdev = mean(results[n_samples]), stdev(results[n_samples])
print(
    f"n_samples={n_samples}, n_features={n_features}: {results_mean:.3f} +/- {results_stdev:.3f}"
)

@lesteve lesteve changed the title Revert use of simultaneous_sort Revert use of simultaneous_sort in trees May 18, 2022
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for merging this a regression fix and explore in a later PR to main how to consolidate the 2 introsort/quicksort implementation with more extensive benchmarking for various data distributions.

@ogrisel
Copy link
Member

ogrisel commented May 18, 2022

Actually I spoke too quickly, we now have:

IndexError: Out of bounds on buffer access (axis 0)

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=42268&view=logs&j=c8afde5f-ef70-5983-62e8-c6b665ad6161&t=7fc5d80c-d67c-560e-65a2-6822419a270a

that was previously fixed in main.

simultaneous_sort(&Xf[start_positive], &samples[start_positive], end - start_positive)
sort(&Xf[start], &samples[start], end_negative - start)
sort(&Xf[start_positive], &samples[start_positive],
end - start_positive)
Copy link
Member

@ogrisel ogrisel May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to re-introduce the if start_positive < end: condition here.

@ogrisel
Copy link
Member

ogrisel commented May 18, 2022

@glemaitre
Copy link
Member

glemaitre commented May 18, 2022

how to consolidate the 2 introsort/quicksort implementation with more extensive benchmarking for various data distributions.

I actually looked at the std::sort that should be an introsort and available in Cython.
However, since we want to do a simultaneous sort the indices and from values, it makes it quite hard to use. Indeed, you want a comparator that needs to rely on a closure (that is in a Cython class furthermore). I could find the following information that shows the pattern: https://stackoverflow.com/a/63800579/6513708

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 once the bug is fixed and we acknowledge the bug in what's new.

@ogrisel
Copy link
Member

ogrisel commented May 18, 2022

Indeed, you want a comparator that needs to rely on a closure (that is in a Cython class furthermore). I could find the following information that shows the pattern: https://stackoverflow.com/a/63800579/6513708

Nice trick! +1 for a follow-up PR for 1.2.

@thomasjpfan
Copy link
Member

thomasjpfan commented May 18, 2022

To use the comparator, we would need to represent the values and indices as an "array of structures". Currently, it is an "structure of arrays".

As for this PR, I am +1 on it as well. Backporting this fix should be easier, since the version on 1.1.X still uses pointers.

@glemaitre
Copy link
Member

glemaitre commented May 18, 2022

To use the comparator, we would need to represent the values and indices as an "array of structures". Currently, it is an "structure of arrays".

Here we only have an array of values (feature values) and an array of indices (sample indices) to sort, isn't it?

@thomasjpfan
Copy link
Member

thomasjpfan commented May 18, 2022

Here we only have an array of values (feature values) and an array of indices (sample indices) to sort, isn't it?

This representation is like a "structure of arrays". Currently, the code is similar to this:

struct MyArrays:
    double* values_array
    int* indices_array

and dual_swap has the index to follow the value.

To use a comparator, the values and indices need to be it's own structure:

struct ValueIndex:
    double value
    int indices

ValueIndex* array_to_sort

This way the index will follow the value when value gets sorted. (The comparator will only consider value for sorting purposes)

@thomasjpfan thomasjpfan added this to the 1.1.1 milestone May 18, 2022
@thomasjpfan thomasjpfan added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label May 18, 2022
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 once this has a change log entry under 1.1.1. (As suggested in #23410 (review))

Backporting will be slightly more involved because the 1.1.X branch still users pointers for Xf, but it should not be too difficult.

sklearn/tree/_splitter.pyx Outdated Show resolved Hide resolved
lesteve and others added 2 commits May 19, 2022 08:41
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@lesteve lesteve changed the title Revert use of simultaneous_sort in trees FIX fix performance regression in trees for low-cardinality features May 19, 2022
@lesteve lesteve changed the title FIX fix performance regression in trees for low-cardinality features FIX fix performance regression in trees with low-cardinality features May 19, 2022
@ogrisel
Copy link
Member

ogrisel commented May 19, 2022

@thomasjpfan the code snippet in https://stackoverflow.com/a/63800579/6513708 shows how to do it directly with a structure of arrays without materializing it as an array of index-value pairs, no? Although I am not sure how it works.

EDIT: actually this solution would not inplace-sort the values, only the indices. We would need an extra step to re-shuffle the values based on the sorted indices which would require a temporary copy. Not necessarily a big deal but we will need to keep that temp buffer around to avoid reallocating it each time. Not sure it's worth it and I am not sure about the CPU cache impact of this extra shuffle w.r.t. the existing inplace implementation.

@ogrisel ogrisel merged commit c217527 into scikit-learn:main May 19, 2022
@lesteve lesteve deleted the revert-simultaneous-sort branch May 19, 2022 08:13
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request May 19, 2022
…scikit-learn#23410)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
glemaitre added a commit that referenced this pull request May 19, 2022
…#23410)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
lesteve added a commit to lesteve/scikit-learn that referenced this pull request May 19, 2022
…scikit-learn#23410)



Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Aug 4, 2022
…scikit-learn#23410)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
glemaitre added a commit that referenced this pull request Aug 5, 2022
…#23410)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
mathijs02 pushed a commit to mathijs02/scikit-learn that referenced this pull request Dec 27, 2022
…scikit-learn#23410)



Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cython module:tree To backport PR merged in master that need a backport to a release branch defined based on the milestone.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants