[WIP] Coreset for KMeans and GaussianMixture clustering #799

Open · wants to merge 4 commits into main
Conversation

@remiadon commented Feb 24, 2021

Introduce a Coreset meta-estimator that samples a subset of the original data (preserving its geometric shape) and passes this sample to a scikit-learn estimator.

original paper
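For illustration, here is a minimal usage sketch of the intended API (the dask_ml.cluster.Coreset import path and the constructor signature are assumptions on my part, not the final interface):

```python
# Hypothetical usage of the Coreset meta-estimator proposed in this PR.
import dask.array as da
from sklearn.cluster import KMeans

from dask_ml.cluster import Coreset  # assumed import path

# a large, dask-backed dataset
X = da.random.random((150_000, 2), chunks=(10_000, 2))

# Coreset samples a small subset of X that preserves its geometry,
# then fits the wrapped scikit-learn estimator on that subset
est = Coreset(KMeans(n_clusters=3))
est.fit(X)
```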

TODO

  • unit tests
  • documentation
  • .fit() currently returns the scikit-learn estimator; make it return self (the dask meta-estimator)
  • validate with plots and comparisons against other clustering algorithms, as in this page from scikit-learn
  • compare Coreset + sklearn.KMeans with dask_ml.KMeans
  • check behaviour with the refit param in GridSearchCV
  • try different coreset sizes for hard vs. soft clustering, as mentioned in Section 6

@remiadon (Author) commented:

So far I have run this comparison, taking this sklearn doc as a baseline:

[comparison plot omitted]

The original data has 150k rows and 2 features; 200 points are sampled with the Coreset class.

Notes:

  • I experienced very poor runtime performance with small chunks=(100, 2) when building the dask.array.
  • The current graph shows running times with a chunk size of (10_000, 2).
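Roughly, the setup behind this comparison looks like the following (a sketch consistent with the numbers quoted above; the exact benchmark script differs):

```python
# Sketch of the benchmark input: 150k rows, 2 features, as a dask.array.
import dask.array as da
from sklearn.datasets import make_blobs

X_np, _ = make_blobs(n_samples=150_000, n_features=2, random_state=0)

# chunks=(100, 2) creates 1500 tiny chunks and was very slow to build;
# chunks=(10_000, 2) (15 chunks) is what the plot above was run with
X = da.from_array(X_np, chunks=(10_000, 2))
```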

@remiadon mentioned this pull request Feb 24, 2021
@remiadon (Author) commented May 3, 2021

Here is a global view of the learning process: [diagram omitted]

@TomAugspurger (Member) commented:

Thanks for continuing to work on this.

> I experienced very poor runtime performance with small chunks=(100, 2) when building the dask.array.

Depending on the scheduler, there is roughly 10-200 microseconds of overhead per task (https://docs.dask.org/en/latest/institutional-faq.html?highlight=overhead#how-well-does-dask-scale-what-are-dask-s-limitations).
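To make that concrete, a back-of-the-envelope task count for the two chunk sizes mentioned above (the per-task figure comes from the linked FAQ; one task per chunk per operation is a simplification):

```python
# Scheduler overhead grows with the number of tasks, i.e. with the
# number of chunks.
n_rows = 150_000
per_task_overhead = 200e-6  # seconds, upper end of the 10-200 us range

for chunk_rows in (100, 10_000):
    n_chunks = n_rows // chunk_rows  # one task per chunk per operation
    print(chunk_rows, n_chunks, n_chunks * per_task_overhead)
# 100    -> 1500 chunks, ~0.3 s of pure overhead per operation
# 10_000 ->   15 chunks, ~0.003 s
```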

Can you think of anyone from the scikit-learn side who would be able to review this? I'm not too familiar with the coreset idea / Gaussian mixtures. When you're ready, I can take a look at things from the Dask side.

@remiadon (Author) commented May 13, 2021

@TomAugspurger my pleasure. Concerning the review from the sklearn side, I think Jérémie du Boisberranger worked on a set of runtime/memory benchmarks for KMeans, whose results were presented at the 2019 scikit-learn consortium.

Concerning interactions with sklearn.GaussianMixture, I suggest we ask @gmaze, who initially raised the issue about adding Gaussian Mixture Models. He is certainly better informed than I am on how we should validate that our Coreset class can be used (with care?) with GaussianMixture. My main intuition is that sklearn.GaussianMixture uses KMeans internally to initialize the cluster centers (see the init_params parameter). The plot at the top of this page visually confirms that the Coreset(GaussianMixture) instance succeeds in clustering the flattened shapes in the middle, where KMeans fails.
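For reference, that initialization coupling can be seen directly in scikit-learn (a small self-contained example, not code from this PR):

```python
# GaussianMixture initializes its centers with k-means by default.
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

X, _ = make_moons(n_samples=1_000, noise=0.05, random_state=0)

# init_params="kmeans" is the default; "random" skips the KMeans step
gm = GaussianMixture(n_components=2, init_params="kmeans", random_state=0)
gm.fit(X)
print(gm.means_)  # centers initialized by KMeans, then refined by EM
```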

Also, I would need help on how to properly compare runtimes with sklearn. My current benchmark takes the sklearn.KMeans class as the opponent, but sklearn also has MiniBatchKMeans, which is meant to be faster.
This is to make sure that (see the sketch after this list):

  • Coreset(KMeans()).fit() runs as fast as MiniBatchKMeans for datasets that fit in memory
  • Coreset(KMeans()).fit() can scale to out-of-memory datasets and still deliver good clustering quality
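Concretely, the in-memory half of that comparison could look something like this (the timing harness and the dask_ml.cluster.Coreset import path are my own assumptions, not final code):

```python
# Rough timing comparison: Coreset(KMeans) vs plain MiniBatchKMeans.
import time

import dask.array as da
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

from dask_ml.cluster import Coreset  # assumed import path

X_np, _ = make_blobs(n_samples=150_000, n_features=2, random_state=0)
X = da.from_array(X_np, chunks=(10_000, 2))

for name, fit in [
    ("MiniBatchKMeans", lambda: MiniBatchKMeans(n_clusters=3).fit(X_np)),
    ("Coreset(KMeans)", lambda: Coreset(KMeans(n_clusters=3)).fit(X)),
]:
    start = time.perf_counter()
    fit()
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```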
