[WIP] Coreset for KMeans and GaussianMixture clustering #799

Open · wants to merge 4 commits into main
Conversation

@remiadon commented Feb 24, 2021

Introduce a Coreset meta-estimator that samples a subset of the original data (preserving its geometric shape) and passes this sample to a scikit-learn estimator.

original paper
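For illustration, here is a minimal usage sketch of the intended API (the dask_ml.cluster.Coreset import path and the constructor signature are assumptions on my part, not the final interface):

```python
# Hypothetical usage of the Coreset meta-estimator proposed in this PR.
import dask.array as da
from sklearn.cluster import KMeans

from dask_ml.cluster import Coreset  # assumed import path

# a large, dask-backed dataset
X = da.random.random((150_000, 2), chunks=(10_000, 2))

# Coreset samples a small subset of X that preserves its geometry,
# then fits the wrapped scikit-learn estimator on that subset
est = Coreset(KMeans(n_clusters=3))
est.fit(X)
```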

TODO

  • unit tests
  • documentation
  • .fit() currently returns the scikit-learn estimator; make it return self (the dask meta-estimator)
  • validate with plots and comparisons against other clustering algorithms, as in this page from scikit-learn
  • compare Coreset + sklearn.KMeans with dask_ml.KMeans
  • check behaviour with the refit param in GridSearchCV
  • try different coreset sizes for hard vs. soft clustering, as mentioned in Section 6

@remiadon (Author) commented:

So far I have run this comparison, taking this sklearn doc as a baseline:

[comparison plot omitted]

The original data has 150k rows and 2 features; 200 points are sampled with the Coreset class.

Notes:

  • I experienced very poor runtime performance with small chunks=(100, 2) when building the dask.array.
  • The current graph shows running times with a chunk size of (10_000, 2).
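Roughly, the setup behind this comparison looks like the following (a sketch consistent with the numbers quoted above; the exact benchmark script differs):

```python
# Sketch of the benchmark input: 150k rows, 2 features, as a dask.array.
import dask.array as da
from sklearn.datasets import make_blobs

X_np, _ = make_blobs(n_samples=150_000, n_features=2, random_state=0)

# chunks=(100, 2) creates 1500 tiny chunks and was very slow to build;
# chunks=(10_000, 2) (15 chunks) is what the plot above was run with
X = da.from_array(X_np, chunks=(10_000, 2))
```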

@remiadon mentioned this pull request Feb 24, 2021
@remiadon (Author) commented May 3, 2021

Here is a global view of the learning process: [diagram omitted]

@TomAugspurger (Member) commented:

Thanks for continuing to work on this.

> I experienced very poor runtime performance with small chunks=(100, 2) when building the dask.array.

Depending on the scheduler, there is roughly 10-200 microseconds of overhead per task (https://docs.dask.org/en/latest/institutional-faq.html?highlight=overhead#how-well-does-dask-scale-what-are-dask-s-limitations).
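To make that concrete, a back-of-the-envelope task count for the two chunk sizes mentioned above (the per-task figure comes from the linked FAQ; one task per chunk per operation is a simplification):

```python
# Scheduler overhead grows with the number of tasks, i.e. with the
# number of chunks.
n_rows = 150_000
per_task_overhead = 200e-6  # seconds, upper end of the 10-200 us range

for chunk_rows in (100, 10_000):
    n_chunks = n_rows // chunk_rows  # one task per chunk per operation
    print(chunk_rows, n_chunks, n_chunks * per_task_overhead)
# 100    -> 1500 chunks, ~0.3 s of pure overhead per operation
# 10_000 ->   15 chunks, ~0.003 s
```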

Can you think of anyone from the scikit-learn side who would be able to review this? I'm not too familiar with the coreset idea / Gaussian mixtures. When you're ready, I can take a look at things from the Dask side.

@remiadon (Author) commented May 13, 2021

@TomAugspurger my pleasure. Concerning the review from the sklearn side, I think Jérémie du Boisberranger worked on a set of runtime/memory benchmarks for KMeans, whose results were presented at the 2019 scikit-learn consortium.

Concerning interactions with sklearn.GaussianMixture, I suggest we ask @gmaze, who initially raised the issue about adding Gaussian Mixture Models. He is certainly better informed than I am on how we should validate that our Coreset class can be used (with care?) with GaussianMixture. My main intuition is that sklearn.GaussianMixture uses KMeans internally to initialize the cluster centers (see the init_params parameter). The plot at the top of this page visually confirms that the Coreset(GaussianMixture) instance succeeds in clustering the flattened shapes in the middle, where KMeans fails.
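For reference, that initialization coupling can be seen directly in scikit-learn (a small self-contained example, not code from this PR):

```python
# GaussianMixture initializes its centers with k-means by default.
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

X, _ = make_moons(n_samples=1_000, noise=0.05, random_state=0)

# init_params="kmeans" is the default; "random" skips the KMeans step
gm = GaussianMixture(n_components=2, init_params="kmeans", random_state=0)
gm.fit(X)
print(gm.means_)  # centers initialized by KMeans, then refined by EM
```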

Also, I would need help on how to properly compare runtimes with sklearn. My current benchmark takes the sklearn.KMeans class as the opponent, but sklearn also has MiniBatchKMeans, which is meant to be faster.
This is to make sure that (see the sketch after this list):

  • Coreset(KMeans()).fit() runs as fast as MiniBatchKMeans for datasets that fit in memory
  • Coreset(KMeans()).fit() can scale to out-of-memory datasets and still deliver good clustering quality
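Concretely, the in-memory half of that comparison could look something like this (the timing harness and the dask_ml.cluster.Coreset import path are my own assumptions, not final code):

```python
# Rough timing comparison: Coreset(KMeans) vs plain MiniBatchKMeans.
import time

import dask.array as da
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

from dask_ml.cluster import Coreset  # assumed import path

X_np, _ = make_blobs(n_samples=150_000, n_features=2, random_state=0)
X = da.from_array(X_np, chunks=(10_000, 2))

for name, fit in [
    ("MiniBatchKMeans", lambda: MiniBatchKMeans(n_clusters=3).fit(X_np)),
    ("Coreset(KMeans)", lambda: Coreset(KMeans(n_clusters=3)).fit(X)),
]:
    start = time.perf_counter()
    fit()
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```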
