
Implementing variations of the BIRCH clustering algorithm #28778

Open · jolespin opened this issue Apr 5, 2024 · 1 comment

jolespin commented Apr 5, 2024

Describe the workflow you want to enable

Currently, scikit-learn provides only the basic implementation of the BIRCH clustering algorithm:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html

Just as there are both DBSCAN and HDBSCAN classes, it would be helpful to have ABirch and MBDBirch classes as well.
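
For context, here is a minimal example of the current API, where both the subcluster threshold and the final cluster count must be supplied by hand (the parameter values are arbitrary illustrations, not recommended settings):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=4, cluster_std=0.5, random_state=0)

# Both the subcluster threshold and the final cluster count must be chosen
# by hand; picking a good threshold is exactly what A-BIRCH automates.
birch = Birch(threshold=0.5, n_clusters=4)
labels = birch.fit_predict(X)
```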

Describe your proposed solution

Two additional classes, ABirch and MBDBirch, implementing the A-BIRCH and MBD-BIRCH variants of the algorithm.

Describe alternatives you've considered, if relevant

No response

Additional context

Clustering algorithms are recently regaining attention with the availability of large datasets and the rise of parallelized computing architectures. However, most clustering algorithms suffer from two drawbacks: they do not scale well with increasing dataset sizes and often require proper parametrization which is usually difficult to provide. A very important example is the cluster count, a parameter that in many situations is next to impossible to assess. In this paper we present A-BIRCH, an approach for automatic threshold estimation for the BIRCH clustering algorithm. This approach computes the optimal threshold parameter of BIRCH from the data, such that BIRCH does proper clustering even without the global clustering phase that is usually the final step of BIRCH. This is possible if the data satisfies certain constraints. If those constraints are not satisfied, A-BIRCH will issue a pertinent warning before presenting the results. This approach renders the final global clustering step of BIRCH unnecessary in many situations, which results in two advantages. First, we do not need to know the expected number of clusters beforehand. Second, without the computationally expensive final clustering, the fast BIRCH algorithm will become even faster. For very large data sets, we introduce another variation of BIRCH, which we call MBD-BIRCH, which is of particular advantage in conjunction with A-BIRCH but is independent from it and also of general benefit.

https://www.sciencedirect.com/science/article/pii/S2214579617300151
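
To illustrate the direction this could take, below is a rough sketch of the A-BIRCH idea. The threshold heuristic (a low quantile of pairwise distances on a random sample) is an illustrative stand-in for the paper's Gap-Statistic-based estimation, and `estimate_birch_threshold` is a hypothetical helper, not an existing scikit-learn function:

```python
# Rough sketch of the A-BIRCH idea, NOT the paper's actual estimator: the
# threshold heuristic below is an illustrative stand-in for the paper's
# Gap-Statistic-based estimation, and estimate_birch_threshold is a
# hypothetical helper.
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances


def estimate_birch_threshold(X, sample_size=500, quantile=0.1, seed=0):
    """Guess a Birch threshold from the data itself."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    d = pairwise_distances(X[idx])
    # Use only the strict upper triangle to ignore self-distances.
    return float(np.quantile(d[np.triu_indices_from(d, k=1)], quantile))


X, _ = make_blobs(n_samples=2000, centers=4, cluster_std=0.5, random_state=0)

# n_clusters=None skips BIRCH's final global clustering phase, mirroring the
# paper's point that it becomes unnecessary once the threshold is well chosen.
birch = Birch(threshold=estimate_birch_threshold(X), n_clusters=None)
labels = birch.fit_predict(X)
print(f"{len(np.unique(labels))} leaf subclusters used as final clusters")
```

An MBDBirch class would presumably target the very-large-data setting the abstract mentions, but its design would have to follow the paper.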

jolespin added the Needs Triage and New Feature labels on Apr 5, 2024
lesteve (Member) commented Apr 8, 2024

Thanks for opening an issue!

I am not very familiar with this literature, but I quickly had a look and the paper you mention does not seem to meet our inclusion criteria: https://scikit-learn.org/stable/faq.html#new-algorithms-inclusion-criteria. In particular, it has only 100 citations on Google Scholar.

lesteve removed the Needs Triage label on Apr 10, 2024