Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

min_samples in HDSCAN #28976

Open
schae211 opened this issue May 8, 2024 · 2 comments
Open

min_samples in HDSCAN #28976

schae211 opened this issue May 8, 2024 · 2 comments

Comments

@schae211
Copy link

schae211 commented May 8, 2024

Describe the issue linked to the documentation

I find the description of the min_samples argument in sklearn.cluster.HDBSCAN confusing.

It says "The number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself."

But if I understand everything correctly min_samples corresponds to the $k$ used to compute the core distance $\text{core}_k\left(x\right)$ for every sample $x$ where the $k$'th core distance for some sample $x$ is defined as the distance to the $k$'th nearest-neighbor of $x$ (counting itself). (-> which exactly what is happening in the code here: https://github.com/scikit-learn-contrib/hdbscan/blob/fc94241a4ecf5d3668cbe33b36ef03e6160d7ab7/hdbscan/_hdbscan_reachability.pyx#L45-L47, where it is called min_points)

I don't understand how both of these descriptions are equivalent. I would assume that other people might find that confusing as well.

Link in Code:

min_samples : int, default=None
The number of samples in a neighborhood for a point
to be considered as a core point. This includes the point itself.
When `None`, defaults to `min_cluster_size`.

Link in Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN

Suggest a potential alternative/fix

No response

@schae211 schae211 added Documentation Needs Triage Issue requires triage labels May 8, 2024
@adrinjalali
Copy link
Member

cc @Micky774

@glemaitre
Copy link
Member

Indeed, we used the docstring of the original implementation that reused the DBSCAN information. However, the parameter here have a different meaning: it define the core distance.

So we should make sure to change the different docstrings from the file.

@glemaitre glemaitre added help wanted and removed Needs Triage Issue requires triage labels May 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants