DBSCAN does not make outliers clear #33

Open
lilyminium opened this issue Mar 1, 2020 · 2 comments
@lilyminium
Member

lilyminium commented Mar 1, 2020

Expected behavior

DBSCAN is a clustering method that can identify outliers. I expect these outliers to be clearly indicated in some way. I also expect that outliers are treated properly in similarity measures.

Actual behavior

The method implemented in MDAnalysis makes the outlier group (label == -1) look like an actual cluster, because the labels are made to start from 0. This doesn't ultimately change the result, as encode_centroid_info drops the label information anyway, but it means the outliers are never distinguishable from a real cluster.

https://github.com/MDAnalysis/mdanalysis/blob/9bcf6f4c118e1ea137e8514bd60cbd1cd1972062/package/MDAnalysis/analysis/encore/clustering/ClusteringMethod.py#L300-L306
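A minimal sketch of the relabelling problem. It assumes the wrapper shifts every label up by one when noise is present, which is one reading of the linked lines and not a copy of them. The label values are made up, in the style of scikit-learn's DBSCAN output, where -1 marks noise.

```python
# Hypothetical labels for six frames; -1 marks noise (outliers),
# as in scikit-learn's DBSCAN output.
raw_labels = [0, 0, -1, 1, 1, -1]

# Assumed behaviour: shift the labels so that they start from 0.
if min(raw_labels) == -1:
    shifted = [label + 1 for label in raw_labels]
else:
    shifted = list(raw_labels)

# After the shift the outliers carry label 0 and look like a third cluster.
n_clusters_apparent = len(set(shifted))
n_clusters_real = len(set(raw_labels) - {-1})

print(shifted)
print(n_clusters_apparent, n_clusters_real)
```

Once the labels are shifted, nothing downstream can tell that label 0 used to be the noise group.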

Also, calling the first frame in each cluster the centroid, without stating this very clearly in the docs, seems like a bad idea. It also gives the outlier group a centroid.
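To make the "first frame as centroid" point concrete, here is a small sketch. The helper `first_member_per_label` is hypothetical and only mimics the behaviour described above; it is not the MDAnalysis code.

```python
def first_member_per_label(labels):
    """Map each label to the index of the first frame that carries it."""
    first = {}
    for index, label in enumerate(labels):
        # setdefault keeps the earliest index seen for each label
        first.setdefault(label, index)
    return first


# Hypothetical DBSCAN-style labels; -1 marks noise.
labels = [0, 0, -1, 1, 1, -1]
representatives = first_member_per_label(labels)

# Frame 2 ends up as the "centroid" of the noise group,
# even though outliers have no meaningful centre.
print(representatives)
```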

Finally, ClusterCollection does not keep the cluster labels. This makes it hard to look for special (i.e. negative) cluster labels.

Current version of MDAnalysis

  • Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)") 0.20.2-dev
  • Which version of Python (python -V)?
  • Which operating system?

Possible fix

Easy option

  • Don't alter DBSCAN's output
  • Add a warning and note in the docs that the "centroid" is the first frame of that cluster
  • Figure out the first frame in the outlier group (which becomes the "centroid" in ClusterCollection) and add a warning that it's not a real cluster

More work option

  • Reconstruct the ClusterCollection class with a more intuitive interface
    • each cluster should not require a centroid
    • each cluster should retain its label from the underlying clustering library (scikit-learn in the case of DBSCAN)
    • should be able to label a "cluster" as outliers
    • it would be nice to link the cluster members to frames of the universes in the ensemble
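A rough sketch of what such an interface could look like. All names here (`LabelledCluster`, `group_by_label`, `NOISE_LABEL`) are hypothetical and do not exist in MDAnalysis; members are plain frame indices for brevity, and linking them to universes is left out.

```python
from dataclasses import dataclass
from typing import List, Optional

NOISE_LABEL = -1  # label scikit-learn's DBSCAN assigns to noise


@dataclass
class LabelledCluster:
    """Hypothetical cluster record that keeps the original label."""
    label: int
    members: List[int]              # frame indices in this cluster
    centroid: Optional[int] = None  # optional: not every method defines one

    @property
    def is_outliers(self) -> bool:
        return self.label == NOISE_LABEL


def group_by_label(labels):
    """Group frame indices by label, keeping the labels untouched."""
    groups = {}
    for frame, label in enumerate(labels):
        groups.setdefault(label, []).append(frame)
    return [LabelledCluster(label=lab, members=mem)
            for lab, mem in sorted(groups.items())]


clusters = group_by_label([0, 0, -1, 1, 1, -1])
real_clusters = [c for c in clusters if not c.is_outliers]
```

With labels retained, looking for special (negative) labels becomes a simple filter, and the outlier group no longer needs a made-up centroid.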
@mtiberti mtiberti self-assigned this Mar 2, 2020
@lilyminium
Member Author

This also causes problems for ensemble similarity analysis, because the outlier "cluster" is treated like a real cluster. If a conformation in trajectory A and a conformation in trajectory B both fall in the outlier cluster, that counts as a point of similarity, when in reality these conformations should be treated as unrelated.
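A toy illustration of the effect. The `overlap` function is a simple histogram overlap used as a stand-in, not the similarity measure MDAnalysis actually computes, and the labels are made up. Dropping the noise frames is only one possible treatment.

```python
from collections import Counter


def cluster_fractions(labels, keep_noise):
    """Fraction of frames per cluster label; optionally drop noise (-1)."""
    if not keep_noise:
        labels = [label for label in labels if label != -1]
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def overlap(p, q):
    """Histogram overlap: 0.0 means no shared clusters, 1.0 means identical."""
    return sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in set(p) | set(q))


# Hypothetical labels: A and B share no real cluster, only outliers.
labels_a = [0, 0, -1, -1]
labels_b = [1, 1, -1, -1]

with_noise = overlap(cluster_fractions(labels_a, True),
                     cluster_fractions(labels_b, True))
without_noise = overlap(cluster_fractions(labels_a, False),
                        cluster_fractions(labels_b, False))

print(with_noise, without_noise)
```

The two trajectories share no real cluster, yet counting the noise group as a cluster makes them appear half overlapping.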

@IAlibay IAlibay transferred this issue from MDAnalysis/mdanalysis Sep 6, 2023
@mtiberti

We will update the documentation and code to add a warning when DBSCAN is used, so that users are aware of this issue.
