Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Fix LocalOutlierFactor's output for data with duplicated samples #28773

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

HenriqueProj
Copy link

Reference Issues/PRs

Fixes #27839

What does this implement/fix? Explain your changes.

Previously, when the dataset had values repeat more times than the algorithm's number of neighbors, it miscalculates the outliers.
Because the distance between the duplicated samples is 0, the local reachability density is equal to 1e10. This leads to values that are close to the duplicated values having a really low negative_outlier_factor_ (under -1e7), labeling them as outliers.

This fix checks if the minimum negative_outlier_factor_ is under -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1, also raising a warning.

Notes: Added a handle_duplicates variable, which allows developers to manually handle the duplicate values, if desired.
Also added a memory_limit variable to avoid creating memory errors for really large datasets, which can also be changed manually by developers.

Any other comments?

…cated samples

Previously, when the dataset had values repeat more times than the algorithm's number of neighbors, it miscalculates the outliers.
Because the distance between the duplicated samples is 0, the local reachability density is equal to 1e10. This leads to values that are close to the duplicated values having a really low negative outlier factor (under -1e7), labeling them as outliers.
This fix checks if the minimum negative outlier factor is under -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1, also raising a warning.
Notes: Added a handle_duplicates variable, which allows developers to manually handle the duplicate values, if desired. Also added a memory_limit variable to avoid creating memory errors for really large datasets, which can also be changed manually by developers.
Copy link

github-actions bot commented Apr 5, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: de442f0. Link to the linter CI: here

@ogrisel
Copy link
Member

ogrisel commented Apr 10, 2024

I think I don't like the recursive automatic change to neighbors.

Maybe we should instead just warn the user when we detect the problem with very negative outlier factor values and let the user re-fit the model with a larger value of n_neighbors by themselves.

Removed automatic change to neighbors number and changed the warning
Also changed the associated test, to catch the warning.
@HenriqueProj
Copy link
Author

@ogrisel Changed the code. Now it only raises a warning, as suggested.

sklearn/neighbors/_lof.py Outdated Show resolved Hide resolved
Changed comment according to review

Co-authored-by: Tim Head <betatim@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LocalOutlierFactor might not work with duplicated samples
3 participants