[MRG] Fix LocalOutlierFactor's output for data with duplicated samples #28773

HenriqueProj · 2024-04-05T12:17:51Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Previously, when a value was repeated in the dataset more times than the algorithm's number of neighbors, LocalOutlierFactor miscalculated the outliers.
Because the distance between duplicated samples is 0, their local reachability density is equal to 1e10. As a result, samples close to the duplicated values get a very low negative_outlier_factor_ (under -1e7) and are labeled as outliers.
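A minimal sketch of the behaviour described above, on a made-up toy dataset (the values and the choice of `n_neighbors=5` are assumptions for illustration, not taken from the PR). Ten identical samples exceed `n_neighbors`, so their k-distance is 0 and the scores of nearby samples collapse to extreme values. Re-fitting with `n_neighbors` set to the duplicate count + 1 brings the scores back to an ordinary range.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy data (assumed): the value 0.0 appears 10 times, followed by a few
# distinct values that sit close to it.
X = np.array([[0.0]] * 10 + [[0.01], [0.02], [0.03], [0.04], [0.05]])

# n_neighbors is smaller than the number of duplicates, so every neighbor
# of a duplicated sample is at distance 0.
lof_small = LocalOutlierFactor(n_neighbors=5).fit(X)
default_scores = lof_small.negative_outlier_factor_

# n_neighbors = number of duplicates + 1, so each duplicated sample has
# at least one neighbor at a non-zero distance.
lof_large = LocalOutlierFactor(n_neighbors=11).fit(X)
refit_scores = lof_large.negative_outlier_factor_

print("min score with n_neighbors=5: ", default_scores.min())
print("min score with n_neighbors=11:", refit_scores.min())
```

Depending on the installed scikit-learn version, the first fit may itself emit a warning about duplicate values.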

This fix checks whether the minimum negative_outlier_factor_ is under -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1 and issues a warning.

Notes: Added a handle_duplicates variable, which lets developers handle the duplicate values manually if desired.
Also added a memory_limit variable to avoid memory errors on very large datasets; developers can change it manually as well.

Any other comments?


github-actions · 2024-04-05T12:19:29Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: de442f0.

ogrisel · 2024-04-10T07:42:22Z

I think I don't like the recursive automatic change to neighbors.

Maybe we should instead just warn the user when we detect the problem with very negative outlier factor values and let the user re-fit the model with a larger value of n_neighbors by themselves.

HenriqueProj · 2024-04-22T14:32:27Z

@ogrisel Changed the code. Now it only raises a warning, as suggested.
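For readers who want the same warn-only check in their own code, here is a rough sketch. The helper name `warn_on_extreme_lof` and the warning text are hypothetical and are not the code merged in this PR; only the -1e7 threshold comes from the description above.

```python
import warnings

import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def warn_on_extreme_lof(lof, threshold=-1e7):
    """Hypothetical helper: warn when a fitted LOF has scores below threshold.

    Returns True if the warning was issued, False otherwise.
    """
    if np.min(lof.negative_outlier_factor_) < threshold:
        warnings.warn(
            "Duplicate values may be distorting the outlier factors; "
            "consider re-fitting with a larger n_neighbors.",
            UserWarning,
        )
        return True
    return False


# Toy data (assumed): 10 duplicates of 0.0 plus a few nearby values.
X = np.array([[0.0]] * 10 + [[0.01], [0.02], [0.03], [0.04], [0.05]])
lof = LocalOutlierFactor(n_neighbors=5).fit(X)

# Record warnings raised by the helper only (the fit happened above).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_on_extreme_lof(lof)

# The user then decides how to re-fit, for example with more neighbors.
refit = LocalOutlierFactor(n_neighbors=11).fit(X)
with warnings.catch_warnings(record=True) as caught_refit:
    warnings.simplefilter("always")
    flagged_refit = warn_on_extreme_lof(refit)

print(flagged, flagged_refit)
```

Leaving the re-fit to the caller keeps `n_neighbors` an explicit user choice, which is the trade-off raised in the review comment.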

github-actions bot added the module:neighbors label Apr 5, 2024

HenriqueProj added 3 commits April 8, 2024 18:46

Redo checks: Codecov server error

19cb411

Update changelog and update tests for 100% test coverage

bc069b6

Fix changelog error

c6470c6

Fix: Changed approach according to review

909b25c

Removed automatic change to neighbors number and changed the warning. Also changed the associated test to catch the warning.

betatim reviewed Apr 22, 2024

sklearn/neighbors/_lof.py (outdated)

Update sklearn/neighbors/_lof.py

de442f0

Changed comment according to review. Co-authored-by: Tim Head <betatim@gmail.com>

betatim approved these changes Apr 22, 2024
