
LocalOutlierFactor might not work with duplicated samples #27839

Open
glemaitre opened this issue Nov 24, 2023 · 2 comments · May be fixed by #28773
@glemaitre
Member

This is an investigation following the discussion in #27838.

LocalOutlierFactor might be difficult to use when a sample is duplicated more times than n_neighbors. In that case, the distance to those neighbors is 0, so the local reachability density is infinite (or, in the implementation, 1 / 1e-10). The issue then spreads to samples next to such a peaky local density: they may use the 1 / 1e-10 as their measure and thus get a very negative negative_outlier_factor_, even though their value can be really close to the one of the plateau. Here is a minimal synthetic example showing the issue:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
x = rng.permutation(np.hstack([
    [0.1] * 10,  # constant values
    np.linspace(0.1, 0.2, num=30),
    rng.random(5) * 100  # the clear outliers
]))
X = x.reshape(-1, 1)

lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1)
outliers = lof.fit_predict(X)

indices = np.where(outliers == -1)
# check showing that some flagged outliers come from the linspace part
print(X[indices])

print(lof.negative_outlier_factor_[indices])
array([[ 0.10344828],
       [26.97867138],
       [81.32702392],
       [63.69616873],
       [ 0.10689655]])

array([-3.31034492e+07, -1.22583005e+03, -1.10083976e+03, -9.37195958e+02,
       -4.13793114e+07])

In the results above, the first and last values should not be considered outliers, but because their neighbors come from the constant part (i.e. the plateau at 0.1), the local reachability density is 1e10 and thus their negative outlier factor drops to the order of -1e7.
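One way to detect this degenerate case after fitting is to check whether any training sample has a k-distance of exactly zero, which happens precisely when a value is duplicated more than n_neighbors times. A minimal sketch (has_zero_kdistance is a hypothetical helper, not scikit-learn API; it reuses X from the snippet above):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def has_zero_kdistance(lof):
    # Distances from each training sample to its n_neighbors nearest
    # neighbors (the sample itself is excluded by kneighbors()).
    distances, _ = lof.kneighbors()
    # A zero k-distance means the sample has at least n_neighbors exact
    # duplicates, so its local reachability density saturates at 1e10.
    return bool(np.any(distances[:, -1] == 0.0))

lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1).fit(X)
print(has_zero_kdistance(lof))  # True: the plateau at 0.1 repeats more than 5 times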

Running the same code without the constant part will give:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
x = rng.permutation(np.hstack([
    np.linspace(0.1, 0.2, num=30),
    rng.random(5) * 100  # the clear outliers
]))
X = x.reshape(-1, 1)

lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1)
outliers = lof.fit_predict(X)

indices = np.where(outliers == -1)
# check showing that only the clear outliers are flagged this time
print(X[indices])

print(lof.negative_outlier_factor_[indices])
[[81.32702392]
 [63.69616873]
 [ 4.09735239]
 [26.97867138]]
[-1100.83976485  -937.19595768  -237.40975167 -1225.83005019]

which is closer to what one would expect.

Now, my question is whether we can have a strategy to limit this corner case, which is currently ill-defined algorithmically. A potential solution would be to detect such extreme values and raise a warning mentioning that there are duplicated samples and that increasing n_neighbors could alleviate the problem.
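For concreteness, such a check could look roughly like this (warn_on_duplicates is a hypothetical helper, not an existing scikit-learn function):

import warnings
import numpy as np

def warn_on_duplicates(X, n_neighbors):
    # Count exact duplicate rows: if a row occurs more than n_neighbors
    # times, each copy has n_neighbors duplicates among its neighbors,
    # so its k-distance collapses to zero.
    _, counts = np.unique(X, axis=0, return_counts=True)
    max_count = int(counts.max())
    if max_count > n_neighbors:
        warnings.warn(
            f"X contains a sample duplicated {max_count} times, which exceeds "
            f"n_neighbors={n_neighbors}; negative_outlier_factor_ may saturate "
            f"around -1e7. Consider setting n_neighbors >= {max_count}.",
            UserWarning,
        )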

ping @ngoix @albertcthomas @agramfort if you have any input.

@github-actions github-actions bot added the Needs Triage label Nov 24, 2023
@glemaitre glemaitre added the Bug label and removed the Needs Triage label Nov 24, 2023
@ArturoAmorQ ArturoAmorQ changed the title LocalFactorOutlier might not work with duplicated samples LocalOutlierFactor might not work with duplicated samples Mar 1, 2024
@TinusChen commented Mar 15, 2024

Another potential approach:

Add a toggle parameter that lets developers handle duplicated samples manually. The parameter would indicate whether the algorithm should automatically detect and handle duplicated samples, or leave them for developers to deal with through strategies of their own.

For example, the implementation could introduce a parameter named handle_duplicates (or similar) whose default value handles duplicated samples automatically. Developers who run into the issue described above could set this parameter to False, preventing the algorithm from handling duplicated samples itself and allowing them to deal with these cases on their own.

This design offers more flexible control: developers can choose how to handle duplicated samples as needed, while the automated behavior of the algorithm is retained by default.
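A rough sketch of the proposed API (handle_duplicates is hypothetical and not an existing LocalOutlierFactor parameter; the automatic strategy shown is one possibility among others):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_fit_predict(X, n_neighbors=20, handle_duplicates=True, **kwargs):
    if handle_duplicates:
        # Automatic strategy: grow n_neighbors to at least the largest
        # duplicate count so that no k-distance can collapse to zero.
        _, counts = np.unique(X, axis=0, return_counts=True)
        n_neighbors = max(n_neighbors, int(counts.max()))
    # With handle_duplicates=False, callers are expected to deduplicate
    # or otherwise preprocess the data themselves.
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, **kwargs)
    return lof.fit_predict(X)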

@HenriqueProj

@glemaitre @TinusChen, could I work on this issue? It looks like a great opportunity for me to contribute to this project.

HenriqueProj added a commit to HenriqueProj/scikit-learn that referenced this issue Apr 3, 2024
…cated samples

Previously, when the dataset contained a value repeated more times than the algorithm's number of neighbors, the outliers were miscalculated.
Because the distance between duplicated samples is 0, the local reachability density equals 1e10. This leads to values close to the duplicated ones getting a very low negative outlier factor (below -1e7), which labels them as outliers.
This fix checks whether the minimum negative outlier factor is below -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1, also raising a warning.
Notes: Added a handle_duplicates variable, which allows developers to handle the duplicated values manually, if desired. Also added a memory_limit variable to avoid memory errors on very large datasets; it too can be changed manually by developers.
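Paraphrased as code, the described logic corresponds roughly to the following sketch (fit_with_duplicate_guard is a hypothetical helper, not the actual patch):

import warnings
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def fit_with_duplicate_guard(X, n_neighbors=5):
    lof = LocalOutlierFactor(n_neighbors=n_neighbors).fit(X)
    if lof.negative_outlier_factor_.min() < -1e7:
        # Saturated scores indicate duplicated samples: refit with
        # n_neighbors set past the count of the most frequent value.
        _, counts = np.unique(X, axis=0, return_counts=True)
        new_k = int(counts.max()) + 1
        warnings.warn(
            f"Duplicated samples detected; refitting with n_neighbors={new_k}."
        )
        lof = LocalOutlierFactor(n_neighbors=new_k).fit(X)
    return lof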