
LocalOutlierFactor might not work with duplicated samples #27839

Open
glemaitre opened this issue Nov 24, 2023 · 2 comments · May be fixed by #28773
@glemaitre
Member

This is an investigation following the discussion in #27838.

LocalOutlierFactor might be difficult to use when a sample is duplicated more times than n_neighbors. In that case, the distance to those neighbors is 0, so the local reachability density is infinite (or, in the implementation, 1 / 1e-10). The issue then spreads to samples next to such a peaky local density: they may use the 1 / 1e-10 as their measure and thus get a very negative negative_outlier_factor_, even though their value can be really close to the one of the plateau. Here is a minimal synthetic example showing the issue:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
x = rng.permutation(np.hstack([
    [0.1] * 10,  # constant values
    np.linspace(0.1, 0.2, num=30),
    rng.random(5) * 100  # the clear outliers
]))
X = x.reshape(-1, 1)

lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1)
outliers = lof.fit_predict(X)

indices = np.where(outliers == -1)
# check showing that some flagged outliers come from the linspace part
print(X[indices])

print(lof.negative_outlier_factor_[indices])
array([[ 0.10344828],
       [26.97867138],
       [81.32702392],
       [63.69616873],
       [ 0.10689655]])

array([-3.31034492e+07, -1.22583005e+03, -1.10083976e+03, -9.37195958e+02,
       -4.13793114e+07])

In the results above, the first and last values should not be considered outliers, but because their neighbors come from the constant part (i.e. the plateau at 0.1), the local reachability density is 1e10 and thus their negative outlier factor drops to the order of -1e7.
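One way to detect this degenerate case after fitting is to check whether any training sample has a k-distance of exactly zero, which happens precisely when a value is duplicated more than n_neighbors times. A minimal sketch (has_zero_kdistance is a hypothetical helper, not scikit-learn API; it reuses X from the snippet above):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def has_zero_kdistance(lof):
    # Distances from each training sample to its n_neighbors nearest
    # neighbors (the sample itself is excluded by kneighbors()).
    distances, _ = lof.kneighbors()
    # A zero k-distance means the sample has at least n_neighbors exact
    # duplicates, so its local reachability density saturates at 1e10.
    return bool(np.any(distances[:, -1] == 0.0))

lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1).fit(X)
print(has_zero_kdistance(lof))  # True: the plateau at 0.1 repeats more than 5 times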

Running the same code without the constant part will give:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
x = rng.permutation(np.hstack([
    np.linspace(0.1, 0.2, num=30),
    rng.random(5) * 100  # the clear outliers
]))
X = x.reshape(-1, 1)

lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1)
outliers = lof.fit_predict(X)

indices = np.where(outliers == -1)
# check showing that only the clear outliers are flagged this time
print(X[indices])

print(lof.negative_outlier_factor_[indices])
[[81.32702392]
 [63.69616873]
 [ 4.09735239]
 [26.97867138]]
[-1100.83976485  -937.19595768  -237.40975167 -1225.83005019]

which is closer to what one would expect.

Now, my question is whether we can have a strategy to limit this corner case, which is currently ill-defined algorithmically. A potential solution would be to detect such extreme values and raise a warning mentioning that there are duplicated samples and that increasing n_neighbors could alleviate the problem.
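For concreteness, such a check could look roughly like this (warn_on_duplicates is a hypothetical helper, not an existing scikit-learn function):

import warnings
import numpy as np

def warn_on_duplicates(X, n_neighbors):
    # Count exact duplicate rows: if a row occurs more than n_neighbors
    # times, each copy has n_neighbors duplicates among its neighbors,
    # so its k-distance collapses to zero.
    _, counts = np.unique(X, axis=0, return_counts=True)
    max_count = int(counts.max())
    if max_count > n_neighbors:
        warnings.warn(
            f"X contains a sample duplicated {max_count} times, which exceeds "
            f"n_neighbors={n_neighbors}; negative_outlier_factor_ may saturate "
            f"around -1e7. Consider setting n_neighbors >= {max_count}.",
            UserWarning,
        )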

ping @ngoix @albertcthomas @agramfort if you have any input.

@github-actions github-actions bot added the Needs Triage label Nov 24, 2023
@glemaitre glemaitre added the Bug label and removed the Needs Triage label Nov 24, 2023
@ArturoAmorQ ArturoAmorQ changed the title LocalFactorOutlier might not work with duplicated samples LocalOutlierFactor might not work with duplicated samples Mar 1, 2024
@TinusChen commented Mar 15, 2024

Another potential approach:

Add a toggle parameter that lets developers handle duplicated samples manually. The parameter would indicate whether the algorithm should automatically detect and handle duplicated samples, or leave them for developers to deal with through strategies of their own.

For example, the implementation could introduce a parameter named handle_duplicates (or similar) whose default value handles duplicated samples automatically. Developers who run into the issue described above could set this parameter to False, preventing the algorithm from handling duplicated samples itself and allowing them to deal with these cases on their own.

This design offers more flexible control: developers can choose how to handle duplicated samples as needed, while the automated behavior of the algorithm is retained by default.
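A rough sketch of the proposed API (handle_duplicates is hypothetical and not an existing LocalOutlierFactor parameter; the automatic strategy shown is one possibility among others):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_fit_predict(X, n_neighbors=20, handle_duplicates=True, **kwargs):
    if handle_duplicates:
        # Automatic strategy: grow n_neighbors to at least the largest
        # duplicate count so that no k-distance can collapse to zero.
        _, counts = np.unique(X, axis=0, return_counts=True)
        n_neighbors = max(n_neighbors, int(counts.max()))
    # With handle_duplicates=False, callers are expected to deduplicate
    # or otherwise preprocess the data themselves.
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, **kwargs)
    return lof.fit_predict(X)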

@HenriqueProj

@glemaitre @TinusChen, could I work on this issue? It looks like a great opportunity for me to contribute to this project.

HenriqueProj added a commit to HenriqueProj/scikit-learn that referenced this issue Apr 3, 2024
…cated samples

Previously, when the dataset contained a value repeated more times than the algorithm's number of neighbors, the outliers were miscalculated.
Because the distance between duplicated samples is 0, the local reachability density equals 1e10. This leads to values close to the duplicated ones getting a very low negative outlier factor (below -1e7), which labels them as outliers.
This fix checks whether the minimum negative outlier factor is below -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1, also raising a warning.
Notes: Added a handle_duplicates variable, which allows developers to handle the duplicated values manually, if desired. Also added a memory_limit variable to avoid memory errors on very large datasets; it too can be changed manually by developers.
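Paraphrased as code, the described logic corresponds roughly to the following sketch (fit_with_duplicate_guard is a hypothetical helper, not the actual patch):

import warnings
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def fit_with_duplicate_guard(X, n_neighbors=5):
    lof = LocalOutlierFactor(n_neighbors=n_neighbors).fit(X)
    if lof.negative_outlier_factor_.min() < -1e7:
        # Saturated scores indicate duplicated samples: refit with
        # n_neighbors set past the count of the most frequent value.
        _, counts = np.unique(X, axis=0, return_counts=True)
        new_k = int(counts.max()) + 1
        warnings.warn(
            f"Duplicated samples detected; refitting with n_neighbors={new_k}."
        )
        lof = LocalOutlierFactor(n_neighbors=new_k).fit(X)
    return lof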