LocalOutlierFactor might not work with duplicated samples #27839
Comments
Another potential approach: add a toggle parameter that lets programmers handle duplicate samples manually. The parameter would indicate whether the algorithm should automatically identify and handle duplicate samples, or let programmers intervene and handle them themselves, exploring other strategies as well. In terms of implementation, a dedicated parameter could be introduced for this. This design provides more flexible control, letting developers choose how to handle duplicate samples as needed while retaining the automated behavior of the algorithm. Such a toggle can serve as an additional option to address the issue described below and offers more customization.
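As a rough illustration of this idea, here is a hedged sketch of such a toggle implemented as a wrapper around scikit-learn's estimator. The function name `fit_lof` and the parameter `handle_duplicates` are assumptions for illustration only, not an existing scikit-learn API:

```python
import warnings

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def fit_lof(X, n_neighbors=20, handle_duplicates="auto"):
    """Hypothetical wrapper sketching the proposed toggle (not scikit-learn API).

    handle_duplicates="auto"   -> detect duplicate-heavy data and warn
    handle_duplicates="manual" -> leave duplicate handling to the caller
    """
    if handle_duplicates == "auto":
        # Count the most frequent row; if it occurs n_neighbors times or
        # more, the k-distance collapses to 0 and the factors degenerate.
        _, counts = np.unique(X, axis=0, return_counts=True)
        if counts.max() >= n_neighbors:
            warnings.warn(
                "Duplicated samples occur at least n_neighbors times; "
                "the negative outlier factors may be unreliable."
            )
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    lof.fit(X)
    return lof
```

In "manual" mode the wrapper stays silent and the caller is responsible for deduplicating or adjusting `n_neighbors` before fitting.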
@glemaitre @TinusChen, could I work on this issue? It looks like a great opportunity for me to contribute to this project.
…cated samples

Previously, when the dataset contained values repeated more times than the algorithm's number of neighbors, the outliers were miscalculated: because the distance between the duplicated samples is 0, the local reachability density equals 1e10. This leads to values close to the duplicated values having a very low negative outlier factor (under -1e7), labeling them as outliers. This fix checks whether the minimum negative outlier factor is under -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1, also raising a warning.

Notes: added a `handle_duplicates` variable, which allows developers to handle the duplicate values manually, if desired. Also added a `memory_limit` variable, which developers can likewise change manually, to avoid memory errors on very large datasets.
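The fix described in the notes above might look roughly like the following hedged sketch. The function name `fit_lof_with_duplicate_fallback` is illustrative only, and the sketch omits the `memory_limit` handling mentioned in the notes:

```python
import warnings

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def fit_lof_with_duplicate_fallback(X, n_neighbors=20):
    # Hedged sketch of the described fix, not the actual PR code.
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    lof.fit(X)
    # The degenerate case shows up as negative outlier factors below -1e7.
    if lof.negative_outlier_factor_.min() < -1e7:
        # Raise n_neighbors to (occurrences of the most frequent value) + 1.
        _, counts = np.unique(X, axis=0, return_counts=True)
        new_k = int(counts.max()) + 1
        warnings.warn(
            f"Duplicated samples detected; refitting with n_neighbors={new_k}."
        )
        lof = LocalOutlierFactor(n_neighbors=new_k)
        lof.fit(X)
    return lof
```

Note that the refit only works when the duplicate count + 1 is still smaller than the number of samples, since `n_neighbors` must be below `n_samples`.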
This is an investigation from the discussion in #27838.
`LocalOutlierFactor` might be difficult to use when there are duplicate values repeated more than `n_neighbors` times. In this case, the distance to these neighbors is 0, meaning that the local reachability density is therefore infinite (or, in the algorithm, 1 / 1e-10). The issue starts for samples next to those local density peaks: they might use the 1 / 1e-10 as a measure, meaning that they will have a really negative `negative_outlier_factor_` while the value of the sample could be really close to the one of the plateau. I will now provide a minimal synthetic example to show the issue:

In the results above, we see that the first and last values should not be considered outliers, but because their neighbors come from the constant part (i.e. the plateau at 0.1), the local reachability density is 1e10 and thus the negative outlier factor is set to -1e7.
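The code of the referenced example was not preserved in this extraction; a hedged reconstruction of what such a snippet could look like (the exact values in the original may differ) is:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A plateau of 30 identical samples at 0.1, flanked by two nearby points.
X = np.r_[[0.09], np.full(30, 0.1), [0.11]].reshape(-1, 1)

lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X)

# The flanking points get a huge negative factor (around -1e8) because
# their neighbors' local reachability density is capped at 1e10.
print(lof.negative_outlier_factor_[[0, -1]])
print(labels[[0, -1]])  # both flagged as outliers (-1)
```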
Running the same code without the constant part gives results that are more in line with what one would expect.
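The corresponding output was also stripped from this page; as a hedged sketch with illustrative values, fitting the same estimator on data without the constant plateau keeps all the factors close to -1:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Same setup but with distinct values: no zero distances, so the local
# reachability densities stay finite and no factor degenerates to -1e7.
X = np.linspace(0.0, 1.0, 32).reshape(-1, 1)

lof = LocalOutlierFactor(n_neighbors=5)
lof.fit(X)
print(lof.negative_outlier_factor_.min())  # no extreme -1e7 values
```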
Now, my question would be whether we can have a strategy to limit such a corner case, which is currently ill-defined algorithmically. A potential solution is to detect such extreme values and raise a warning mentioning that there are duplicates and that increasing `n_neighbors` could alleviate the problem.

ping @ngoix @albertcthomas @agramfort if you have any input.