Simple outlier removal transformers #28874
Closed
Aukevanoost
started this conversation in
Ideas
Replies: 1 comment 1 reply
We don't have anything that changes the number of samples; it brings a lot of complications. But we've had plenty of discussions about it. Maybe the most relevant is: scikit-learn/enhancement_proposals#15
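A toy example of the complication the reply refers to: a transformer that drops rows breaks the standard `Pipeline` contract, because `y` is not filtered along with `X`, so the next step sees mismatched sample counts. (The `DropFirstRow` class below is a hypothetical illustration, not part of scikit-learn.)

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class DropFirstRow(BaseEstimator, TransformerMixin):
    """Toy transformer that changes n_samples (drops the first row of X)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X)[1:]


X = np.arange(10.0).reshape(-1, 1)
y = np.arange(10.0)

pipe = Pipeline([("drop", DropFirstRow()), ("reg", LinearRegression())])
try:
    # Fails: the transformed X has 9 rows, but y still has 10.
    pipe.fit(X, y)
except ValueError as err:
    print("Pipeline.fit failed:", err)
```

This is why sample-changing steps need their own API (what the linked enhancement proposal discusses), rather than the existing `transform` contract.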
TL;DR: Idea to add simple outlier-removal transformers
To whom it may concern,
I am relatively new to the whole ML scene, and I have found that a lot of transformers are very complex and not very customizable. I was therefore wondering whether it makes sense to add some very simple outlier-removal transformers, as in the code snippet below. It is a rough sketch to give a feeling of how they might work; the code could also be changed to a simpler, but less flexible, approach.
This way, more domain-focused and fine-grained filtering can be applied to columns without having to write a lot of boilerplate.
Below is some pseudocode showing how these OutlierRemovers could be implemented (warning: Python is not my native tongue).
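A minimal sketch of what such a transformer might look like, following scikit-learn's `BaseEstimator`/`TransformerMixin` convention. The `QuantileOutlierRemover` name and its parameters are hypothetical; note that its `transform` returns fewer rows than it received, which is exactly what the standard `Pipeline` contract does not allow.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class QuantileOutlierRemover(BaseEstimator, TransformerMixin):
    """Drop rows whose value in `column` falls outside the fitted
    [lower, upper] quantile bounds. Hypothetical sketch, not part
    of scikit-learn."""

    def __init__(self, column=0, lower=0.05, upper=0.95):
        self.column = column
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn the cut-off values from the training data.
        col = np.asarray(X)[:, self.column]
        self.lower_bound_ = np.quantile(col, self.lower)
        self.upper_bound_ = np.quantile(col, self.upper)
        return self

    def transform(self, X):
        # Keep only rows within the fitted bounds -- this changes n_samples.
        X = np.asarray(X)
        col = X[:, self.column]
        mask = (col >= self.lower_bound_) & (col <= self.upper_bound_)
        return X[mask]


X = np.array([[1.0], [2.0], [3.0], [100.0]])
remover = QuantileOutlierRemover(column=0, lower=0.0, upper=0.75)
print(remover.fit(X).transform(X))  # the 100.0 row is dropped
```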
I'd like to know if (and why) this hasn't been implemented yet, or if I am misunderstanding how pipelines and transformers work.
Any thoughts are welcome.
Cheers