Provide examples on how to customize the scikit-learn classes #28828

miguelcsilva · 2024-04-13T14:21:24Z

Describe the issue linked to the documentation

Recently I had to implement a custom CV Splitter for a project I'm working on. My first instinct was to look in the documentation to see if there were any examples of how this could be done. I could not find anything concrete, but before long I found the Glossary of Common Terms and API Elements. Although not exactly what I hoped to find, it does have a section on CV Splitters. From there I can read that they are expected to have split and get_n_splits methods, and by following some other links in the docs I can find what arguments they take and what they should return.
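
For reference, the contract described there can be observed on a built-in splitter. The following is a minimal sketch using KFold; the toy array is made up purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data, for illustration only: 6 samples, 2 features.
X = np.arange(12).reshape(6, 2)

cv = KFold(n_splits=3)

# get_n_splits returns how many (train, test) pairs split will yield.
n_splits = cv.get_n_splits(X)

# split is a generator of (train_indices, test_indices) integer arrays.
folds = [(train.tolist(), test.tolist()) for train, test in cv.split(X)]
for train, test in folds:
    print("train:", train, "test:", test)
```

A custom splitter only needs to provide these two methods with compatible signatures.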

Although all the information is in fact there, I believe less experienced users may find it harder to piece everything together. I was wondering whether it would be beneficial for all users to have a section in the documentation with examples of how to customize the scikit-learn classes to suit their needs. After all, I understand the library was developed with an API in mind that allows for exactly this flexibility and customization.

I know this is not a small task, and it may add a non-trivial maintenance burden for the team, but I would like to understand how the maintainers would feel about a space in the documentation for these customization examples. Of course, as the person suggesting it, I would be happy to contribute to this.

Suggest a potential alternative/fix

One way I could see this taking shape is a dedicated page in the documentation where examples of customized classes are demonstrated. I think it's also important to show how the customized class would be used as part of a larger pipeline, and to let the user copy and paste the code into their working environment.
I'll leave an example of a custom CV Splitter below for discussion. The idea would then be to expand to the most commonly used classes.

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

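class CustomSplitter:
    """K-fold-style splitter that yields contiguous, unshuffled folds."""

    def __init__(self, n_folds=5) -> None:
        self.n_folds = n_folds

    def split(self, X, y=None, groups=None):
        # y and groups are accepted for API compatibility; y is only checked.
        if y is not None and X.shape[0] != y.shape[0]:
            raise ValueError("X and y must have the same number of samples")
        idxs = np.arange(X.shape[0])
        folds = np.array_split(idxs, self.get_n_splits())
        for test_fold, test_idxs in enumerate(folds):
            # Train on every fold except the one held out for testing.
            train_idxs = np.concatenate(
                [fold for fold_idx, fold in enumerate(folds) if fold_idx != test_fold]
            )
            yield train_idxs, test_idxs

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_folds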

clf = LogisticRegression(random_state=42)
scores = cross_val_score(clf, X, y, cv=CustomSplitter(n_folds=5))
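
To illustrate the "larger pipeline" point, here is a rough sketch of passing a duck-typed splitter to GridSearchCV around a Pipeline. The class name ContiguousFoldSplitter, the scaling step, and the parameter grid are assumptions made for this example only; the data is shuffled once because iris is ordered by class.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle


class ContiguousFoldSplitter:
    """Hypothetical splitter: contiguous, unshuffled folds."""

    def __init__(self, n_folds=5):
        self.n_folds = n_folds

    def split(self, X, y=None, groups=None):
        folds = np.array_split(np.arange(len(X)), self.n_folds)
        for i, test_idxs in enumerate(folds):
            train_idxs = np.concatenate([f for j, f in enumerate(folds) if j != i])
            yield train_idxs, test_idxs

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_folds


X, y = datasets.load_iris(return_X_y=True)
# iris is ordered by class, so shuffle once; otherwise each contiguous
# test fold would contain only one or two of the three classes.
X, y = shuffle(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=ContiguousFoldSplitter(n_folds=5),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same splitter object can be passed as cv to cross_val_score or cross_validate.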

adrinjalali · 2024-04-15T07:07:40Z

This can certainly be an example (under our examples folder, with the right links from our user guides).

plon-Susk7 · 2024-04-17T16:43:31Z

Hey @adrinjalali, I'd love to work on this issue. I am new to contributing to this repo. Any sort of advice/help will be greatly appreciated :)

miguelcsilva · 2024-04-17T17:10:38Z

This can certainly be an example (under our examples folder, with the right links from our user guides).

Great. Would it be worth it to compile first a list of classes for which it would be useful to provide such customization examples? Or should we just start with this one and take it from there?

adrinjalali · 2024-04-18T07:51:05Z

@plon-Susk7 this is more of an advanced issue. Working on easier issues for a while would probably be more fruitful. You can also look at the list of stalled and help wanted pull requests, since a lot of them need somebody to pick them up and continue the work.

miguelcsilva · 2024-04-25T06:56:55Z

This can certainly be an example (under our examples folder, with the right links from our user guides).

Great. Would it be worth it to compile first a list of classes for which it would be useful to provide such customization examples? Or should we just start with this one and take it from there?

@adrinjalali pinging in case you missed it. I'll have some more free time from Monday onwards, so could start working on this.

adrinjalali · 2024-04-29T12:24:32Z

@miguelcsilva starting with an example for the splitters would be a good start.

UriaMorP · 2024-05-08T18:18:17Z

Hey @miguelcsilva, please consider the examples here.
It extends sklearn's base classes (and function transformers) in order to apply sklearn Pipelines to models that take higher-order tensors as input (where len(data.shape) >= 3).

A concrete example from this page is the Patch class, which is designed to stitch together two sklearn pipelines (a pre-processing pipeline and a supervised learning pipeline) with a "special" step for working with higher-order tensor data.

Given a pandas dataframe with dimensions (m, p*n) (with a multi-level index specifying its higher-order structure), Patch will:

fit the parameters of pipeline1 with respect to the input,
apply a table2tensor operation, which results in a higher-order tensor (order 3 in this case),
fit the parameters of a TCAM model, which transforms tensors of shape (1, p, n) into feature vectors of shape (q,),
fit the parameters of the subsequent pipeline2 with respect to the (m, q) table generated in the previous step.

The .transform() method is implemented in basically the same way:

apply pipeline1.transform to the input (shape (r, p*n)),
apply table2tensor to get an array of shape (r, p, n),
apply TCAM, which results in a table of shape (r, q),
apply pipeline2.transform.

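import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# table2tensor comes from the package used in the linked examples;
# its import is not shown in this snippet.


class Patch(BaseEstimator, TransformerMixin):
    def __init__(self, pipeline1, tcam_obj):
        self.pipeline1 = pipeline1
        self.tcam_obj = tcam_obj

    def _transform_y(self, y, map1):
        # Align the labels with the row order of the tensor produced by table2tensor.
        yt = pd.Series(map1).rename('row').to_frame()
        yt = yt.merge(y.reset_index(), left_index=True, right_on="SubjectID").drop_duplicates()
        yt.index = yt['row']
        yt = yt['label'].sort_index()
        return yt

    def fit(self, X, y=None):
        # Fit the pre-processing pipeline, then convert its output to a tensor.
        self.pipeline1 = self.pipeline1.fit(X, y=y)
        tensor, map1, map3 = table2tensor(self.pipeline1.transform(X))

        if y is not None:
            yt = self._transform_y(y, map1)
        else:
            yt = None

        # Fit the TCAM model on the tensor.
        self.tcam_obj = self.tcam_obj.fit(tensor, y=yt)
        return self

    def transform(self, X):
        # Same chain as fit, using the already fitted steps.
        Xt1 = self.pipeline1.transform(X)
        tensor, map1, map3 = table2tensor(Xt1)
        Xt2 = self.tcam_obj.transform(tensor)
        return Xt2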

I hope you will find this example interesting.
Regardless, please let me know if there's anything I can do to assist.
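
For readers without the tensor package at hand, the same fit/transform pattern can be sketched with a self-contained stand-in. Everything here is an assumption for illustration: TensorMeanFeatures is a hypothetical class, and a plain reshape plus a mean over the last axis replaces table2tensor and TCAM, so it only demonstrates the BaseEstimator/TransformerMixin conventions, not the actual method.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class TensorMeanFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the table2tensor + TCAM steps.

    Reshapes an (m, p*n) table into an (m, p, n) tensor and averages over
    the last axis, returning an (m, p) feature table.
    """

    def __init__(self, n_inner=2):
        # Store constructor arguments unmodified so get_params/clone work.
        self.n_inner = n_inner

    def fit(self, X, y=None):
        X = np.asarray(X)
        if X.shape[1] % self.n_inner != 0:
            raise ValueError("number of columns must be a multiple of n_inner")
        # Learned state gets a trailing underscore, per scikit-learn convention.
        self.n_outer_ = X.shape[1] // self.n_inner
        return self

    def transform(self, X):
        X = np.asarray(X)
        tensor = X.reshape(X.shape[0], self.n_outer_, self.n_inner)
        return tensor.mean(axis=2)


rng = np.random.default_rng(0)
X = rng.normal(size=(10, 6))  # m=10 rows, p*n = 3*2 columns

# Pre-processing step, tensor step, then a downstream step on the (m, p) table.
pipe = make_pipeline(StandardScaler(), TensorMeanFeatures(n_inner=2), StandardScaler())
Xt = pipe.fit_transform(X)
print(Xt.shape)
```

Because the tensor step follows the transformer conventions, it can sit between ordinary pipeline steps without a wrapper class.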

miguelcsilva added Documentation Needs Triage Issue requires triage labels Apr 13, 2024

adrinjalali added Moderate Anything that requires some knowledge of conventions and best practices help wanted and removed Needs Triage Issue requires triage labels Apr 15, 2024
