Provide examples on how to customize the scikit-learn classes #28828

Open
miguelcsilva opened this issue Apr 13, 2024 · 7 comments
Labels: Documentation, help wanted, Moderate (anything that requires some knowledge of conventions and best practices)

Comments

@miguelcsilva
Contributor

Describe the issue linked to the documentation

Recently I had to implement a custom CV splitter for a project I'm working on. My first instinct was to look in the documentation for examples of how this could be done. I could not find anything concrete, but before long I found the Glossary of Common Terms and API Elements. Although not exactly what I hoped to find, it does have a section on CV splitters. From there I learned that they are expected to have split and get_n_splits methods, and by following some other links in the docs I could find what arguments these take and what they should return.
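
For readers piecing this together, the splitter contract can be observed on one of scikit-learn's built-in splitters. The following is a minimal sketch using KFold; the toy arrays are assumptions made up for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (illustrative assumption): 6 samples, 2 features.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])

cv = KFold(n_splits=3)

# get_n_splits(X, y, groups) returns the number of (train, test) pairs.
n_splits = cv.get_n_splits(X, y)

# split(X, y, groups) yields one (train_indices, test_indices) pair per fold;
# each element is an integer array indexing the rows of X.
folds = list(cv.split(X, y))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Any object exposing these two methods with compatible behaviour can be passed as the cv argument of utilities such as cross_val_score.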

Although all the information is in fact there, I believe less experienced users may find it harder to piece everything together, and I was wondering whether it would be beneficial for all users to have a section in the documentation with examples of how to customize the scikit-learn classes to suit their needs. After all, I understand the library was developed with an API in mind that allows for exactly this flexibility and customization.

I know this is not a small task, and it may add a non-trivial maintenance burden for the team, but I would like to understand how the maintainers would feel about a space in the documentation for these customization examples. Of course, as the person suggesting it, I would be happy to contribute to this.

Suggest a potential alternative/fix

One way I could see this taking shape is a dedicated page in the documentation where examples of customized classes are demonstrated. I think it's also important to show how the customized class would be used as part of a larger pipeline, and to let the user copy and paste the code into their working environment.
I'll leave an example of a custom CV splitter below for discussion. The idea would then be to expand to the most commonly used classes.

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

class CustomSplitter:
    """Contiguous, unshuffled K-fold splitter."""

    def __init__(self, n_folds=5) -> None:
        self.n_folds = n_folds

    def split(self, X, y=None, groups=None):
        # Partition the sample indices into n_folds contiguous chunks.
        idxs = np.arange(X.shape[0])
        folds = np.array_split(idxs, self.get_n_splits())
        for test_pos, test_idxs in enumerate(folds):
            # Every fold except the current one forms the training set.
            train_idxs = np.concatenate(
                [fold for pos, fold in enumerate(folds) if pos != test_pos]
            )
            yield train_idxs, test_idxs

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_folds

clf = LogisticRegression(random_state=42)
scores = cross_val_score(clf, X, y, cv=CustomSplitter(n_folds=5))
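
As a sketch of the "larger pipeline" usage mentioned above, a splitter with the same interface can be handed to GridSearchCV around a Pipeline. The class name ContiguousSplitter, the parameter grid, and the shuffling step are assumptions chosen for illustration, not something proposed in the issue.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle


class ContiguousSplitter:
    """Hypothetical splitter for illustration: contiguous, unshuffled folds."""

    def __init__(self, n_folds=5):
        self.n_folds = n_folds

    def split(self, X, y=None, groups=None):
        # Cut the row indices into n_folds consecutive chunks.
        folds = np.array_split(np.arange(len(X)), self.n_folds)
        for i, test_idx in enumerate(folds):
            # Train on every chunk except the i-th one.
            train_idx = np.concatenate(folds[:i] + folds[i + 1:])
            yield train_idx, test_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_folds


X, y = load_iris(return_X_y=True)
# The iris rows are ordered by class, so shuffle before using contiguous folds.
X, y = shuffle(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=ContiguousSplitter(n_folds=5),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The search object only relies on split and get_n_splits, so the splitter does not need to inherit from any scikit-learn base class for this usage.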
@miguelcsilva added the Documentation and Needs Triage labels on Apr 13, 2024
@adrinjalali
Member

This can certainly be an example (under our examples folder), with the right links from our user guides.

@adrinjalali added the Moderate and help wanted labels and removed the Needs Triage label on Apr 15, 2024
@plon-Susk7
Contributor

Hey @adrinjalali, I'd love to work on this issue. I am new to contributing to this repo. Any sort of advice/help will be greatly appreciated :)

@miguelcsilva
Contributor Author

> This can certainly be an example (under our examples folder), with the right links from our user guides.

Great. Would it be worth first compiling a list of classes for which such customization examples would be useful? Or should we just start with this one and take it from there?

@adrinjalali
Member

@plon-Susk7 this is more of an advanced issue. Working on easier issues for a while would probably be more fruitful. You can also look at the list of stalled and help wanted pull requests, since a lot of them need somebody to pick them up and continue the work.

@miguelcsilva
Contributor Author

> > This can certainly be an example (under our examples folder), with the right links from our user guides.
>
> Great. Would it be worth first compiling a list of classes for which such customization examples would be useful? Or should we just start with this one and take it from there?

@adrinjalali pinging in case you missed it. I'll have some more free time from Monday onwards, so could start working on this.

@adrinjalali
Member

@miguelcsilva an example for the splitters would be a good place to start.

@UriaMorP

UriaMorP commented May 8, 2024

Hey @miguelcsilva, please consider the examples here.
They extend sklearn's base classes (and function transformers) in order to apply sklearn Pipelines to models that take higher-order tensors as input (where len(data.shape) >= 3).

A concrete example from this page is the Patch class, which is designed to stitch together two sklearn pipelines (a pre-processing pipeline and a supervised learning pipeline) with a "special" step for working with higher-order tensor data.

Given a pandas DataFrame of shape (m, p*n) (with a multi-level index specifying its higher-order structure), Patch will

  1. fit the parameters for pipeline1 with respect to the input,
  2. apply a table2tensor operation which results in a higher order tensor (order 3 in this case)
  3. fit the parameters of a TCAM model which transforms (1, p, n) shape tensors to feature vectors with shape (q,)
  4. fit the parameters in subsequent pipeline2 with respect to the (m, q) table generated in the previous step.

The .transform() method follows essentially the same steps:

  1. apply pipeline1.transform to the input (shape (r, p*n))
  2. apply table2tensor to get an array of shape (r, p, n)
  3. apply TCAM, which results in a table of shape (r, q)
  4. apply pipeline2.transform
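
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# table2tensor is the helper used in the linked examples; its import is
# not shown in this comment and is assumed to be available.


class Patch(BaseEstimator, TransformerMixin):
    def __init__(self, pipeline1, tcam_obj):
        self.pipeline1 = pipeline1
        self.tcam_obj = tcam_obj

    def _transform_y(self, y, map1):
        # Reorder the labels to match the tensor rows: map1 maps each
        # SubjectID to its row position in the tensor.
        yt = pd.Series(map1).rename('row').to_frame()
        yt = yt.merge(y.reset_index(), left_index=True, right_on="SubjectID").drop_duplicates()
        yt.index = yt['row']
        yt = yt['label'].sort_index()
        return yt

    def fit(self, X, y=None):
        # Step 1: fit the pre-processing pipeline on the flat table.
        self.pipeline1 = self.pipeline1.fit(X, y=y)
        # Step 2: convert the pre-processed table into a higher-order tensor.
        tensor, map1, map3 = table2tensor(self.pipeline1.transform(X))

        if y is not None:
            yt = self._transform_y(y, map1)
        else:
            yt = None

        # Step 3: fit the TCAM model on the tensor.
        self.tcam_obj = self.tcam_obj.fit(tensor, y=yt)
        return self

    def transform(self, X):
        # Same flow as fit, using the already fitted components.
        Xt1 = self.pipeline1.transform(X)
        tensor, map1, map3 = table2tensor(Xt1)
        Xt2 = self.tcam_obj.transform(tensor)
        return Xt2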
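
For readers without the tensor package at hand, the same fit/transform flow can be sketched with plain NumPy and scikit-learn. Everything here is a hypothetical stand-in: TwoStage, the reshape to (m, p, n), and the mean over the last axis take the place of Patch, table2tensor and TCAM, and are not their actual behaviour. The sketch also clones the sub-estimators and stores the fitted copies in trailing-underscore attributes, which is the usual scikit-learn convention rather than what the class above does.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


class TwoStage(BaseEstimator, TransformerMixin):
    """Hypothetical composite: fit `first`, build a tensor, fit `second`."""

    def __init__(self, first, second, p, n):
        self.first = first
        self.second = second
        self.p = p
        self.n = n

    def _to_features(self, X2d):
        # Stand-in for table2tensor + TCAM: (m, p*n) -> (m, p, n) -> (m, p).
        tensor = np.asarray(X2d).reshape(-1, self.p, self.n)
        return tensor.mean(axis=2)

    def fit(self, X, y=None):
        # Fit clones so the constructor arguments stay untouched.
        self.first_ = clone(self.first).fit(X, y)
        feats = self._to_features(self.first_.transform(X))
        self.second_ = clone(self.second).fit(feats, y)
        return self

    def transform(self, X):
        feats = self._to_features(self.first_.transform(X))
        return self.second_.transform(feats)


rng = np.random.default_rng(0)
X = rng.normal(size=(10, 12))  # m=10 rows, p*n = 4*3 columns

model = TwoStage(StandardScaler(), PCA(n_components=2), p=4, n=3)
Xt = model.fit_transform(X)
print(Xt.shape)
```

Unlike this sketch, the Patch class also has to realign y with the tensor rows, which is one reason a plain Pipeline does not cover that use case.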

I hope you will find this example interesting.
Regardless, please let me know if there's anything I can do to assist.
