[MRG] ENH: Make StackingRegressor support Multioutput #27704

Open
wants to merge 14 commits into
base: main

Contributor

@hmasdev hmasdev commented Nov 2, 2023

Reference Issues/PRs

Related to #25597
Similar to #8547
Similar to #19223

What does this implement/fix? Explain your changes.

  • Added multioutput support to StackingRegressor.
  • Added tests for the above changes.
  • Updated the StackingRegressor docstring.

Any other comments?

I have the following concern:

  • Do we need any other tests?


github-actions bot commented Nov 2, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: bcf4e0e. Link to the linter CI: here

@adrinjalali
Member

@OmarManzoor would you maybe have time to have a look here?

Contributor

@OmarManzoor OmarManzoor left a comment

Thank you for the PR @hmasdev. A few comments. I think we will also need a changelog entry for this.

sklearn/ensemble/_stacking.py (outdated, resolved)
sklearn/ensemble/_stacking.py (resolved)
Contributor

@OmarManzoor OmarManzoor left a comment

Thank you for the changes. Here are a few more suggestions. Also, I think we are still not handling the case where an estimator/regressor that does not support multioutput is specified. Or do we not need to worry about such a case?

Comment on lines 888 to 892
# NOTE: In this case the estimator can predict almost exactly the target
assert_allclose(
y_pred,
# NOTE: when the target is 2D but with a single output,
# the predictions are 1D because of column_or_1d

Suggested change
- # NOTE: In this case the estimator can predict almost exactly the target
- assert_allclose(
- y_pred,
- # NOTE: when the target is 2D but with a single output,
- # the predictions are 1D because of column_or_1d
+ # NOTE: In this case the estimator can predict almost exactly the target.
+ # When the target is 2D but with a single output the predictions are 1D
+ # because of column_or_1d
+ assert_allclose(
+ y_pred,

rtol=acceptable_relative_tolerance,
atol=acceptable_aboslute_tolerance,
)
# transform

Suggested change
- # transform

)

reg.fit(X_train, y_train)
# predict

Suggested change
- # predict



def test_stacking_regressor_multioutput_with_passthrough():
"""Check that a stacking regressor with multioutput works"""

Suggested change
- """Check that a stacking regressor with multioutput works"""
+ """Check that a stacking regressor with passthrough works with multioutput"""

rtol=acceptable_relative_tolerance,
atol=acceptable_aboslute_tolerance,
)
# transform

Suggested change
- # transform



def test_stacking_regressor_multioutput():
"""Check that a stacking regressor with multioutput works"""

Suggested change
- """Check that a stacking regressor with multioutput works"""
+ """Check that a stacking regressor works with multioutput"""

@hmasdev
Contributor Author

hmasdev commented May 8, 2024

@OmarManzoor
Thank you for the additional comments. I applied your suggestions to the test code.

Also, I think we are still not handling the case where an estimator/regressor that does not support multioutput is specified. Or do we not need to worry about such a case?

Actually, I don't yet have a good idea of how to handle an estimator that does not support multiple outputs when it is used in a multi-output problem.
In the current implementation, if such an estimator is used in a multi-output problem, it raises a ValueError, as shown below.
If there were an API to determine whether an estimator supports multi-output, it would be possible to handle this case in StackingRegressor.fit.

Do you know of such an API?

Python 3.10.13 (main, Feb 22 2024, 10:50:12) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from sklearn.ensemble import StackingRegressor
>>> from sklearn.svm import SVR
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> svr = SVR()
>>> model = StackingRegressor(estimators=[('lr', lr), ('svr', svr)])
>>> import numpy as np
>>> X = np.random.randn(10, 2)
>>> Y = X ** 2
>>> model.fit(X, Y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/workspace/scikit-learn/sklearn/ensemble/_stacking.py", line 973, in fit
    return super().fit(X, y, sample_weight)
  File "/root/workspace/scikit-learn/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/root/workspace/scikit-learn/sklearn/ensemble/_stacking.py", line 224, in fit
    self.estimators_ = Parallel(n_jobs=self.n_jobs)(
  File "/root/workspace/scikit-learn/sklearn/utils/parallel.py", line 67, in __call__
    return super().__call__(iterable_with_config)
  File "/root/workspace/scikit-learn/sklearn-env/lib/python3.10/site-packages/joblib/parallel.py", line 1918, in __call__
    return output if self.return_generator else list(output)
  File "/root/workspace/scikit-learn/sklearn-env/lib/python3.10/site-packages/joblib/parallel.py", line 1847, in _get_sequential_output
    res = func(*args, **kwargs)
  File "/root/workspace/scikit-learn/sklearn/utils/parallel.py", line 129, in __call__
    return self.function(*args, **kwargs)
  File "/root/workspace/scikit-learn/sklearn/ensemble/_base.py", line 40, in _fit_single_estimator
    estimator.fit(X, y, **fit_params)
  File "/root/workspace/scikit-learn/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/root/workspace/scikit-learn/sklearn/svm/_base.py", line 190, in fit
    X, y = self._validate_data(
  File "/root/workspace/scikit-learn/sklearn/base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/root/workspace/scikit-learn/sklearn/utils/validation.py", line 1282, in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
  File "/root/workspace/scikit-learn/sklearn/utils/validation.py", line 1303, in _check_y
    y = column_or_1d(y, warn=True)
  File "/root/workspace/scikit-learn/sklearn/utils/validation.py", line 1370, in column_or_1d
    raise ValueError(
ValueError: y should be a 1d array, got an array of shape (10, 2) instead.
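
One possible way to cope with single-output base estimators, sketched here under assumptions and not something this PR implements: probe each estimator on a tiny 2D target and wrap it in MultiOutputRegressor if the fit fails. The helpers `supports_multioutput` and `as_multioutput` are hypothetical names introduced for illustration, not scikit-learn APIs, and catching ValueError is only a heuristic, since an estimator may raise it for unrelated reasons.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR


def supports_multioutput(estimator, n_samples=6, n_features=2, n_outputs=2):
    """Hypothetical helper: probe multi-output support by fitting a clone.

    Not a scikit-learn API. It fits an unfitted copy on a tiny random 2D
    target and reports whether ``fit`` raised ValueError.
    """
    rng = np.random.RandomState(0)
    X = rng.randn(n_samples, n_features)
    Y = rng.randn(n_samples, n_outputs)
    try:
        clone(estimator).fit(X, Y)
    except ValueError:
        return False
    return True


def as_multioutput(estimator):
    """Hypothetical helper: wrap the estimator only when the probe fails."""
    if supports_multioutput(estimator):
        return estimator
    # MultiOutputRegressor fits one copy of the estimator per target column.
    return MultiOutputRegressor(estimator)


rng = np.random.RandomState(42)
X = rng.randn(10, 2)
Y = X ** 2

svr = as_multioutput(SVR())             # SVR rejects 2D y, so it gets wrapped
lr = as_multioutput(LinearRegression()) # LinearRegression is left as-is

svr.fit(X, Y)
print(type(svr).__name__, svr.predict(X).shape)
print(type(lr).__name__)
```

The wrapped SVR trains one independent model per output column, so it ignores any correlation between targets; whether StackingRegressor should apply such wrapping automatically or simply let the base estimator's error propagate is a design question for the reviewers.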

@hmasdev
Contributor Author

hmasdev commented May 8, 2024

Note that StackingClassifier already supports multilabel classification but not multiclass-multioutput classification. I think the latter is out of scope for this PR.

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.ensemble import StackingClassifier
>>> from sklearn.multioutput import MultiOutputClassifier
>>> X = np.random.randn(10, 2)
>>> Y = X > 0  # multilabel classification
>>> model = StackingClassifier(estimators=[('lr', MultiOutputClassifier(LogisticRegression(C=1e3))), ('lr2', MultiOutputClassifier(LogisticRegression(C=1e3)))], final_estimator=MultiOutputClassifier(LogisticRegression(C=1e3)))
>>> model.fit(X, Y)
StackingClassifier(estimators=[('lr',
                                MultiOutputClassifier(estimator=LogisticRegression(C=1000.0))),
                               ('lr2',
                                MultiOutputClassifier(estimator=LogisticRegression(C=1000.0)))],
                   final_estimator=MultiOutputClassifier(estimator=LogisticRegression(C=1000.0)))
>>> model.predict(X)[:3]
array([[ True,  True],
       [False,  True],
       [False, False]])
>>> model.predict_proba(X)[:3]
array([[1.55247883e-03, 7.05602027e-04],
       [9.99983741e-01, 2.31839536e-03],
       [9.99873471e-01, 9.99473360e-01]])
>>> Z = np.random.choice(range(3), size=X.shape)  # multiclass-multioutput classification
>>> model.fit(X, Z)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/workspace/scikit-learn/sklearn/ensemble/_stacking.py", line 669, in fit
    self._label_encoder = LabelEncoder().fit(y)
  File "/root/workspace/scikit-learn/sklearn/preprocessing/_label.py", line 97, in fit
    y = column_or_1d(y, warn=True)
  File "/root/workspace/scikit-learn/sklearn/utils/validation.py", line 1370, in column_or_1d
    raise ValueError(
ValueError: y should be a 1d array, got an array of shape (10, 2) instead.

Ref. https://scikit-learn.org/stable/modules/multiclass.html
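
The difference between the two target types above can be checked with `sklearn.utils.multiclass.type_of_target`. The sketch below uses small hand-written arrays (assumed values, chosen only for illustration) rather than the random data from the session, so the result does not depend on the random draw.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# 2D target with only two distinct values: treated as multilabel.
Y_multilabel = np.array([[1, 0], [0, 1], [1, 1]])

# 2D integer target with three distinct values: multiclass-multioutput.
Z_multiclass_multioutput = np.array([[0, 2], [1, 0], [2, 1]])

# 2D float target, as in the multi-output regression case of this PR.
Y_regression = np.array([[0.5, 1.2], [1.5, 0.3], [2.5, 0.7]])

for name, target in [
    ("Y_multilabel", Y_multilabel),
    ("Z_multiclass_multioutput", Z_multiclass_multioutput),
    ("Y_regression", Y_regression),
]:
    print(name, "->", type_of_target(target))
```

A check of this kind could be used to raise a clearer error for unsupported target types, although that is a suggestion and not part of this PR.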
