
[WIP] Add ReducedRankRegression estimator (Resolves #10796) #28779

Open · wants to merge 3 commits into main

Conversation

@mdmelin commented Apr 6, 2024

I've added a ReducedRankRegression estimator (resolves #10796). It seems to be behaving as expected, as shown below.

(figure: rr_reg)

I did this by extending sklearn.linear_model.Ridge because I thought it'd be nice to optionally apply a ridge penalty too.

Where I'm stuck is on how to handle some tests: I'm currently failing a few of them because reduced rank regression (specifically the underlying SVD operation) doesn't work, and is pointless, when there is only one target to predict. Any guidance on what to do here is appreciated.

Also, if inheriting from an existing estimator is inappropriate, I can implement more of the functionality myself. This is my first contribution attempt, so I'm still getting familiar with the codebase...
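The fitting procedure described above (fit Ridge on all targets, then project the fitted coefficients onto the principal components of the Ridge predictions) can be sketched with public scikit-learn APIs alone. This is an illustrative standalone version, not the PR's ReducedRankRegression class; all names below are local to the example, and the tiny alpha stands in for the "no extra regularization" default.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p, q, rank = 200, 10, 6, 2

# Synthetic targets that genuinely live in a rank-2 coefficient subspace.
A = rng.standard_normal((p, rank))
B = rng.standard_normal((rank, q))
X = rng.standard_normal((n, p))
Y = X @ A @ B + 0.1 * rng.standard_normal((n, q))

# Step 1: full-rank multi-output fit (lightly ridge-penalized here).
ridge = Ridge(alpha=1e-3).fit(X, Y)
beta_ridge = ridge.coef_.T                    # shape (p, q)

# Step 2: PCA on the fitted predictions finds the best rank-k target subspace.
pca = PCA(n_components=rank).fit(ridge.predict(X))

# Step 3: project the full-rank coefficients into that subspace and back out.
beta_proj = beta_ridge @ pca.components_.T    # (p, rank) encoding matrix
beta_rrr = beta_proj @ pca.components_        # (p, q), with rank <= 2

assert np.linalg.matrix_rank(beta_rrr) <= rank
```

Note that PCA centers the predictions before extracting components, which this sketch (like the PR's PCA-based implementation in the diff below) implicitly relies on.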


github-actions bot commented Apr 6, 2024

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


black

black detected issues. Please run black . locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is black=23.3.0.


--- /home/runner/work/scikit-learn/scikit-learn/sklearn/cross_decomposition/_pls.py	2024-04-06 07:55:40.504228 +0000
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/cross_decomposition/_pls.py	2024-04-06 07:55:51.678800 +0000
@@ -182,10 +182,11 @@
 def _deprecate_Y_when_required(y, Y):
     if y is None and Y is None:
         raise ValueError("y is required.")
     return _deprecate_Y_when_optional(y, Y)
 
+
 class ReducedRankRegression(Ridge):
     """Reduced rank regression.
 
     ReducedRankRegression enforces a low-rank constraint on the beta coefficients. If beta is
     a [p x q] matrix in the case of normal regression, reduced rank regression enforces rank(beta) <= rank
@@ -242,38 +243,52 @@
     >>> rrr.fit(X, y)
     ReducedRankRegression()
     >>> Y_pred = rrr.predict(X)
     """
 
-
     _parameter_constraints: dict = {
-    "rank": [Interval(Integral, 1, None, closed="left")],
-    "alpha": [Interval(Real, 0, None, closed="left"), np.ndarray],
-    #"ridge_params_dict": ["dict"], # TODO: add validation for Ridge parameters dict argument?
+        "rank": [Interval(Integral, 1, None, closed="left")],
+        "alpha": [Interval(Real, 0, None, closed="left"), np.ndarray],
+        # "ridge_params_dict": ["dict"], # TODO: add validation for Ridge parameters dict argument?
     }
 
-    def __init__(self, rank=2, alpha=0, ridge_params_dict=None): # default full rank
+    def __init__(self, rank=2, alpha=0, ridge_params_dict=None):  # default full rank
         if ridge_params_dict is None:
-            ridge_params_dict = dict(alpha=alpha) # default no regularization - equivlent to OLS
+            ridge_params_dict = dict(
+                alpha=alpha
+            )  # default no regularization - equivlent to OLS
         super().__init__(**ridge_params_dict)
         self.ridge_params_dict = ridge_params_dict
         self.rank = rank
 
     def fit(self, X, y, sample_weight=None):
-        assert y.ndim > 1, "There must be more than one target variable to use ReducedRankRegression. If only one target variable is required, use LinearRegression, Ridge, or Lasso instead."
-        assert y.shape[1] > 1, "There must be more than one target variable to use ReducedRankRegression. If only one target variable is required, use LinearRegression, Ridge, or Lasso instead."
+        assert y.ndim > 1, (
+            "There must be more than one target variable to use ReducedRankRegression."
+            " If only one target variable is required, use LinearRegression, Ridge, or"
+            " Lasso instead."
+        )
+        assert y.shape[1] > 1, (
+            "There must be more than one target variable to use ReducedRankRegression."
+            " If only one target variable is required, use LinearRegression, Ridge, or"
+            " Lasso instead."
+        )
         beta_ridge = super().fit(X, y, sample_weight=sample_weight).coef_.T
-        y_hat_ridge = super().predict(X) 
+        y_hat_ridge = super().predict(X)
         pca = PCA(n_components=self.rank)
         pca.fit(y_hat_ridge)
-        beta_proj = beta_ridge @ pca.components_.T # the encoding matrix (projects predictors from full space to space spanned by first n ranks)
-        beta_rrr = beta_proj @ pca.components_ # the reconstituted reduced rank regression matrix (same size as b_ridge)
+        beta_proj = (
+            beta_ridge @ pca.components_.T
+        )  # the encoding matrix (projects predictors from full space to space spanned by first n ranks)
+        beta_rrr = (
+            beta_proj @ pca.components_
+        )  # the reconstituted reduced rank regression matrix (same size as b_ridge)
 
         self.coef_ = beta_rrr.T
         self.encoder_ = beta_proj
         self.decoder_ = pca.components_
         return self
+
 
 class _PLS(
     ClassNamePrefixFeaturesOutMixin,
     TransformerMixin,
     RegressorMixin,
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/cross_decomposition/_pls.py

Oh no! 💥 💔 💥
1 file would be reformatted, 915 files would be left unchanged.

ruff

ruff detected issues. Please run ruff --fix --show-source . locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.3.5.


warning: The `--show-source` argument is deprecated. Use `--output-format=full` instead.
sklearn/cross_decomposition/_pls.py:190:89: E501 Line too long (93 > 88)
    |
188 |     """Reduced rank regression.
189 | 
190 |     ReducedRankRegression enforces a low-rank constraint on the beta coefficients. If beta is
    |                                                                                         ^^^^^ E501
191 |     a [p x q] matrix in the case of normal regression, reduced rank regression enforces rank(beta) <= rank
192 |     where rank < min(p,q). This constraint is based on the assumption that X and Y are related by a smaller
    |

sklearn/cross_decomposition/_pls.py:191:89: E501 Line too long (106 > 88)
    |
190 |     ReducedRankRegression enforces a low-rank constraint on the beta coefficients. If beta is
191 |     a [p x q] matrix in the case of normal regression, reduced rank regression enforces rank(beta) <= rank
    |                                                                                         ^^^^^^^^^^^^^^^^^^ E501
192 |     where rank < min(p,q). This constraint is based on the assumption that X and Y are related by a smaller
193 |     number of latent factors, instead of the full space spanned by the coefficients of normal linear regression.
    |

sklearn/cross_decomposition/_pls.py:192:89: E501 Line too long (107 > 88)
    |
190 |     ReducedRankRegression enforces a low-rank constraint on the beta coefficients. If beta is
191 |     a [p x q] matrix in the case of normal regression, reduced rank regression enforces rank(beta) <= rank
192 |     where rank < min(p,q). This constraint is based on the assumption that X and Y are related by a smaller
    |                                                                                         ^^^^^^^^^^^^^^^^^^^ E501
193 |     number of latent factors, instead of the full space spanned by the coefficients of normal linear regression.
194 |     Reduced rank regression can also act as a form of regularization.
    |

sklearn/cross_decomposition/_pls.py:193:89: E501 Line too long (112 > 88)
    |
191 |     a [p x q] matrix in the case of normal regression, reduced rank regression enforces rank(beta) <= rank
192 |     where rank < min(p,q). This constraint is based on the assumption that X and Y are related by a smaller
193 |     number of latent factors, instead of the full space spanned by the coefficients of normal linear regression.
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^ E501
194 |     Reduced rank regression can also act as a form of regularization.
    |

sklearn/cross_decomposition/_pls.py:196:89: E501 Line too long (114 > 88)
    |
194 |     Reduced rank regression can also act as a form of regularization.
195 | 
196 |     This implementation is built on top of sklearn.linear_model.Ridge. Thus, a ridge penalty can also be specified
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
197 |     if additional regularization is desired.
    |

sklearn/cross_decomposition/_pls.py:205:89: E501 Line too long (104 > 88)
    |
204 |     alpha : float, default=0
205 |         Regularization strength if an additional ridge penalty is desired. Must be a non-negative float.
    |                                                                                         ^^^^^^^^^^^^^^^^ E501
206 | 
207 |     ridge_params_dict : dict, default=None
    |

sklearn/cross_decomposition/_pls.py:208:89: E501 Line too long (117 > 88)
    |
207 |     ridge_params_dict : dict, default=None
208 |         A dictionary of parameters to pass to the Ridge constructor. See sklearn.linear_model.Ridge for more details.
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
209 | 
210 |     Attributes
    |

sklearn/cross_decomposition/_pls.py:251:89: E501 Line too long (94 > 88)
    |
249 |     "rank": [Interval(Integral, 1, None, closed="left")],
250 |     "alpha": [Interval(Real, 0, None, closed="left"), np.ndarray],
251 |     #"ridge_params_dict": ["dict"], # TODO: add validation for Ridge parameters dict argument?
    |                                                                                         ^^^^^^ E501
252 |     }
    |

sklearn/cross_decomposition/_pls.py:256:89: E501 Line too long (96 > 88)
    |
254 |     def __init__(self, rank=2, alpha=0, ridge_params_dict=None): # default full rank
255 |         if ridge_params_dict is None:
256 |             ridge_params_dict = dict(alpha=alpha) # default no regularization - equivlent to OLS
    |                                                                                         ^^^^^^^^ E501
257 |         super().__init__(**ridge_params_dict)
258 |         self.ridge_params_dict = ridge_params_dict
    |

sklearn/cross_decomposition/_pls.py:262:89: E501 Line too long (190 > 88)
    |
261 |     def fit(self, X, y, sample_weight=None):
262 |         assert y.ndim > 1, "There must be more than one target variable to use ReducedRankRegression. If only one target variable is required, use LinearRegression, Ridge, or Lasso instead."
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
263 |         assert y.shape[1] > 1, "There must be more than one target variable to use ReducedRankRegression. If only one target variable is required, use LinearRegression, Ridge, or Lasso instead."
264 |         beta_ridge = super().fit(X, y, sample_weight=sample_weight).coef_.T
    |

sklearn/cross_decomposition/_pls.py:263:89: E501 Line too long (194 > 88)
    |
261 |     def fit(self, X, y, sample_weight=None):
262 |         assert y.ndim > 1, "There must be more than one target variable to use ReducedRankRegression. If only one target variable is required, use LinearRegression, Ridge, or Lasso instead."
263 |         assert y.shape[1] > 1, "There must be more than one target variable to use ReducedRankRegression. If only one target variable is required, use LinearRegression, Ridge, or Lasso instead."
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
264 |         beta_ridge = super().fit(X, y, sample_weight=sample_weight).coef_.T
265 |         y_hat_ridge = super().predict(X) 
    |

sklearn/cross_decomposition/_pls.py:265:41: W291 [*] Trailing whitespace
    |
263 |         assert y.shape[1] > 1, "There must be more than one target variable to use ReducedRankRegression. If only one target variable is required, use LinearRegression, Ridge, or Lasso instead."
264 |         beta_ridge = super().fit(X, y, sample_weight=sample_weight).coef_.T
265 |         y_hat_ridge = super().predict(X) 
    |                                         ^ W291
266 |         pca = PCA(n_components=self.rank)
267 |         pca.fit(y_hat_ridge)
    |
    = help: Remove trailing whitespace

sklearn/cross_decomposition/_pls.py:268:89: E501 Line too long (144 > 88)
    |
266 |         pca = PCA(n_components=self.rank)
267 |         pca.fit(y_hat_ridge)
268 |         beta_proj = beta_ridge @ pca.components_.T # the encoding matrix (projects predictors from full space to space spanned by first n ranks)
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
269 |         beta_rrr = beta_proj @ pca.components_ # the reconstituted reduced rank regression matrix (same size as b_ridge)
    |

sklearn/cross_decomposition/_pls.py:269:89: E501 Line too long (120 > 88)
    |
267 |         pca.fit(y_hat_ridge)
268 |         beta_proj = beta_ridge @ pca.components_.T # the encoding matrix (projects predictors from full space to space spanned by first n ranks)
269 |         beta_rrr = beta_proj @ pca.components_ # the reconstituted reduced rank regression matrix (same size as b_ridge)
    |                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E501
270 | 
271 |         self.coef_ = beta_rrr.T
    |

Found 14 errors.
[*] 1 fixable with the `--fix` option.

Generated for commit: 5c95912. Link to the linter CI: here

@mdmelin changed the title from "[WIP] Add ReducedRankRegression estimator (Fixes #10796)" to "[WIP] Add ReducedRankRegression estimator (Resolves#10796)" on Apr 6, 2024
Successfully merging this pull request may close these issues: Reduced-rank regression