Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Standardize X as inverse_transform input parameter #28756

Merged
merged 21 commits into from Apr 29, 2024

Conversation

wd60622
Copy link
Contributor

@wd60622 wd60622 commented Apr 2, 2024

Reference Issues/PRs

closes #27654
related to #27666

What does this implement/fix? Explain your changes.

This changes all X to Xt in inverse_transform methods. This then makes the API:

  • Transform: Xt = transformer.transform(X)
  • Inverse Transform: X = transformer.inverse_transform(Xt)

Any other comments?

Hi @glemaitre . I am finally getting back to this!

It turns out that X is much more common as a parameter than Xt. I've started with a few in order to get some feedback before rinsing and repeating the changes. Please let me know if you have any feedback on the:

  1. deprecation helper function
  2. Docstring changes
  3. Changes to variable names

One thing I suspect is that tests might have used keyword arguments as well and thus cause warnings. I will switch if that is the case

Copy link

github-actions bot commented Apr 2, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a4e3fe4. Link to the linter CI: here

@betatim
Copy link
Member

betatim commented Apr 3, 2024

The approach looks reasonable to me. It is based on how the deprecation of Y for y is done.

For me the big question is if it is worth the effort to do this. What is the problem we are fixing by doing this (except for solving a naming inconsistency)?

@wd60622
Copy link
Contributor Author

wd60622 commented Apr 3, 2024

The approach looks reasonable to me. It is based on how the deprecation of Y for y is done.

For me the big question is if it is worth the effort to do this. What is the problem we are fixing by doing this (except for solving a naming inconsistency)?

I encountered it while trying to switch a Pipeline out with a StandardScaler and using kwargs.
This was my reproducible example from the issue:

# Has Xt arg
pipeline = make_pipeline(StandardScaler())
Xt = pipeline.fit_transform(X)
X_again = pipeline.inverse_transform(Xt=Xt)

# inverse_transform takes X instead of Xt
transformer = StandardScaler()
Xt = transformer.fit_transform(X)
X_again = transformer.inverse_transform(Xt=Xt)
# TypeError: inverse_transform() got an unexpected keyword argument 'Xt'

Pipelines might be the most used object but I'd think having a uniformed API with the transformers themselves might be good.

What are your thoughts?

@betatim
Copy link
Member

betatim commented Apr 4, 2024

I agree it is a bit annoying that using/not using a Pipeline changes whether your code works or not.

The big question for me is how typical transformer.inverse_transform(Xt) vs transformer.inverse_transform(Xt=Xt) is. My guess would be that most people use the "positional argument" version. So they never run into the problem that the argument is sometimes called X and sometimes Xt. However I don't know how we could resolve this question with some data, instead of people's hunches :-/

Maybe other people/maintainers have an opinion? @scikit-learn/core-devs

@GaelVaroquaux
Copy link
Member

I'm not really in favor of such a change:

@jeremiedbb
Copy link
Member

Here's the list:

CCA                           X
DictVectorizer                X
FastICA                       X
FeatureAgglomeration          Xt
FunctionTransformer           X
GaussianRandomProjection      X
GenericUnivariateSelect       X
GridSearchCV                  Xt
IncrementalPCA                X
KBinsDiscretizer              Xt
KernelPCA                     X
LabelBinarizer                Y
LabelEncoder                  y
MaxAbsScaler                  X
MinMaxScaler                  X
MiniBatchNMF                  Xt
MiniBatchSparsePCA            X
MultiLabelBinarizer           yt
NMF                           Xt
OneHotEncoder                 X
OrdinalEncoder                X
PCA                           X
PLSCanonical                  X
PLSRegression                 X
Pipeline                      Xt
PowerTransformer              X
QuantileTransformer           X
RFE                           X
RFECV                         X
RandomizedSearchCV            Xt
RobustScaler                  X
SelectFdr                     X
SelectFpr                     X
SelectFromModel               X
SelectFwe                     X
SelectKBest                   X
SelectPercentile              X
SequentialFeatureSelector     X
SimpleImputer                 X
SparsePCA                     X
SparseRandomProjection        X
StandardScaler                X
TruncatedSVD                  X
VarianceThreshold             X

It's true that most estimators use X. However it's unfortunate that the main meta-estimators (Pipeline, Search*) use Xt, as explained in #28756 (comment).

I agree that we should not change for cosmetic reasons. But the inconsistency between most estimators and the main meta-estimators gives a bad user experience that we can improve. I don't know however how many users are concerned, i.e. using kwargs for X.

Side note: only the users who pass X as kwarg would actually see the API change and the deprecation warnings and this change aims to improve the experience of these users. It would be transparent for other users.

So overall I'm +0.5 here.

@betatim
Copy link
Member

betatim commented Apr 4, 2024

I didn't count but from scrolling through the list of estimators it seems many of them use X. Maybe, if we do anything, we should standardise on X?

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Apr 4, 2024 via email

@jeremiedbb
Copy link
Member

jeremiedbb commented Apr 4, 2024

mine too at first, but notice that the Xts are in Pipeline, GridSearchCV, RandomizedSearchCV, ..., so changing them might impact as many users as changing the others :/

@thomasjpfan
Copy link
Member

I feel like the impact is minimum because it only impacts users that write inverse_transform(Xt=...). In all our documentation, we always set the input of inverse_transform as a positional argument.

We'll still go through a deprecation cycle and the change is fairly simple for users to make. (Make the input positional or use X=...).

@wd60622
Copy link
Contributor Author

wd60622 commented Apr 4, 2024

Thanks everyone for the feedback!

Thanks for making the list @jeremiedbb. Very helpful!
My sense is that Pipeline, GridSearchCV, RandomizedSearchCV would be the most used and since they use Xt then that should be the way to go.

Personally, X doesn't make much sense to me if
Xt = transform(X)
since then
X = inverse_transform(Xt)

I want to note that the goal is to be backwards compatible. That would mean:

  • No warning with positional arguments. instance.inverse_transform(Xt)
  • Standard deprecation warning if keyword X. instance.inverse_transform(X=...)
  • ValueError if both keywords are used

I will fix all of the examples in docstrings if they use a keyword example. However, I don't think any do at first glance. https://github.com/search?q=repo%3Ascikit-learn%2Fscikit-learn%20%22inverse_transform(X%3D%22&type=code

I didn't have None for Xt which might have caused some confusion. I've added that with the latest commit!

@lorentzenchr
Copy link
Member

+1 for making it consistent. For me, it‘s not a cosmetic change, but a real flaw. The sooner the fix the better.

@wd60622 wd60622 changed the title Standardize Xt as inverse_transform argument Standardize Xt as inverse_transform parameter Apr 4, 2024
@wd60622
Copy link
Contributor Author

wd60622 commented Apr 4, 2024

+1 for making it consistent. For me, it‘s not a cosmetic change, but a real flaw. The sooner the fix the better.

🎉

If others have the same idea, then I will go through with the rest of the implementation!

@lorentzenchr
Copy link
Member

@wd60622 I‘m just one voice and @GaelVaroquaux has a different view than me. So it’s not decided yet.

@lorentzenchr lorentzenchr added the Needs Decision Requires decision label Apr 4, 2024
@chkoar
Copy link
Contributor

chkoar commented Apr 5, 2024

I am not a core dev but regarding the change I am aligned with @GaelVaroquaux.
Also, for me the context has already be provided by the method itself inverse_transform so there is not need for Xt.
For consistency I would prefer everywhere X.
However, I expect that most of us use inverse_transform with positional argument.

Regarding the impact I agree with @thomasjpfan

I feel like the impact is minimum because it only impacts users that write inverse_transform(Xt=...). In all our documentation, we always set the input of inverse_transform as a positional argument.

We'll still go through a deprecation cycle and the change is fairly simple for users to make. (Make the input positional or use X=...).

A public poll regarding the use of positional or keyword arguments usage in inverse_transform would provide a sense of the impact of the change.

@lorentzenchr
Copy link
Member

A public poll

That's expensive and I would dedicate such a poll to much more impactful questions.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Apr 5, 2024 via email

@wd60622
Copy link
Contributor Author

wd60622 commented Apr 6, 2024

Consistency is a good thing.

Totally, agree. Hence the PR 😄

Hence given the summary above showing that X and Xt are used, I can be convinced :) I'm slightly more in favor of converging to X than to Xt.

Is there an issue with the name Xt? Why is X better than Xt given my argument that that name doesn't make much sense?
Even though more transformers use X, I think that should be weighted by the frequency of use. To me, the Pipeline, GridSearchCV, RandomizedSearchCV are the most used. Since they use Xt, the scale leans toward Xt if it is a weighted average.
However, my usage weights might be a wrong assumption. Just my gut off using the package the last few years

@lorentzenchr
Copy link
Member

lorentzenchr commented Apr 6, 2024

I'm slightly more in favor of converging to X than to Xt.

+1. Makes the API simpler.

I guess we have reached a decision.

@lorentzenchr lorentzenchr removed the Needs Decision Requires decision label Apr 6, 2024
@wd60622
Copy link
Contributor Author

wd60622 commented Apr 6, 2024

I'm slightly more in favor of converging to X than to Xt.

+1. Makes the API simpler.

I guess we have reached a decision.

And that is X?

@wd60622
Copy link
Contributor Author

wd60622 commented Apr 6, 2024

X it is!

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @wd60622

@betatim this is ready for a second review :)

def _deprecate_Xt_in_inverse_transform(X, Xt):
"""Helper to deprecate the `Xt` argument in favor of `X` in inverse_transform."""
if X is not None and Xt is not None:
raise ValueError("Cannot use both X and Xt. Use X only.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being late to the party with my comment. I'd use a TypeError here as well. It seems more consistent with the case below where the user forgot to pass any argument. ValueError makes me think that I passed the wrong value, not that I tried to use an argument I shouldn't.

Comment on lines 123 to 130
raise ValueError("Cannot use both X and Xt. Use X only.")

if X is None and Xt is None:
raise TypeError("Missing required positional argument: 'X'.")

if Xt is not None:
warnings.warn(
"Xt was renamed X in version 1.5 and will be removed in 1.7.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit pick: can we use 'X' and 'Xt' everywhere instead of mixing X and 'X'?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for not mixing. However I then prefer bare X because we usually don't put parameter names between quotes in other error/warning messages. Is it ok for you ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in strings, variables should have single quote surrounding? That is, X -> 'X'
Does this go for the docstring changes as well (if outside of the normal parameter definitions)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I defer to Jeremie for what the correct formatting is. It sounds like it should be "Do not pass X and Xt at the same time". Not "Do not pass 'X' and 'Xt' at the same time".

Copy link
Member

@betatim betatim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Just two small cosmetic comments. Good to merge after we resolve them.

Comment on lines 125 to 126
if X is None and Xt is None:
raise TypeError("Missing required positional argument: 'X'.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TypeError makes sense here @betatim ?

Copy link
Member

@betatim betatim Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say so. For me passing an argument or missing a required argument is a TypeError.

@jeremiedbb jeremiedbb enabled auto-merge (squash) April 29, 2024 13:13
auto-merge was automatically disabled April 29, 2024 13:21

Head branch was pushed to by a user without write access

@wd60622
Copy link
Contributor Author

wd60622 commented Apr 29, 2024

I have made the adjustments based on the latest feedback @betatim @jeremiedbb. Hope they're squashed 😆

@jeremiedbb
Copy link
Member

@wd60622 I think that you misunderstood, the decision was not to put them between quotes. Can you please revert your latest changes, or I can do it if you prefer ?

@wd60622
Copy link
Contributor Author

wd60622 commented Apr 29, 2024

@wd60622 I think that you misunderstood, the decision was not to put them between quotes. Can you please revert your latest changes, or I can do it if you prefer ?

Sorry. I did but them between quotes though! Not sure what you mean then

Yeah, you can revert!

@jeremiedbb
Copy link
Member

the decision was not to put them between quotes.

"not to" 😄 . Don't worry, I'll revert.

@jeremiedbb jeremiedbb force-pushed the standardize-inverse-transform branch from c2118a6 to 29d37a0 Compare April 29, 2024 14:08
@jeremiedbb jeremiedbb enabled auto-merge (squash) April 29, 2024 14:12
@jeremiedbb jeremiedbb merged commit 0bdc754 into scikit-learn:main Apr 29, 2024
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

inverse_transform Xt argument consistency
8 participants