
ENH Extend PDP for nominal categorical features #18298

Merged
merged 132 commits into scikit-learn:main on Nov 25, 2022

Conversation

madhuracj
Contributor

@madhuracj commented Aug 30, 2020

Reference Issues/PRs

closes #14969

What does this implement/fix? Explain your changes.

This PR extends PDP for categorical features. Still WIP

Any other comments?

@@ -644,12 +648,18 @@ def test_partial_dependence_dataframe(estimator, preprocessor, features):

@pytest.mark.parametrize(
"features, expected_pd_shape",
[(0, (3, 10)),
Contributor Author

Even though the API spec of partial_dependence() only allows an array-like of {int, str} for the features parameter (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/inspection/_partial_dependence.py#L248), we seem to be allowing int, str, list of str, list of int, and boolean mask.
This unnecessarily complicates the initialization of the is_categorical parameter when it is None, see https://github.com/scikit-learn/scikit-learn/pull/18298/files#diff-0c8623ee957e0256d29ac53830094281R488-R489.
Do we need to keep supporting this undocumented behaviour?
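For reference, a minimal sketch of the accepted forms described above, assuming the then-current behaviour; the toy data and column names are made up for illustration:

```python
import pandas as pd
from sklearn.inspection import partial_dependence
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({"f0": range(10), "f1": [v ** 2 for v in range(10)]})
y = X["f0"] + X["f1"]
est = LinearRegression().fit(X, y)

# Documented: array-like of {int, str}
partial_dependence(est, X, features=["f0"])
partial_dependence(est, X, features=[0, 1])   # pair -> two-way partial dependence
# Currently also accepted, although undocumented:
partial_dependence(est, X, features=0)        # plain int
partial_dependence(est, X, features="f1")     # plain str
# (a boolean mask over the columns is reportedly accepted as well)
```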

Member

The use case was supported before but the documentation is misleading. Basically we should support a sequence of {int, str} or a tuple of {int, str}:

https://github.com/scikit-learn/scikit-learn/pull/12599/files#diff-f6ab89354a7e7254ec59bb2e192b0ec2R211-R218

So the test here creates a scalar or a tuple (actually this is a list) which will be added to another list within the test.

Maybe changing the documentation to the following would be less ambiguous: features : array-like of {int, str} or tuple of 2 {int, str}

@glemaitre glemaitre added this to TO REVIEW in Guillaume's pet Aug 31, 2020
@@ -453,6 +501,13 @@ class PartialDependenceDisplay:

.. versionadded:: 0.24

is_categorical : list of (bool,) or list of (bool, bool)
Member

Is it possible to make a 2-way partial dependence plot where only one of the features is categorical, i.e. something like is_categorical=[(True, False)]?

Contributor Author

I believe not. That's what's being checked at https://github.com/scikit-learn/scikit-learn/pull/18298/files#diff-51e11eb022da44742c62c8a5d33c00d4R357-R359. Probably best to specify that in the docs themselves?

Member

So we could accept only a list of bool and raise an error if True is given within a tuple.
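A purely hypothetical sketch of that validation, just to make the suggestion at this point in the discussion concrete (not the actual scikit-learn code; the helper name is made up):

```python
def _check_is_categorical(is_categorical):
    """Accept plain bools; reject tuples that flag any feature as categorical."""
    for entry in is_categorical:
        if isinstance(entry, tuple):
            if any(entry):
                raise ValueError(
                    "Categorical features are not supported in two-way "
                    "partial dependence plots."
                )
        elif not isinstance(entry, bool):
            raise TypeError(
                f"is_categorical must contain bool or tuples of bool, got {entry!r}."
            )

# _check_is_categorical([True, False])    # OK
# _check_is_categorical([(True, False)])  # raises ValueError
```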

Contributor Author

@madhuracj Sep 4, 2020

We can support is_categorical=[(True, True)] with a grid where the colour of each cell denotes the partial dependence for that category combination. But if this is not required, I am fine with going with a list of bool and raising an error if True is given within a tuple.

Member

Actually, we should replace it with categorical_features as well.
It might mean that we have to validate the parameter in the Display instead of doing it in the plot function.
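For context, this is roughly what the merged API ended up looking like: a sketch assuming scikit-learn >= 1.2, where categorical_features is available on PartialDependenceDisplay.from_estimator; the toy data is made up and "season" is a nominal feature stored as integer codes.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

# "season" is a nominal feature encoded as integer codes 0-3.
X = pd.DataFrame({"season": [0, 1, 2, 3] * 5, "temp": list(range(20))})
y = [3, 1, 0, 2] * 5
est = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# categorical_features marks which of the plotted features are nominal; the
# display then draws bar-style partial dependence plots for them instead of
# line plots.
PartialDependenceDisplay.from_estimator(
    est, X, features=["season", "temp"], categorical_features=["season"]
)
```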

@madhuracj
Contributor Author

@glemaitre I wonder how I should go about updating the PDP/ICE example (https://scikit-learn.org/dev/auto_examples/inspection/plot_partial_dependence.html). The dataset currently used there does not have categorical columns, and I don't think it's worth introducing a new dataset.

@glemaitre glemaitre added this to the 1.0 milestone Jan 18, 2021
@glemaitre
Member

@madhuracj I added this to the 1.0 milestone. Let's target it.

@glemaitre I wonder how I should go about updating the PDP/ICE example (https://scikit-learn.org/dev/auto_examples/inspection/plot_partial_dependence.html). The dataset currently used there does not have categorical columns, and I don't think it's worth introducing a new dataset.

Yes, you are right. We might want to add an example specifically for categorical columns. The current example is already quite rich.

Base automatically changed from master to main January 22, 2021 10:53
@madhuracj
Contributor Author

@madhuracj I added this to the 1.0 milestone. Let's target it.

I have merged the upstream main branch, which contains some major refactoring introduced since then. Hopefully, I can find some time in the next couple of days to work on this.

@madhuracj
Contributor Author

@glemaitre With 7c1fe2e I have implemented two-way PDP for categorical features. Please have a look.

Member

@jeremiedbb left a comment

LGTM. @ogrisel, do you have more comments?

@ogrisel
Member

ogrisel commented Nov 25, 2022

I will give it another quick look.

@ogrisel
Member

ogrisel commented Nov 25, 2022

I get the following warning raised by pandas 1.5.1 in the example:

/Users/ogrisel/code/scikit-learn/sklearn/utils/__init__.py:386: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  X.iloc[row_indexer, column_indexer] = values

@ogrisel
Member

ogrisel commented Nov 25, 2022

I also get it with pandas 1.5.2.

@glemaitre
Member

Uhm. And the warning is about a case where we actually do want to make the change in place. Not sure how to silence it without catching it.

@glemaitre
Member

We have already run into this: pandas-dev/pandas#47381

ax=ax,
**common_params,
)
Member

@ogrisel Nov 25, 2022

For some reason, this figure has too narrow a ylim, but @jeremiedbb and I do not understand why. This problem does not happen to the other plots above. Changing constrained_layout does not seem to affect this.

pdp_lim is left to its default (None) value.

@jeremiedbb
Member

The warning comes from trying to do df.iloc[:, i] = value where the i-th column has the "category" dtype and value is a string: pandas expects value to be of object dtype, but since it's just a scalar it's a string.

The behavior won't change in the future in this situation though, so we thought we can safely filter it. I added a comment and a todo to remind us to check when it's possible to remove the filter.
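A hedged sketch of how such a warning can be silenced locally; the filter actually added in the PR may differ, and the DataFrame below is made up:

```python
import warnings

import pandas as pd

df = pd.DataFrame({"c": pd.Categorical(["a", "b", "a"])})

with warnings.catch_warnings():
    # TODO (as in the PR discussion): drop this filter once the supported
    # pandas versions no longer emit the warning for this pattern.
    warnings.filterwarnings(
        "ignore",
        message="In a future version, `df.iloc",
        category=FutureWarning,
    )
    df.iloc[:, 0] = "a"  # scalar assignment into a category-dtype column
```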

@ogrisel
Member

ogrisel commented Nov 25, 2022

I pushed a cosmetic commit to:

  • pass categorical_features to the HGBRT models (since we already extract the list of feature names for the PDP plot);
  • use constrained layout consistently to avoid manual adjustments of whitespace (see the sketch below).
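A rough sketch of the two tweaks listed above; the feature names and data are placeholders rather than the example's actual bike-sharing dataset, and the pipeline structure is only assumed to mirror the example:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame(
    {"weather": ["rain", "sun", "rain", "sun"] * 10, "temp": list(range(40))}
)
y = [10, 30, 12, 28] * 10
categorical_features = ["weather"]

# 1. Reuse the categorical feature list for the model itself: after the
#    OrdinalEncoder the categorical columns come first, hence the indices.
preprocessor = make_column_transformer(
    (OrdinalEncoder(), categorical_features), remainder="passthrough"
)
model = make_pipeline(
    preprocessor,
    HistGradientBoostingRegressor(
        categorical_features=list(range(len(categorical_features)))
    ),
).fit(X, y)

# 2. Use constrained layout instead of manual whitespace adjustments.
fig, ax = plt.subplots(figsize=(8, 4), constrained_layout=True)
PartialDependenceDisplay.from_estimator(
    model,
    X,
    features=["weather", "temp"],
    categorical_features=categorical_features,
    ax=ax,
)
```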

Member

@ogrisel left a comment

LGTM! (assuming green CI)

@ogrisel
Member

ogrisel commented Nov 25, 2022

I forgot to update the tests. I am on it.

@jjerphan jjerphan merged commit c1cfc4d into scikit-learn:main Nov 25, 2022
Interpretability / Plotting / Interactive dev automation moved this from Reviewer approved to Done Nov 25, 2022
@jjerphan
Member

Thank you, @madhuracj and @glemaitre! 🎉

@lorentzenchr
Member

It is a very good improvement to have this PR shipped with the next release. Congrats!

On the other hand, I wonder what happened to @NicolasHug's comments in #18298 (comment). I was not able to follow the discussion, and now I fear that we have implemented what @NicolasHug wanted to avoid.

@ogrisel
Member

ogrisel commented Nov 28, 2022

Indeed, I also missed @NicolasHug's comment. I also think it was a good idea... But we can always implement it on top of the existing categorical_features parameter, that is, when the user does not explicitly pass the categorical features.

@glemaitre
Member

This could be implemented with categorical_features="infer" ("auto" could use the "category" dtype).

In the inferring setup, I am not entirely sure how to deal with the case where we detect one continuous and one discretised feature. We would raise an error, but it could actually be two continuous features, one of which simply has few unique values. In that setting, without any parameter, a user would never be able to plot anything. One would then need to expose something like n_unique to get around this issue.

I am also not sure about computing unique values on continuous features.
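Neither "infer" nor "auto" exists in scikit-learn; the following hypothetical sketch only illustrates the dtype-plus-cardinality guess being discussed, with a made-up helper name and threshold:

```python
import pandas as pd


def infer_categorical_features(X: pd.DataFrame, max_unique: int = 10):
    """Hypothetical "infer"-style guess of the categorical columns.

    An "auto" mode as described above would use only the "category" dtype;
    the low-cardinality fallback below is where discretised continuous
    features become ambiguous.
    """
    inferred = []
    for col in X.columns:
        if isinstance(X[col].dtype, pd.CategoricalDtype):
            inferred.append(col)  # unambiguous: explicit categorical dtype
        elif X[col].nunique() <= max_unique:
            inferred.append(col)  # might really be a continuous feature
    return inferred
```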

@NicolasHug
Member

NicolasHug commented Nov 28, 2022

we can always implement it on top of the existing categorical_features parameter, that is, when the user does not explicitly pass the categorical features

This should be possible, although ideally we would only need one parameter instead of 2. If there are plans to support something like #18298 (comment), it might be worth changing the current parameter name from categorical_features to something more generic that can also account for the other parametrization.

Regarding computing unique values on continuous features, this is, I think, a fairly common practice and shouldn't be a deal-breaker for plotting utilities where performance isn't critical. We already do that in PDPs and in the binner of the hist-GBDT.
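A simplified illustration of that existing practice: the PDP grid already calls np.unique on each column and only falls back to quantile-based points when there are more unique values than grid_resolution. This mirrors the spirit of _grid_from_X in sklearn/inspection/_partial_dependence.py, not its exact code:

```python
import numpy as np
from scipy.stats.mstats import mquantiles


def grid_values(column, grid_resolution=100, percentiles=(0.05, 0.95)):
    uniques = np.unique(column)
    if uniques.shape[0] <= grid_resolution:
        return uniques  # cheap even for continuous columns
    # Too many distinct values: use an evenly spaced grid between quantiles.
    low, high = mquantiles(column, prob=percentiles)
    return np.linspace(low, high, num=grid_resolution)
```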

Labels
Enhancement · High Priority · module:inspection · Waiting for Second Reviewer
Projects
Status: Done
Guillaume's pet: REVIEWED AND WAITING FOR CHANGES
Development

Successfully merging this pull request may close these issues.

Enhancement for partial dependence plot