ENH Extend PDP for nominal categorical features #18298
@@ -22,7 +22,8 @@ def plot_partial_dependence(estimator, X, features, *, feature_names=None,
                             grid_resolution=100, percentiles=(0.05, 0.95),
                             method='auto', n_jobs=None, verbose=0,
                             line_kw=None, contour_kw=None, ax=None,
-                            kind='average', subsample=1000):
+                            kind='average', subsample=1000,
+                            is_categorical=None):
     """Partial dependence (PD) and individual conditional expectation (ICE)
     plots.
@@ -209,6 +210,13 @@ def plot_partial_dependence(estimator, X, features, *, feature_names=None,

         .. versionadded:: 0.24

+    is_categorical : list of {boolean, pair of boolean}, default=None
+        Whether each target feature in `features` is categorical or not.
+        The list should be the same size as `features`. If `None`, all
+        features are assumed to be continuous.
+
+        .. versionadded:: 0.24
+
     Returns
     -------
     display: :class:`~sklearn.inspection.PartialDependenceDisplay`
@@ -321,15 +329,53 @@ def convert_feature(fx):
                 f"the (0, 1) range."
             )

+    # NOTE: initialized up front so the kind check below also works when
+    # is_categorical is None (the original diff only set this flag in the
+    # else branch, leaving it undefined in the default case).
+    has_categorical = False
+    if is_categorical is None:
+        is_categorical = []
+        for fxs in features:
+            cats = (False,) if len(fxs) == 1 else (False, False)
+            is_categorical.append(cats)
+    else:
+        if len(features) != len(is_categorical):
+            raise ValueError('Parameter is_categorical should be the same '
+                             'size as features.')
+
+        tmp_categorical = []
+        for cats in is_categorical:
+            if isinstance(cats, bool):
+                cats = (cats,)
+            try:
+                cats = tuple(c for c in cats)
+            except TypeError as e:
+                raise ValueError(
+                    'Each entry in is_categorical must be either a boolean '
+                    'or an iterable of size at most 2.'
+                ) from e
+            if not 1 <= np.size(cats) <= 2:
+                raise ValueError('Each entry in is_categorical must be '
+                                 'either a boolean or an iterable of size '
+                                 'at most 2.')
+            if np.size(cats) == 2 and (cats[0] is True or cats[1] is True):
+                raise ValueError('Contour plots are not supported for '
+                                 'categorical features.')
+            if any(c is True for c in cats):
+                has_categorical = True
+            tmp_categorical.append(cats)
+        is_categorical = tmp_categorical
+
+    if has_categorical and kind != 'average':
+        raise ValueError('It is not possible to display individual effects '
+                         'for categorical features.')
+
     # compute predictions and/or averaged predictions
     pd_results = Parallel(n_jobs=n_jobs, verbose=verbose)(
         delayed(partial_dependence)(estimator, X, fxs,
                                     response_method=response_method,
                                     method=method,
                                     grid_resolution=grid_resolution,
                                     percentiles=percentiles,
-                                    kind=kind)
-        for fxs in features)
+                                    kind=kind,
+                                    is_categorical=cats)
+        for fxs, cats in zip(features, is_categorical))

     # For multioutput regression, we can only check the validity of target
     # now that we have the predictions.
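The validation block above normalizes `is_categorical` so that every entry becomes a tuple of booleans matching the shape of `features`, and rejects 2-way categorical plots and non-`'average'` kinds. A standalone sketch of that normalization (the helper name is hypothetical, not scikit-learn API):

```python
def check_is_categorical(features, is_categorical, kind='average'):
    """Normalize is_categorical into a list of boolean tuples, one per
    entry of `features`, mirroring the validation logic in this diff."""
    if is_categorical is None:
        # default: every target feature is treated as continuous
        return [(False,) if len(fxs) == 1 else (False, False)
                for fxs in features]
    if len(features) != len(is_categorical):
        raise ValueError('Parameter is_categorical should be the same '
                         'size as features.')
    normalized, has_categorical = [], False
    for cats in is_categorical:
        if isinstance(cats, bool):
            cats = (cats,)   # scalar boolean -> 1-tuple
        try:
            cats = tuple(cats)
        except TypeError:
            raise ValueError('Each entry in is_categorical must be either '
                             'a boolean or an iterable of size at most 2.')
        if not 1 <= len(cats) <= 2:
            raise ValueError('Each entry in is_categorical must be either '
                             'a boolean or an iterable of size at most 2.')
        if len(cats) == 2 and any(cats):
            # 2-way (contour) plots with a categorical axis are rejected
            raise ValueError('Contour plots are not supported for '
                             'categorical features.')
        has_categorical = has_categorical or any(cats)
        normalized.append(cats)
    if has_categorical and kind != 'average':
        raise ValueError('It is not possible to display individual '
                         'effects for categorical features.')
    return normalized
```

For example, `check_is_categorical([(0,), (1, 2)], None)` yields `[(False,), (False, False)]`, matching the shape of `features`.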
@@ -362,10 +408,11 @@ def convert_feature(fx):
             pdp_lim[n_fx] = (min_pd, max_pd)

     deciles = {}
-    for fx in chain.from_iterable(features):
-        if fx not in deciles:
-            X_col = _safe_indexing(X, fx, axis=1)
-            deciles[fx] = mquantiles(X_col, prob=np.arange(0.1, 1.0, 0.1))
+    for fxs, cats in zip(features, is_categorical):
+        for fx, cat in zip(fxs, cats):
+            if not cat and fx not in deciles:
+                X_col = _safe_indexing(X, fx, axis=1)
+                deciles[fx] = mquantiles(X_col,
+                                         prob=np.arange(0.1, 1.0, 0.1))

     display = PartialDependenceDisplay(pd_results=pd_results,
                                        features=features,
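The hunk above skips decile computation for categorical columns, since decile tick marks only make sense on a continuous axis. A minimal standalone sketch of the same idea (helper name is hypothetical, and `numpy.percentile` stands in for scipy's `mquantiles`):

```python
import numpy as np

def compute_deciles(X, features, is_categorical):
    """Compute deciles for continuous target features only; categorical
    columns are skipped because quantile marks are meaningless there."""
    deciles = {}
    for fxs, cats in zip(features, is_categorical):
        for fx, cat in zip(fxs, cats):
            if not cat and fx not in deciles:
                # 10th..90th percentiles, like np.arange(0.1, 1.0, 0.1)
                deciles[fx] = np.percentile(X[:, fx], np.arange(10, 100, 10))
    return deciles
```

With feature 1 flagged categorical, only feature 0 gets a decile entry.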
@@ -374,7 +421,8 @@ def convert_feature(fx):
                                        pdp_lim=pdp_lim,
                                        deciles=deciles,
                                        kind=kind,
-                                       subsample=subsample)
+                                       subsample=subsample,
+                                       is_categorical=is_categorical)
     return display.plot(ax=ax, n_cols=n_cols, line_kw=line_kw,
                         contour_kw=contour_kw)
@@ -453,6 +501,13 @@ class PartialDependenceDisplay:

         .. versionadded:: 0.24

+    is_categorical : list of (bool,) or list of (bool, bool), default=None
Review thread on this parameter:

- Is it possible to make a 2-way partial dependence plot where only one of the features is categorical, meaning something like:
- I believe not. That's what's being checked at https://github.com/scikit-learn/scikit-learn/pull/18298/files#diff-51e11eb022da44742c62c8a5d33c00d4R357-R359. Probably best to specify that in the docs itself?
- So we could only accept
- My idea was to follow the same shape of `features` (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/inspection/_plot/partial_dependence.py#L405-L408) as discussed in #14969 (comment).
- We can support
- Actually, we should replace it with
glemaitre marked this conversation as resolved.
+        Whether each target feature in `features` is categorical or not.
+        The list should be the same size as `features`. If `None`, all
+        features are assumed to be continuous.
+
+        .. versionadded:: 0.24
+
     Attributes
     ----------
     bounding_ax_ : matplotlib Axes or None
@@ -501,7 +556,8 @@ class PartialDependenceDisplay:
     """
     @_deprecate_positional_args
     def __init__(self, pd_results, *, features, feature_names, target_idx,
-                 pdp_lim, deciles, kind='average', subsample=1000):
+                 pdp_lim, deciles, kind='average', subsample=1000,
+                 is_categorical=None):
         self.pd_results = pd_results
         self.features = features
         self.feature_names = feature_names

@@ -510,6 +566,7 @@ def __init__(self, pd_results, *, features, feature_names, target_idx,
         self.deciles = deciles
         self.kind = kind
         self.subsample = subsample
+        self.is_categorical = is_categorical

     def _get_sample_count(self, n_samples):
         if isinstance(self.subsample, numbers.Integral):
@@ -520,7 +577,8 @@ def _get_sample_count(self, n_samples):
             return ceil(n_samples * self.subsample)
         return n_samples

-    def plot(self, ax=None, n_cols=3, line_kw=None, contour_kw=None):
+    def plot(self, ax=None, n_cols=3, line_kw=None, contour_kw=None,
+             bar_kw=None):
         """Plot partial dependence plots.

         Parameters
@@ -547,6 +605,10 @@ def plot(self, ax=None, n_cols=3, line_kw=None, contour_kw=None):
             Dict with keywords passed to the `matplotlib.pyplot.contourf`
             call for two-way partial dependence plots.

+        bar_kw : dict, default=None
+            Dict with keywords passed to the `matplotlib.pyplot.bar`
+            call for categorical partial dependence plots.
+
         Returns
         -------
         display: :class:`~sklearn.inspection.PartialDependenceDisplay`
@@ -563,6 +625,8 @@ def plot(self, ax=None, n_cols=3, line_kw=None, contour_kw=None):
         if line_kw is None:
             line_kw = {}
         if contour_kw is None:
             contour_kw = {}
+        if bar_kw is None:
+            bar_kw = {}

         if ax is None:
             _, ax = plt.subplots()
@@ -574,6 +638,9 @@ def plot(self, ax=None, n_cols=3, line_kw=None, contour_kw=None):
         line_kw = {**default_line_kws, **line_kw}
         individual_line_kw = line_kw.copy()

+        default_bar_kws = {'color': 'C0'}
+        bar_kw = {**default_bar_kws, **bar_kw}
+
         if self.kind == 'individual' or self.kind == 'both':
             individual_line_kw['alpha'] = 0.3
             individual_line_kw['linewidth'] = 0.5
@@ -609,6 +676,7 @@ def plot(self, ax=None, n_cols=3, line_kw=None, contour_kw=None):
                 self.lines_ = np.empty((n_rows, n_cols, n_sampled),
                                        dtype=object)
                 self.contours_ = np.empty((n_rows, n_cols), dtype=object)
+                self.bars_ = np.empty((n_rows, n_cols), dtype=object)

             axes_ravel = self.axes_.ravel()
@@ -637,6 +705,7 @@ def plot(self, ax=None, n_cols=3, line_kw=None, contour_kw=None):
                 self.lines_ = np.empty(ax.shape + (n_sampled,),
                                        dtype=object)
                 self.contours_ = np.empty_like(ax, dtype=object)
+                self.bars_ = np.empty_like(ax, dtype=object)

         # create contour levels for two-way plots
         if 2 in self.pdp_lim:
@@ -648,11 +717,15 @@ def plot(self, ax=None, n_cols=3, line_kw=None, contour_kw=None):
         # Create 1d views of these 2d arrays for easy indexing
         lines_ravel = self.lines_.ravel(order='C')
         contours_ravel = self.contours_.ravel(order='C')
+        bars_ravel = self.bars_.ravel(order='C')
         vlines_ravel = self.deciles_vlines_.ravel(order='C')
         hlines_ravel = self.deciles_hlines_.ravel(order='C')

-        for i, axi, fx, pd_result in zip(count(), self.axes_.ravel(),
-                                         self.features, self.pd_results):
+        for i, axi, fx, cat, pd_result in zip(count(),
+                                              self.axes_.ravel(),
+                                              self.features,
+                                              self.is_categorical,
+                                              self.pd_results):

             avg_preds = None
             preds = None
@@ -679,10 +752,17 @@ def plot(self, ax=None, n_cols=3, line_kw=None, contour_kw=None):
                         values[0], ins.ravel(), **individual_line_kw
                     )[0]
                 if self.kind == 'average':
-                    lines_ravel[i] = axi.plot(
-                        values[0], avg_preds[self.target_idx].ravel(),
-                        **line_kw
-                    )[0]
+                    if cat[0] is True:
+                        # store the BarContainer at index i (the original
+                        # diff wrote `bars_ravel[1]` and indexed the
+                        # container with `[0]`, both typos)
+                        bars_ravel[i] = axi.bar(
+                            values[0], avg_preds[self.target_idx].ravel(),
+                            **bar_kw
+                        )
+                        axi.tick_params(axis='x', rotation=90)
+                    else:
+                        lines_ravel[i] = axi.plot(
+                            values[0], avg_preds[self.target_idx].ravel(),
+                            **line_kw
+                        )[0]
                 elif self.kind == 'both':
                     lines_ravel[i] = axi.plot(
                         values[0], avg_preds[self.target_idx].ravel(),
@@ -705,8 +785,9 @@ def plot(self, ax=None, n_cols=3, line_kw=None, contour_kw=None):
                 trans = transforms.blended_transform_factory(axi.transData,
                                                              axi.transAxes)
                 ylim = axi.get_ylim()
-                vlines_ravel[i] = axi.vlines(self.deciles[fx[0]], 0, 0.05,
-                                             transform=trans, color='k')
+                if self.deciles.get(fx[0], None) is not None:
+                    vlines_ravel[i] = axi.vlines(self.deciles[fx[0]], 0, 0.05,
+                                                 transform=trans, color='k')
                 axi.set_ylim(ylim)

                 # Set xlabel if it is not already set
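For a categorical feature, the heights handed to `axi.bar` above are the averaged predictions obtained by substituting each category into the feature column in turn. A toy sketch of that computation (the `predict` callable and the helper name are stand-ins for illustration, not scikit-learn API):

```python
import numpy as np

def categorical_partial_dependence(predict, X, col, categories):
    """For each category, overwrite column `col` for all samples and
    average the resulting predictions: the bar heights of a categorical
    partial dependence plot."""
    X = np.array(X, dtype=object)
    averaged = []
    for cat in categories:
        X_eval = X.copy()
        X_eval[:, col] = cat      # force every sample into this category
        averaged.append(np.mean(predict(X_eval)))
    return np.asarray(averaged)

# toy model: prediction is 1.0 exactly when column 1 equals 'a'
predict = lambda X: (X[:, 1] == 'a').astype(float)
X = [[0, 'a'], [1, 'b'], [2, 'a']]
avg = categorical_partial_dependence(predict, X, col=1,
                                     categories=['a', 'b'])
# avg -> [1.0, 0.0]
```

The resulting array lines up with `categories`, so it can be plotted directly with `bar(categories, avg)`.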
@@ -131,7 +131,8 @@ def test_grid_from_X():
     grid_resolution = 100
     X = np.asarray([[1, 2],
                     [3, 4]])
-    grid, axes = _grid_from_X(X, percentiles, grid_resolution)
+    grid, axes = _grid_from_X(X, percentiles, grid_resolution,
+                              is_categorical=[False, False])
     assert_array_equal(grid, [[1, 2],
                               [1, 4],
                               [3, 2],
@@ -145,15 +146,17 @@ def test_grid_from_X():

     # n_unique_values > grid_resolution
     X = rng.normal(size=(20, 2))
-    grid, axes = _grid_from_X(X, percentiles, grid_resolution=grid_resolution)
+    grid, axes = _grid_from_X(X, percentiles, grid_resolution=grid_resolution,
+                              is_categorical=[False, False])
     assert grid.shape == (grid_resolution * grid_resolution, X.shape[1])
     assert np.asarray(axes).shape == (2, grid_resolution)

     # n_unique_values < grid_resolution, will use actual values
     n_unique_values = 12
     X[n_unique_values - 1:, 0] = 12345
     rng.shuffle(X)  # just to make sure the order is irrelevant
-    grid, axes = _grid_from_X(X, percentiles, grid_resolution=grid_resolution)
+    grid, axes = _grid_from_X(X, percentiles, grid_resolution=grid_resolution,
+                              is_categorical=[False, False])
     assert grid.shape == (n_unique_values * grid_resolution, X.shape[1])
     # axes is a list of arrays of different shapes
     assert axes[0].shape == (n_unique_values,)
@@ -174,7 +177,8 @@ def test_grid_from_X_error(grid_resolution, percentiles, err_msg):
     X = np.asarray([[1, 2], [3, 4]])
     with pytest.raises(ValueError, match=err_msg):
         _grid_from_X(
-            X, grid_resolution=grid_resolution, percentiles=percentiles
+            X, grid_resolution=grid_resolution, percentiles=percentiles,
+            is_categorical=[False, False]
         )
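These tests thread `is_categorical=[False, False]` through to `_grid_from_X`. A hypothetical sketch of the behavior that argument is expected to enable: categorical columns contribute their unique values as a grid axis, while continuous columns get a percentile linspace (capped at the number of unique values, as in the existing function). The function name and details here are assumptions for illustration, not the actual scikit-learn implementation:

```python
import itertools
import numpy as np

def grid_from_X(X, percentiles=(0.05, 0.95), grid_resolution=100,
                is_categorical=None):
    """Build the cartesian grid of values at which partial dependence
    is evaluated, treating categorical columns specially."""
    X = np.asarray(X, dtype=object)
    if is_categorical is None:
        is_categorical = [False] * X.shape[1]
    axes = []
    for col, cat in enumerate(is_categorical):
        uniques = np.unique(X[:, col])
        if cat or len(uniques) < grid_resolution:
            # categorical axis, or fewer unique values than the
            # resolution: use the actual values
            axes.append(uniques)
        else:
            lo, hi = np.percentile(X[:, col].astype(float),
                                   [100 * p for p in percentiles])
            axes.append(np.linspace(lo, hi, grid_resolution))
    grid = np.array(list(itertools.product(*axes)), dtype=object)
    return grid, axes
```

With one numeric and one string column, the grid is the 2x2 cross product of their values.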
@@ -644,12 +648,18 @@ def test_partial_dependence_dataframe(estimator, preprocessor, features):

 @pytest.mark.parametrize(
     "features, expected_pd_shape",
-    [(0, (3, 10)),
-     (iris.feature_names[0], (3, 10)),
-     ([0, 2], (3, 10, 10)),
-     ([iris.feature_names[i] for i in (0, 2)], (3, 10, 10)),
-     ([True, False, True, False], (3, 10, 10))],
-    ids=['scalar-int', 'scalar-str', 'list-int', 'list-str', 'mask']
+    [
+        # (0, (3, 10)),
+        # (iris.feature_names[0], (3, 10)),
+        ([0, 2], (3, 10, 10)),
+        ([iris.feature_names[i] for i in (0, 2)], (3, 10, 10)),
+        # ([True, False, True, False], (3, 10, 10))
+    ],
+    ids=[
+        # 'scalar-int', 'scalar-str',
+        'list-int', 'list-str',
+        # 'mask'
+    ]
 )
 def test_partial_dependence_feature_type(features, expected_pd_shape):
     # check all possible features type supported in PDP

Review thread on the commented-out cases:

- Even though the API spec of
- The use case was supported before but the documentation is misleading. Basically we should support a sequence of {int, str} or tuple of {int, str}. So the test here creates a scalar or a tuple (actually this is a list) which will be added to another list within the test. Maybe changing the documentation with the following would be less ambiguous:
Review thread on the `is_categorical` parameter of `partial_dependence()`:

- This is the only place where we make use of the `is_categorical` information in `partial_dependence()`, right? If yes, let's just remove this parameter altogether from `partial_dependence()`. We can just tell users that the grid resolution should always be higher than the max number of categories, which in most cases should be verified. And this is something we can enforce and check in `plot_partial_dependence()`, before `partial_dependence()` is called. This shouldn't lead to any problem in 99.999% of the use-cases of that function, I think. Once we remove the parameter from `partial_dependence()`, we don't have to worry about API discrepancy anymore; we can just bikeshed on what the parameter should be for `plot_partial_dependence()` :)
- Yes, that's right. Thanks @NicolasHug
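The check the reviewer proposes, enforcing a sufficiently large `grid_resolution` in `plot_partial_dependence()` instead of threading `is_categorical` into `partial_dependence()`, could look roughly like this (hypothetical sketch, not code from the PR):

```python
import numpy as np

def check_grid_resolution(X, features, is_categorical, grid_resolution):
    """Verify up front that grid_resolution exceeds the cardinality of
    every categorical target feature, so the grid falls back to the
    actual category values instead of a percentile linspace."""
    X = np.asarray(X, dtype=object)
    for fxs, cats in zip(features, is_categorical):
        for fx, cat in zip(fxs, cats):
            if not cat:
                continue
            n_cats = len(np.unique(X[:, fx]))
            if grid_resolution <= n_cats:
                raise ValueError(
                    f'grid_resolution ({grid_resolution}) must be larger '
                    f'than the number of categories ({n_cats}) of feature '
                    f'{fx}.')
```

With the default `grid_resolution=100` and typical categorical cardinalities this check passes silently, which is why the reviewer expects it to cover almost all use-cases.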