ENH Add Friedman's H-squared #28375
Open: mayer79 wants to merge 65 commits into scikit-learn:main from mayer79:friedmans-h (+732 −3)
65 commits
2f06cbc  Add Friedman's H-squared of pairwise interaction statistics (mayer79)
38f3d6c  run black (mayer79)
f173c51  np.unique() does not work well for non-standard values (mayer79)
2e7b731  Reorganize imports (mayer79)
64581e2  run ruff (mayer79)
8e50637  check weights and is fitted
efa4071  Replace compression logic by try except
f53b838  Switch to Bunch output
032e661  Clip small numerators
1335f05  Apply suggestions from code review (mayer79)
a28e8e0  use sample_without_replacement, plus some docstring improvements (mayer79)
66f7bdd  fix existing problems (mayer79)
6db5ac7  Merge branch 'main' into friedmans-h (mayer79)
5beb941  Rename things and fix imports (mayer79)
9823fe5  More compact output organization, faster example (mayer79)
5aebb4c  Add formula to docstring (mayer79)
36ed7a8  Add preliminary unit tests (mayer79)
27e3540  Compare against two R packages (mayer79)
e5f6e53  Split calculate_pd_over_data into two plus some optimizations (mayer79)
a2b659d  Fix typos in docstring (mayer79)
69111e3  Fix example output in docstring (mayer79)
0416de3  Apply suggestions from code review (mayer79)
3fe5d64  More changed from review (mayer79)
c583aeb  add validate_params() (mayer79)
a443f02  Add h_statistic to test_public_functions.py (mayer79)
a3aaed4  Possession apostrophs (mayer79)
95ed5de  Add docu (mayer79)
7f4527f  Add entry to classes.rst (mayer79)
0ee0f7a  reorder position in classes.rst (mayer79)
309edef  Merge branch 'main' into friedmans-h (mayer79)
a93a4c9  safe assign and indexing have moved (mayer79)
6b63a55  Fix doctest failure (mayer79)
3923d73  fix docstring failure attempt 2 (mayer79)
d406dc6  doc tests do not seem to allow multiline command in parantheses (mayer79)
70b9f68  docstring checks do not like black (mayer79)
3ed07af  assign result of plot in docu (mayer79)
c7b798e  superfluous newline in docstring of function (mayer79)
7e4ff8b  Replace plot by print() (mayer79)
c5e56ab  rst docu: reformat code (mayer79)
9ed9b2b  Change intendation in example output of rst docu (mayer79)
0b29fc8  doctest failure (mayer79)
5845331  documentation: image is better than print (mayer79)
91c2d65  Intendation fix (mayer79)
25a3f6d  fix doctest issues (mayer79)
72ce9c7  Review Lorentzen (mayer79)
cde7dae  switch to pred_fun argument in helper (mayer79)
a3f1beb  Initialize all resulting numpy arrays (mayer79)
ffc6b77  Too long line in docstring (mayer79)
ef20380  Doctest failure (mayer79)
3a34272  move example from rst file to plot_partial_dependence.py (mayer79)
4658ee6  Reformat example output (mayer79)
9bc1ff8  Fixing plot (mayer79)
9b8a1d6  Fix reference (mayer79)
a5cfa08  maybe we need copy() (mayer79)
f36a913  Try second copy() (mayer79)
54e98c8  Remove copy again (mayer79)
131910b  fix dupe index issue with old pandas (mayer79)
00f0eed  Fix typo (mayer79)
90ffd7b  add column names to pandas unit test (mayer79)
387d997  Fix problem related to #28931 (mayer79)
ea50a48  merge (mayer79)
00a4f19  Merge branch 'main' into friedmans-h and add changelog entry (mayer79)
9f58f8a  Review of docu (mayer79)
00d0f63  Fix findings of docu review (mayer79)
275c889  revert change in plot_partial_dependence.py (mayer79)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,130 @@ | ||
|
||
.. _h_statistic: | ||
|
||
=============================================================== | ||
Friedman and Popescu's H-Statistic | ||
=============================================================== | ||
|
||
.. currentmodule:: sklearn.inspection | ||
|
||
What distinguishes a white-box model (think linear model) from a black-box model (e.g., boosted trees)? | ||
One main difference is that the latter typically contains many complicated interaction effects. | ||
|
||
Such interaction effects can be visualized by two-dimensional or stratified | ||
partial dependence plots (PDP). But how can we find out between *which feature pairs* | ||
the strongest interactions occur? | ||
|
||
One approach is to study pairwise H-statistics, introduced by Friedman and Popescu | ||
in [F2008]_. The H-statistic of two features measures the proportion of their joint effect | ||
variability that is due to their pairwise interaction. | ||
|
||
The figure below shows H-statistics and their unnormalized counterparts for | ||
the bike sharing dataset, with a | ||
:class:`~sklearn.ensemble.HistGradientBoostingRegressor`: | ||
|
||
.. figure:: ../auto_examples/inspection/images/sphx_glr_plot_partial_dependence_010.png | ||
:target: ../auto_examples/inspection/plot_partial_dependence.html | ||
:align: center | ||
:scale: 70 | ||
|
||
The statistics have been computed for the five most important features. | ||
|
||
Mathematical definition | ||
======================= | ||
|
||
**Partial dependence** | ||
|
||
Let :math:`F: \mathbb{R}^p \to \mathbb{R}` denote the prediction function that | ||
maps the :math:`p`-dimensional feature vector :math:`\mathbf x = (x_1, \dots, x_p)` | ||
to its prediction. | ||
Furthermore, let :math:`F_s(\mathbf x_s) = E_{\mathbf x_{\setminus s}}(F(\mathbf x_s, \mathbf x_{\setminus s}))` | ||
be the partial dependence function of :math:`F` on the feature subset | ||
:math:`\mathbf x_s`, where :math:`s \subseteq \{1, \dots, p\}`. | ||
Here, the expectation runs over the joint marginal distribution of features | ||
:math:`\mathbf x_{\setminus s}` not in :math:`\mathbf x_s`. | ||
|
||
Given data, :math:`F_s(\mathbf x_s)` can be estimated by the empirical partial | ||
dependence function | ||
|
||
.. math:: | ||
\hat F_s(\mathbf x_s) = \frac{1}{n} \sum_{i = 1}^n F(\mathbf x_s, \mathbf x_{i \setminus s}), | ||
|
||
where :math:`\mathbf x_{i\setminus s}`, :math:`i = 1, \dots, n`, | ||
are the observed values of :math:`\mathbf x_{\setminus s}` in some "background" dataset. | ||
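
A minimal sketch of the empirical partial dependence estimate above, using plain NumPy. The helper name ``empirical_pd`` and the toy prediction function ``toy_predict`` are assumptions made for illustration; they are not part of this PR or of scikit-learn.

```python
import numpy as np


def empirical_pd(predict, X_background, s, x_s):
    """Average prediction when the columns in `s` are fixed to `x_s`.

    Hypothetical helper: `predict` maps an (n, p) array to n predictions.
    """
    X_mod = np.array(X_background, dtype=float, copy=True)
    X_mod[:, s] = x_s  # overwrite the features in s for every background row
    return predict(X_mod).mean()  # mean over the n background rows


def toy_predict(X):
    # Toy stand-in for F: an additive part plus an x_0 * x_2 interaction.
    return X[:, 0] + 2.0 * X[:, 1] + X[:, 0] * X[:, 2]


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Partial dependence on feature 0, evaluated at x_0 = 1.5.
pd_value = empirical_pd(toy_predict, X, s=[0], x_s=[1.5])
```

For this toy function the result equals ``1.5 + 2 * mean(x_1) + 1.5 * mean(x_2)`` over the background rows, which makes the averaging step easy to check by hand.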
|
||
**Pairwise H-statistic** | ||
|
||
Following [F2008]_, if there are no interaction effects between features | ||
:math:`x_j` and :math:`x_k`, their two-dimensional partial dependence function | ||
:math:`F_{jk}` can be written as the sum of the univariate partial dependencies, i.e., | ||
|
||
.. math:: | ||
F_{jk}(x_j, x_k) = F_j(x_j) + F_k(x_k). | ||
|
||
Correspondingly, Friedman and Popescu's H-statistic of pairwise interaction strength | ||
is defined as | ||
|
||
.. math:: | ||
|
||
H_{jk}^2 = A_{jk} / B_{jk}, | ||
|
||
where | ||
|
||
.. math:: | ||
|
||
A_{jk} = \frac{1}{n} \sum_{i = 1}^n\big[\hat F_{jk}(x_{ij}, x_{ik}) - \hat F_j(x_{ij}) - \hat F_k(x_{ik})\big]^2 | ||
|
||
and | ||
|
||
.. math:: | ||
|
||
B_{jk} = \frac{1}{n} \sum_{i = 1}^n\big[\hat F_{jk}(x_{ij}, x_{ik})\big]^2. | ||
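
The definitions of :math:`A_{jk}` and :math:`B_{jk}` can be sketched directly in NumPy. This is a brute-force illustration under stated assumptions, not the implementation in this PR: the function names ``pd_over_data`` and ``h2_pairwise`` are hypothetical, partial dependence values are centered to mean 0, and no subsampling or weighting is applied.

```python
import numpy as np


def pd_over_data(predict, X, cols):
    """Centered empirical partial dependence on `cols`, evaluated at each row of X."""
    n = X.shape[0]
    out = np.empty(n)
    for i in range(n):
        X_mod = X.copy()
        X_mod[:, cols] = X[i, cols]  # fix the features in `cols` to row i's values
        out[i] = predict(X_mod).mean()
    return out - out.mean()


def h2_pairwise(predict, X, j, k):
    """Return (A_jk, B_jk, H2_jk) for features j and k (hypothetical helper)."""
    f_jk = pd_over_data(predict, X, [j, k])
    f_j = pd_over_data(predict, X, [j])
    f_k = pd_over_data(predict, X, [k])
    a = np.mean((f_jk - f_j - f_k) ** 2)  # unnormalized statistic A_jk
    b = np.mean(f_jk**2)  # normalizer B_jk
    return a, b, (a / b if b > 0 else 0.0)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Purely additive toy model: the numerator should vanish up to rounding.
a_add, b_add, h2_add = h2_pairwise(lambda Z: Z[:, 0] + 2.0 * Z[:, 1], X, 0, 1)

# Pure product of two roughly centered features: interaction dominates.
a_int, b_int, h2_int = h2_pairwise(lambda Z: Z[:, 0] * Z[:, 1], X, 0, 1)
```

Each call to ``pd_over_data`` predicts on :math:`n` modified copies of the data, so the cost per feature pair grows quadratically in :math:`n`.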
|
||
Remarks | ||
======= | ||
|
||
1. Partial dependence functions are centered to mean 0. | ||
2. Partial dependence functions are evaluated over the data distribution. | ||
This differs from partial dependence plots, where one uses a fixed grid. | ||
3. Weighted versions follow by replacing all arithmetic means by corresponding weighted averages. | ||
4. Multi-output prediction (e.g., probabilistic classification) is handled component-wise. | ||
5. Due to undesired extrapolation of partial dependence functions, values above 1 may occur. | ||
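
Remark 3 can be illustrated with ``np.average``. The sketch below assumes the partial dependence values at the data points are already available as arrays; the function name ``weighted_h2`` is hypothetical. In a full weighted version, the averages inside the partial dependence estimates themselves would be weighted as well.

```python
import numpy as np


def weighted_h2(f_jk, f_j, f_k, sample_weight=None):
    """H-squared from partial dependence values, using weighted averages.

    Hypothetical helper; `sample_weight=None` falls back to plain means.
    """

    def center(v):
        # Center to (weighted) mean 0, as in remark 1.
        return v - np.average(v, weights=sample_weight)

    f_jk, f_j, f_k = center(f_jk), center(f_j), center(f_k)
    a = np.average((f_jk - f_j - f_k) ** 2, weights=sample_weight)
    b = np.average(f_jk**2, weights=sample_weight)
    return a / b if b > 0 else 0.0


f_jk = np.array([1.0, -2.0, 0.5])
f_j = np.array([0.5, -1.0, 0.0])
f_k = np.array([0.2, -0.5, 0.4])

h2_weighted = weighted_h2(f_jk, f_j, f_k, sample_weight=np.array([1.0, 2.0, 1.0]))

# Integer weights behave like repeating the corresponding observations.
rep = [0, 1, 1, 2]
h2_repeated = weighted_h2(f_jk[rep], f_j[rep], f_k[rep])
```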
|
||
Interpretation | ||
============== | ||
|
||
* The statistic provides the proportion of joint effect variability explained by the interaction. | ||
* A value of 0 means no interaction. | ||
* If main effects are weak, even a small interaction effect can yield a large value of the statistic. | ||
Therefore, it often makes sense to study the unnormalized statistics :math:`A_{jk}` or, | ||
to stay on the scale of the predictions, their square roots :math:`\sqrt{A_{jk}}`. | ||
|
||
Workflow | ||
======== | ||
|
||
Calculating all pairwise H-statistics has a computational complexity of :math:`O(n^2p^2)`. | ||
Therefore, our implementation randomly selects ``n_max = 500`` rows from the provided dataset ``X``. | ||
Furthermore, if the number of features :math:`p` is large, use some feature importance measure | ||
to select the most important features and pass them via the ``features`` argument (``None`` in the signature). | ||
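
The two cost-saving steps of the workflow can be sketched as follows. The commit list mentions ``sample_without_replacement``; the version below uses NumPy's ``Generator.choice`` instead to stay self-contained. The helper name ``subsample_rows`` and the feature indices in ``top_features`` are assumptions for illustration only.

```python
from itertools import combinations

import numpy as np


def subsample_rows(X, n_max=500, random_state=0):
    """Return at most `n_max` rows of X, drawn without replacement."""
    n = X.shape[0]
    if n <= n_max:
        return X  # nothing to do for small datasets
    rng = np.random.default_rng(random_state)
    idx = rng.choice(n, size=n_max, replace=False)
    return X[idx]


X_big = np.arange(4000, dtype=float).reshape(1000, 4)
X_sub = subsample_rows(X_big)

# Restrict the pairwise loop to pre-selected features (indices are made up).
top_features = [0, 2, 3]
pairs = list(combinations(top_features, 2))
```

Restricting to :math:`m` features reduces the number of pairs from :math:`p(p-1)/2` to :math:`m(m-1)/2`, while the subsample bounds the quadratic cost in :math:`n`.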
|
||
Limitations | ||
=========== | ||
|
||
1. H-statistics are based on partial dependence estimates. Therefore, they are | ||
just as good or poor as these. The major problem of partial dependence is | ||
the application of the model to unseen and/or impossible feature combinations. | ||
As a consequence, H-statistics, which should lie between 0 and 1, | ||
can exceed 1 in extreme cases. | ||
2. Due to their computational complexity, H-statistics are usually evaluated on | ||
relatively small subsets of the data. Consequently, the estimates are | ||
typically not very robust. | ||
|
||
.. topic:: Examples: | ||
|
||
* :ref:`sphx_glr_auto_examples_inspection_plot_partial_dependence.py` | ||
|
||
.. topic:: References | ||
|
||
.. [F2008] J. H. Friedman and B. E. Popescu, | ||
"Predictive Learning via Rule Ensembles", | ||
The Annals of Applied Statistics, 2(3), 916-954, 2008. |
No need for an ndarray here.
We need to sort the labels later in the same order as the values. So I am undoing this change. (Actually, a corresponding pandas code would be extremely compact.)