
ENH Add Friedman's H-squared #28375

Open · mayer79 wants to merge 57 commits into main
Conversation

mayer79 (Contributor) commented Feb 6, 2024:

Reference Issues/PRs

Implements #22383

What does this implement/fix? Explain your changes.

@lorentzenchr

This PR implements a clean version of Friedman's H^2 statistic of pairwise interaction strength. It uses a couple of tricks to speed up the calculations. Still, one needs to be cautious when including more than 6-8 features. The basic strategy is to select, e.g., the top five predictors via permutation importance and then crunch the corresponding pairwise (absolute and relative) interaction strength statistics.

(My) reference implementation: https://github.com/mayer79/hstats
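For readers new to the statistic, here is a rough brute-force sketch (not this PR's implementation; h2_pairwise is just an illustrative name) built on the public partial_dependence helper. It centers the partial dependence functions and evaluates them on a shared quantile grid, which only approximates Friedman's weighting by the observed data points:

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import partial_dependence

def h2_pairwise(model, X, j, k, grid_resolution=20):
    # Univariate and bivariate partial dependence; per feature, both calls
    # use the same quantile grid, so the axes line up.
    pd_j = partial_dependence(model, X, [j], grid_resolution=grid_resolution)
    pd_k = partial_dependence(model, X, [k], grid_resolution=grid_resolution)
    pd_jk = partial_dependence(model, X, [(j, k)], grid_resolution=grid_resolution)
    # Center each partial dependence function (as in Friedman's definition).
    f_j = pd_j["average"][0] - pd_j["average"][0].mean()
    f_k = pd_k["average"][0] - pd_k["average"][0].mean()
    f_jk = pd_jk["average"][0] - pd_jk["average"][0].mean()
    # H^2: share of the joint effect's variability not captured by f_j + f_k.
    return ((f_jk - f_j[:, None] - f_k[None, :]) ** 2).sum() / (f_jk**2).sum()

X, y = make_friedman1(n_samples=500, random_state=0)
model = HistGradientBoostingRegressor(random_state=0).fit(X, y)
print(h2_pairwise(model, X, 0, 1))  # x0 and x1 interact via 10 * sin(pi * x0 * x1)
print(h2_pairwise(model, X, 0, 5))  # x5 is pure noise, so H^2 should be near 0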

Any other comments?

  • The implementation also works for multi-output or multi-class classification.
  • Plots might follow in a later PR.
  • Univariate H-statistics also exist, but I have not added them (yet). They measure the proportion of prediction variability explained only by interactions involving feature j (see the formula below). We need to keep this in mind when thinking about the output API.
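For reference on the univariate version: Friedman & Popescu (2008) compare the centered full prediction with the sum of the centered partial dependence on feature $j$ and on all other features:

$$H_j^2 = \frac{\sum_{i=1}^n \left[\hat F(x_i) - \hat F_j(x_{ij}) - \hat F_{\setminus j}(x_{i,\setminus j})\right]^2}{\sum_{i=1}^n \hat F(x_i)^2}$$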

github-actions bot commented Feb 6, 2024:

✔️ Linting Passed — all linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit 131910b.

lorentzenchr marked this pull request as draft on February 8, 2024.
lorentzenchr changed the title from "ENH Add Friedman's H-squared (WIP - DO NOT MERGE)" to "ENH Add Friedman's H-squared" on February 8, 2024.
lorentzenchr (Member) commented:

@mayer79 Thanks for working on this important inspection tool. To get rid of the linter issues, you might use a pre-commit hook; see https://scikit-learn.org/dev/developers/contributing.html#how-to-contribute.

@amueller @glemaitre @adrinjalali ping as this might interest you.

lorentzenchr (Member) left a review comment:

A first quick pass. Maybe _partial_dependence_brute can help with the tests.

(8 review threads on sklearn/inspection/_friedmans_h.py, since outdated and resolved)
lorentzenchr linked an issue on Feb 8, 2024 that may be closed by this pull request.
lorentzenchr (Member) commented Feb 9, 2024:

I keep struggling over the fact that "Friedman's H-statistic" is actually an H-squared.

The naming will pop up during further review anyway. One possibility would be h2_statistics.
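For reference, Friedman & Popescu (2008) indeed define the pairwise statistic directly on the squared scale, with mean-centered partial dependence functions:

$$H_{jk}^2 = \frac{\sum_{i=1}^n \left[\hat F_{jk}(x_{ij}, x_{ik}) - \hat F_j(x_{ij}) - \hat F_k(x_{ik})\right]^2}{\sum_{i=1}^n \hat F_{jk}(x_{ij}, x_{ik})^2}$$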

lorentzenchr (Member) left a review comment:

Another round. Review of the user guide is still missing.

(5 review threads on sklearn/inspection/_h_statistic.py and 2 on sklearn/inspection/tests/test_h_statistic.py, mostly resolved)
glemaitre self-requested a review on April 30, 2024.
mayer79 (Contributor, author) commented May 1, 2024:

No idea how to fix the failing checks... might it be due to a different pandas version on Debian?

lorentzenchr (Member) commented:

> No idea how to fix the failing checks... might it be due to a different pandas version on Debian?

You can see in the CI that Linux_Docker debian_atlas_32bit installed pandas 1.1.5. The tests also show that the pandas error

ValueError: cannot reindex from a duplicate axis

stems from _safe_assign in sklearn/utils/_indexing.py. Maybe it is related to the index of a pandas Series. You could try with the same (old) pandas version.

mayer79 (Contributor, author) commented May 1, 2024:

> You can see in the CI that Linux_Docker debian_atlas_32bit installed pandas 1.1.5. [...] You could try with the same (old) pandas version.

You are right: _safe_assign() requires pandas >= 2.0 in this application...

Here is what happens, step by step:

  1. Define a small dataset.
import pandas as pd
import numpy as np
from sklearn.utils._indexing import _safe_indexing, _safe_assign

X = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})
X

   x  y
0  1  a
1  2  b
2  3  b

  2. Take unique grid rows in a list of one or two columns, e.g., [1] (= y):
grid = _safe_indexing(X, [1], axis=1)
try:
    ax = 0 if grid.shape[1] > 1 else None  # np.unique works better in 1 dim
    _, ix, ix_reconstruct = np.unique(
        grid, return_index=True, return_inverse=True, axis=ax
    )
    grid = _safe_indexing(grid, ix, axis=0)
    compressed_grid = True
except (TypeError, np.AxisError):
    compressed_grid = False
grid 

   y
0  a
1  b

  3. Stack the original data as many times as there are rows in the grid:
X_stacked = _safe_indexing(X, np.tile(np.arange(3), 2), axis=0)
X_stacked

   x  y
0  1  a
1  2  b
2  3  b
0  1  a
1  2  b
2  3  b

  4. Repeat each row of the grid as many times as there are rows in the original data:
grid_stacked = _safe_indexing(grid, np.repeat(np.arange(2), 3), axis=0)
grid_stacked

   y
0  a
0  a
0  a
1  b
1  b
1  b

  5. Finally, in the stacked background data, replace all values in the grid columns by grid_stacked:
_safe_assign(X_stacked, values=grid_stacked, column_indexer=[1])
X_stacked

   x  y
0  1  a
1  2  a
2  3  a
0  1  b
1  2  b
2  3  b

The last step fails for pandas versions < 2, most probably because grid_stacked carries a duplicated index.

Notes

  1. The last step also fails for polars data. We might want to invest some work in _safe_assign().
  2. It would be great to have a _safe_unique() function that returns the indices and reverse indices of unique rows for numpy, pandas, and polars data. This would replace the hacky try/except block.
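For convenience, the five steps above consolidated into a single runnable script (the last assignment is what fails on pandas < 2):

import numpy as np
import pandas as pd
from sklearn.utils._indexing import _safe_indexing, _safe_assign

X = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})

# Grid of unique rows of column "y", plus the mapping back to all rows
grid = _safe_indexing(X, [1], axis=1)
_, ix, ix_reconstruct = np.unique(grid, return_index=True, return_inverse=True)
grid = _safe_indexing(grid, ix, axis=0)

# Cartesian product: stack the data once per grid row, repeat each grid row per data row
X_stacked = _safe_indexing(X, np.tile(np.arange(3), 2), axis=0)
grid_stacked = _safe_indexing(grid, np.repeat(np.arange(2), 3), axis=0)

# pandas < 2 raises "ValueError: cannot reindex from a duplicate axis" here
_safe_assign(X_stacked, values=grid_stacked, column_indexer=[1])
print(X_stacked)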

ogrisel (Member) commented May 2, 2024:

> The last step fails for pandas versions < 2, most probably because grid_stacked carries a duplicated index.

What's the traceback you get?

Could you please consolidate the snippets into a minimal reproducer for #28931?

> The last step also fails for polars data. We might want to invest some work in _safe_assign().

What is the traceback you get with polars?

mayer79 (Contributor, author) commented May 2, 2024:

@ogrisel Thanks for your assistance. I had a look at _safe_assign(): it actually supports only pandas and numpy, so no wonder it fails for polars :-).

import polars as pl
import numpy as np
from sklearn.utils._indexing import _safe_indexing, _safe_assign # 1.5dev

X = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})
_safe_assign(X, values=np.array([1, 1, 1]), column_indexer=[0])

# Traceback
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 6
      3 from sklearn.utils._indexing import _safe_indexing, _safe_assign
      5 X = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})
----> 6 _safe_assign(X, values=np.array([1, 1, 1]), column_indexer=[0])

File ~/scikit-learn/sklearn/utils/_indexing.py:303, in _safe_assign(X, values, row_indexer, column_indexer)
    301     X.iloc[row_indexer, column_indexer] = values
    302 else:  # numpy array or sparse matrix
--> 303     X[row_indexer, column_indexer] = values

File ~/scikit-learn/.venv/Lib/site-packages/polars/dataframe/frame.py:1810, in DataFrame.__setitem__(self, key, value)
   1808 else:
   1809     msg = f"unexpected column selection {col_selection!r}"
-> 1810     raise TypeError(msg)
   1812 # dispatch to __setitem__ of Series to do modification
   1813 s[row_selection] = value

TypeError: unexpected column selection [0]
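Since polars frames are not meant to be mutated in place, a polars branch would have to rebuild the selected columns and return a new frame. A hypothetical sketch (illustration only, not sklearn code), using with_columns:

import numpy as np
import polars as pl

def _assign_columns_polars(X, values, column_indexer):
    # Hypothetical helper: return a copy of X with the selected columns replaced.
    values = np.asarray(values).reshape(X.height, len(column_indexer))
    return X.with_columns(
        pl.Series(name=X.columns[i], values=values[:, k])
        for k, i in enumerate(column_indexer)
    )

X = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})
X_new = _assign_columns_polars(X, np.array([1, 1, 1]), column_indexer=[0])
print(X_new)  # column "x" is now [1, 1, 1]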

Regarding pandas, I will add a comment to #28931.

grid_stacked = _safe_indexing(grid, np.repeat(np.arange(n_grid), n), axis=0)

if hasattr(X, "iloc"):  # pandas<2 does not allow `values` to have repeated indices
    grid_stacked = grid_stacked.reset_index(drop=True)
Member review comment:

I have the feeling that those lines belong in _safe_indexing. The point of _safe_indexing is precisely to hide the pandas-specific things.

It might be worth opening a side PR with a dedicated non-regression test for #28931.

Member review comment:

Feel free to temporarily keep those lines here, but add a TODO comment that links to #28931 and/or the new side PR.

mayer79 (Contributor, author) commented May 4, 2024:

Okay! Do you have a good idea how to get rid of this still quite hacky block?

    try:
        ax = 0 if grid.shape[1] > 1 else None  # np.unique works better in 1 dim
        _, ix, ix_reconstruct = np.unique(
            grid, return_index=True, return_inverse=True, axis=ax
        )
        grid = _safe_indexing(grid, ix, axis=0)
        compressed_grid = True
    except (TypeError, np.AxisError):
        compressed_grid = False

    pd_values = _calculate_pd_brute_fast(
        pred_fun,
        X=X,
        feature_indices=feature_indices,
        grid=grid,
        sample_weight=sample_weight,
        reduce_binary=reduce_binary,
    )

    if compressed_grid:
        pd_values = pd_values[ix_reconstruct]

The grid equals one or two columns from the data. The snippet removes duplicated grid rows (to save a lot of time when calculating partial dependence for discrete features), and the last step maps the results back to the original row order.

I am thinking of something like this:

import numpy as np

from sklearn.utils._indexing import _safe_indexing

def _safe_unique(X):
    axis = None
    if isinstance(X, np.ndarray):
        groups = X.copy()
        if X.ndim > 1 and X.shape[1] > 1:
            axis = 0
    # if X is polars: do some crazy stuff
    if hasattr(X, "iloc"):
        # Encode each distinct row by a group number (robust for mixed dtypes)
        groups = X.groupby(X.columns.to_list(), sort=False, dropna=False).ngroup()

    _, ix, ix_reconstruct = np.unique(
        groups, return_index=True, return_inverse=True, axis=axis
    )
    return _safe_indexing(X, indices=ix, axis=0), ix_reconstruct
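For the toy grid from earlier, this would give:

import pandas as pd

grid = pd.DataFrame({"y": ["a", "b", "b"]})
grid_unique, ix_reconstruct = _safe_unique(grid)
# grid_unique keeps rows "a" and "b"; ix_reconstruct equals array([0, 1, 1]),
# so values computed on grid_unique expand back via values[ix_reconstruct].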
