
ENH Add Friedman's H-squared #28375

Open · mayer79 wants to merge 57 commits into main
Conversation

mayer79 (Contributor) commented Feb 6, 2024:

Reference Issues/PRs

Implements #22383

What does this implement/fix? Explain your changes.

@lorentzenchr

This PR implements a clean version of Friedman's H^2 statistic of pairwise interaction strength. It uses a couple of tricks to speed up the calculations. Still, one needs to be cautious when including more than 6-8 features. The basic strategy is to select, e.g., the top five predictors via permutation importance and then crunch the corresponding pairwise (absolute and relative) interaction strength statistics.

(My) reference implementation: https://github.com/mayer79/hstats
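For readers new to the statistic, here is a rough brute-force sketch (not this PR's implementation; h2_pairwise is just an illustrative name) built on the public partial_dependence helper. It centers the partial dependence functions and evaluates them on a shared quantile grid, which only approximates Friedman's weighting by the observed data points:

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import partial_dependence

def h2_pairwise(model, X, j, k, grid_resolution=20):
    # Univariate and bivariate partial dependence; per feature, both calls
    # use the same quantile grid, so the axes line up.
    pd_j = partial_dependence(model, X, [j], grid_resolution=grid_resolution)
    pd_k = partial_dependence(model, X, [k], grid_resolution=grid_resolution)
    pd_jk = partial_dependence(model, X, [(j, k)], grid_resolution=grid_resolution)
    # Center each partial dependence function (as in Friedman's definition).
    f_j = pd_j["average"][0] - pd_j["average"][0].mean()
    f_k = pd_k["average"][0] - pd_k["average"][0].mean()
    f_jk = pd_jk["average"][0] - pd_jk["average"][0].mean()
    # H^2: share of the joint effect's variability not captured by f_j + f_k.
    return ((f_jk - f_j[:, None] - f_k[None, :]) ** 2).sum() / (f_jk**2).sum()

X, y = make_friedman1(n_samples=500, random_state=0)
model = HistGradientBoostingRegressor(random_state=0).fit(X, y)
print(h2_pairwise(model, X, 0, 1))  # x0 and x1 interact via 10 * sin(pi * x0 * x1)
print(h2_pairwise(model, X, 0, 5))  # x5 is pure noise, so H^2 should be near 0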

Any other comments?

  • The implementation also works for multi-output or multi-class classification.
  • Plots might follow in a later PR.
  • Univariate H-statistics also exist, but I have not added them (yet). They measure the proportion of prediction variability explained only by interactions involving feature j (see the formula below). We need to keep this in mind when thinking about the output API.
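For reference on the univariate version: Friedman & Popescu (2008) compare the centered full prediction with the sum of the centered partial dependence on feature $j$ and on all other features:

$$H_j^2 = \frac{\sum_{i=1}^n \left[\hat F(x_i) - \hat F_j(x_{ij}) - \hat F_{\setminus j}(x_{i,\setminus j})\right]^2}{\sum_{i=1}^n \hat F(x_i)^2}$$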

github-actions bot commented Feb 6, 2024:

✔️ Linting Passed — all linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit 131910b.

lorentzenchr marked this pull request as draft on February 8, 2024.
lorentzenchr changed the title from "ENH Add Friedman's H-squared (WIP - DO NOT MERGE)" to "ENH Add Friedman's H-squared" on February 8, 2024.
lorentzenchr (Member) commented:

@mayer79 Thanks for working on this important inspection tool. To get rid of the linter issues, you might use a pre-commit hook; see https://scikit-learn.org/dev/developers/contributing.html#how-to-contribute.

@amueller @glemaitre @adrinjalali ping as this might interest you.

lorentzenchr (Member) left a review comment:

A first quick pass. Maybe _partial_dependence_brute can help with the tests.

(8 review threads on sklearn/inspection/_friedmans_h.py, since outdated and resolved)
lorentzenchr linked an issue on Feb 8, 2024 that may be closed by this pull request.
lorentzenchr (Member) commented Feb 9, 2024:

I keep struggling over the fact that "Friedman's H-statistic" is actually an H-squared.

The naming will pop up during further review anyway. One possibility would be h2_statistics.
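For reference, Friedman & Popescu (2008) indeed define the pairwise statistic directly on the squared scale, with mean-centered partial dependence functions:

$$H_{jk}^2 = \frac{\sum_{i=1}^n \left[\hat F_{jk}(x_{ij}, x_{ik}) - \hat F_j(x_{ij}) - \hat F_k(x_{ik})\right]^2}{\sum_{i=1}^n \hat F_{jk}(x_{ij}, x_{ik})^2}$$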

lorentzenchr (Member) left a review comment:

Another round. Review of the user guide is still missing.

(5 review threads on sklearn/inspection/_h_statistic.py and 2 on sklearn/inspection/tests/test_h_statistic.py, mostly resolved)
glemaitre self-requested a review on April 30, 2024.
mayer79 (Contributor, author) commented May 1, 2024:

No idea how to fix the failing checks... might it be due to a different pandas version on Debian?

lorentzenchr (Member) commented:

> No idea how to fix the failing checks... might it be due to a different pandas version on Debian?

You can see in the CI that Linux_Docker debian_atlas_32bit installed pandas 1.1.5. The tests also show that the pandas error

ValueError: cannot reindex from a duplicate axis

stems from _safe_assign in sklearn/utils/_indexing.py. Maybe it is related to the index of a pandas Series. You could try with the same (old) pandas version.

mayer79 (Contributor, author) commented May 1, 2024:

> You can see in the CI that Linux_Docker debian_atlas_32bit installed pandas 1.1.5. [...] You could try with the same (old) pandas version.

You are right: _safe_assign() requires pandas >= 2.0 in this application...

Here is what happens, step by step:

  1. Define a small dataset.
import pandas as pd
import numpy as np
from sklearn.utils._indexing import _safe_indexing, _safe_assign

X = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})
X

   x  y
0  1  a
1  2  b
2  3  b

  2. Take unique grid rows in a list of one or two columns, e.g., [1] (= y):
grid = _safe_indexing(X, [1], axis=1)
try:
    ax = 0 if grid.shape[1] > 1 else None  # np.unique works better in 1 dim
    _, ix, ix_reconstruct = np.unique(
        grid, return_index=True, return_inverse=True, axis=ax
    )
    grid = _safe_indexing(grid, ix, axis=0)
    compressed_grid = True
except (TypeError, np.AxisError):
    compressed_grid = False
grid 

   y
0  a
1  b

  3. Stack the original data as many times as there are rows in the grid:
X_stacked = _safe_indexing(X, np.tile(np.arange(3), 2), axis=0)
X_stacked

   x  y
0  1  a
1  2  b
2  3  b
0  1  a
1  2  b
2  3  b

  4. Repeat each row of the grid as many times as there are rows in the original data:
grid_stacked = _safe_indexing(grid, np.repeat(np.arange(2), 3), axis=0)
grid_stacked

   y
0  a
0  a
0  a
1  b
1  b
1  b

  5. Finally, in the stacked background data, replace all values in the grid columns by grid_stacked:
_safe_assign(X_stacked, values=grid_stacked, column_indexer=[1])
X_stacked

   x  y
0  1  a
1  2  a
2  3  a
0  1  b
1  2  b
2  3  b

The last step fails for pandas versions < 2, most probably because grid_stacked carries a duplicated index.

Notes

  1. The last step also fails for polars data. We might want to invest some work in _safe_assign().
  2. It would be great to have a _safe_unique() function that returns the indices and reverse indices of unique rows for numpy, pandas, and polars data. This would replace the hacky try/except block.
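For convenience, the five steps above consolidated into a single runnable script (the last assignment is what fails on pandas < 2):

import numpy as np
import pandas as pd
from sklearn.utils._indexing import _safe_indexing, _safe_assign

X = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})

# Grid of unique rows of column "y", plus the mapping back to all rows
grid = _safe_indexing(X, [1], axis=1)
_, ix, ix_reconstruct = np.unique(grid, return_index=True, return_inverse=True)
grid = _safe_indexing(grid, ix, axis=0)

# Cartesian product: stack the data once per grid row, repeat each grid row per data row
X_stacked = _safe_indexing(X, np.tile(np.arange(3), 2), axis=0)
grid_stacked = _safe_indexing(grid, np.repeat(np.arange(2), 3), axis=0)

# pandas < 2 raises "ValueError: cannot reindex from a duplicate axis" here
_safe_assign(X_stacked, values=grid_stacked, column_indexer=[1])
print(X_stacked)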

ogrisel (Member) commented May 2, 2024:

> The last step fails for pandas versions < 2, most probably because grid_stacked carries a duplicated index.

What's the traceback you get?

Could you please consolidate the snippets into a minimal reproducer for #28931?

> The last step also fails for polars data. We might want to invest some work in _safe_assign().

What is the traceback you get with polars?

mayer79 (Contributor, author) commented May 2, 2024:

@ogrisel Thanks for your assistance. I had a look at _safe_assign(): it actually supports only pandas and numpy, so no wonder it fails for polars :-).

import polars as pl
import numpy as np
from sklearn.utils._indexing import _safe_indexing, _safe_assign # 1.5dev

X = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})
_safe_assign(X, values=np.array([1, 1, 1]), column_indexer=[0])

# Traceback
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 6
      3 from sklearn.utils._indexing import _safe_indexing, _safe_assign
      5 X = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})
----> 6 _safe_assign(X, values=np.array([1, 1, 1]), column_indexer=[0])

File ~/scikit-learn/sklearn/utils/_indexing.py:303, in _safe_assign(X, values, row_indexer, column_indexer)
    301     X.iloc[row_indexer, column_indexer] = values
    302 else:  # numpy array or sparse matrix
--> 303     X[row_indexer, column_indexer] = values

File ~/scikit-learn/.venv/Lib/site-packages/polars/dataframe/frame.py:1810, in DataFrame.__setitem__(self, key, value)
   1808 else:
   1809     msg = f"unexpected column selection {col_selection!r}"
-> 1810     raise TypeError(msg)
   1812 # dispatch to __setitem__ of Series to do modification
   1813 s[row_selection] = value

TypeError: unexpected column selection [0]
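Since polars frames are not meant to be mutated in place, a polars branch would have to rebuild the selected columns and return a new frame. A hypothetical sketch (illustration only, not sklearn code), using with_columns:

import numpy as np
import polars as pl

def _assign_columns_polars(X, values, column_indexer):
    # Hypothetical helper: return a copy of X with the selected columns replaced.
    values = np.asarray(values).reshape(X.height, len(column_indexer))
    return X.with_columns(
        pl.Series(name=X.columns[i], values=values[:, k])
        for k, i in enumerate(column_indexer)
    )

X = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})
X_new = _assign_columns_polars(X, np.array([1, 1, 1]), column_indexer=[0])
print(X_new)  # column "x" is now [1, 1, 1]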

Regarding pandas, I will add a comment to #28931.

grid_stacked = _safe_indexing(grid, np.repeat(np.arange(n_grid), n), axis=0)

if hasattr(X, "iloc"):  # pandas<2 does not allow `values` to have repeated indices
    grid_stacked = grid_stacked.reset_index(drop=True)
Member review comment:

I have the feeling that those lines belong in _safe_indexing. The point of _safe_indexing is precisely to hide the pandas-specific things.

It might be worth opening a side PR with a dedicated non-regression test for #28931.

Member review comment:

Feel free to temporarily keep those lines here, but add a TODO comment that links to #28931 and/or the new side PR.

mayer79 (Contributor, author) commented May 4, 2024:

Okay! Do you have a good idea how to get rid of this still quite hacky block?

    try:
        ax = 0 if grid.shape[1] > 1 else None  # np.unique works better in 1 dim
        _, ix, ix_reconstruct = np.unique(
            grid, return_index=True, return_inverse=True, axis=ax
        )
        grid = _safe_indexing(grid, ix, axis=0)
        compressed_grid = True
    except (TypeError, np.AxisError):
        compressed_grid = False

    pd_values = _calculate_pd_brute_fast(
        pred_fun,
        X=X,
        feature_indices=feature_indices,
        grid=grid,
        sample_weight=sample_weight,
        reduce_binary=reduce_binary,
    )

    if compressed_grid:
        pd_values = pd_values[ix_reconstruct]

The grid equals one or two columns from the data. The snippet removes duplicated grid rows (to save a lot of time when calculating partial dependence for discrete features), and the last step maps the results back to the original row order.

I am thinking of something like this:

import numpy as np

from sklearn.utils._indexing import _safe_indexing

def _safe_unique(X):
    axis = None
    if isinstance(X, np.ndarray):
        groups = X.copy()
        if X.ndim > 1 and X.shape[1] > 1:
            axis = 0
    # if X is polars: do some crazy stuff
    if hasattr(X, "iloc"):
        # Encode each distinct row by a group number (robust for mixed dtypes)
        groups = X.groupby(X.columns.to_list(), sort=False, dropna=False).ngroup()

    _, ix, ix_reconstruct = np.unique(
        groups, return_index=True, return_inverse=True, axis=axis
    )
    return _safe_indexing(X, indices=ix, axis=0), ix_reconstruct
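For the toy grid from earlier, this would give:

import pandas as pd

grid = pd.DataFrame({"y": ["a", "b", "b"]})
grid_unique, ix_reconstruct = _safe_unique(grid)
# grid_unique keeps rows "a" and "b"; ix_reconstruct equals array([0, 1, 1]),
# so values computed on grid_unique expand back via values[ix_reconstruct].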
