New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Add Friedman's H-squared #28375
base: main
Are you sure you want to change the base?
ENH Add Friedman's H-squared #28375
Conversation
@mayer79 Thanks for working on this important inspection tool. To get rid of the linter issues, you might use a pre-commit hook, see https://scikit-learn.org/dev/developers/contributing.html#how-to-contribute. @amueller @glemaitre @adrinjalali ping as this might interest you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A first quick pass. Maybe, _partial_dependence_brute
can help with the tests.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
The naming will pop up during further review anyway. One possibility would be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another round. Review of user guide is still missing.
No idea how to fix the failing checks... might it be due to some different pandas on Debian? |
You can see in the CI that Linux_Docker debian_atlas_32bit installed pandas 1.1.5. The tests also tell that the error from pandas
stems from |
You are right: What magic things happen?
import pandas as pd
import numpy as np
from sklearn.utils._indexing import _safe_indexing, _safe_assign
X = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})
X
grid = _safe_indexing(X, [1], axis=1)
try:
ax = 0 if grid.shape[1] > 1 else None # np.unique works better in 1 dim
_, ix, ix_reconstruct = np.unique(
grid, return_index=True, return_inverse=True, axis=ax
)
grid = _safe_indexing(grid, ix, axis=0)
compressed_grid = True
except (TypeError, np.AxisError):
compressed_grid = False
grid
X_stacked = _safe_indexing(X, np.tile(np.arange(3), 2), axis=0)
X_stacked
grid_stacked = _safe_indexing(grid, np.repeat(np.arange(2), 3), axis=0)
grid_stacked
_safe_assign(X_stacked, values=grid_stacked, column_indexer=[1])
X_stacked The last step fails for pandas versions < 2, most probably because the column name already exists. Notes
|
What's the traceback you get? Could you please consolidate the snippets into a minimal reproducer for #28931?
What is the traceback you get with polars? |
@ogrisel Thanks for your assisstance. I had a look at _safe_assign(): It actually supports only pandas and numpy. So no wonder it fails for polars :-). import polars as pl
import numpy as np
from sklearn.utils._indexing import _safe_indexing, _safe_assign # 1.5dev
X = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})
_safe_assign(X, values=np.array([1, 1, 1]), column_indexer=[0])
# Traceback
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[4], [line 6](vscode-notebook-cell:?execution_count=4&line=6)
[3](vscode-notebook-cell:?execution_count=4&line=3) from sklearn.utils._indexing import _safe_indexing, _safe_assign
[5](vscode-notebook-cell:?execution_count=4&line=5) X = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "b"]})
----> [6](vscode-notebook-cell:?execution_count=4&line=6) _safe_assign(X, values=np.array([1, 1, 1]), column_indexer=[0])
File [~\scikit-learn\sklearn\utils\_indexing.py:303](https://file+.vscode-resource.vscode-cdn.net/c%3A/Users/Michael/Desktop/~/scikit-learn/sklearn/utils/_indexing.py:303), in _safe_assign(X, values, row_indexer, column_indexer)
[301](https://file+.vscode-resource.vscode-cdn.net/c%3A/Users/Michael/Desktop/~/scikit-learn/sklearn/utils/_indexing.py:301) X.iloc[row_indexer, column_indexer] = values
[302](https://file+.vscode-resource.vscode-cdn.net/c%3A/Users/Michael/Desktop/~/scikit-learn/sklearn/utils/_indexing.py:302) else: # numpy array or sparse matrix
--> [303](https://file+.vscode-resource.vscode-cdn.net/c%3A/Users/Michael/Desktop/~/scikit-learn/sklearn/utils/_indexing.py:303) X[row_indexer, column_indexer] = values
File [c:\Users\Michael\scikit-learn\.venv\Lib\site-packages\polars\dataframe\frame.py:1810](file:///C:/Users/Michael/scikit-learn/.venv/Lib/site-packages/polars/dataframe/frame.py:1810), in DataFrame.__setitem__(self, key, value)
[1808](file:///C:/Users/Michael/scikit-learn/.venv/Lib/site-packages/polars/dataframe/frame.py:1808) else:
[1809](file:///C:/Users/Michael/scikit-learn/.venv/Lib/site-packages/polars/dataframe/frame.py:1809) msg = f"unexpected column selection {col_selection!r}"
-> [1810](file:///C:/Users/Michael/scikit-learn/.venv/Lib/site-packages/polars/dataframe/frame.py:1810) raise TypeError(msg)
[1812](file:///C:/Users/Michael/scikit-learn/.venv/Lib/site-packages/polars/dataframe/frame.py:1812) # dispatch to __setitem__ of Series to do modification
[1813](file:///C:/Users/Michael/scikit-learn/.venv/Lib/site-packages/polars/dataframe/frame.py:1813) s[row_selection] = value
TypeError: unexpected column selection [0] Regarding pandas, I will add a comment to #28931 |
grid_stacked = _safe_indexing(grid, np.repeat(np.arange(n_grid), n), axis=0) | ||
|
||
if hasattr(X, "iloc"): # pandas<2 does not allow "values" to have repeated indices | ||
grid_stacked = grid_stacked.reset_index(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the feeling that those lines belong to _safe_indexing
. The point of _safe_indexing
is precisely to hide the pandas-specific things.
It might be worth opening in a side PR with a dedicated non-regression test for #28931.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to keep temporarily keep those lines here but with a TODO comment that links to #28931 and/or the new side PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay! Do you have a good idea how to get rid of the still quite hacky
try:
ax = 0 if grid.shape[1] > 1 else None # np.unique works better in 1 dim
_, ix, ix_reconstruct = np.unique(
grid, return_index=True, return_inverse=True, axis=ax
)
grid = _safe_indexing(grid, ix, axis=0)
compressed_grid = True
except (TypeError, np.AxisError):
compressed_grid = False
pd_values = _calculate_pd_brute_fast(
pred_fun,
X=X,
feature_indices=feature_indices,
grid=grid,
sample_weight=sample_weight,
reduce_binary=reduce_binary,
)
if compressed_grid:
pd_values = pd_values[ix_reconstruct]
The grid equals one or two columns from the data. The snippet tries to remove duplicated rows (to save a lot of time in calculating partial dependence for discrete features). The last step merges the result back to the original row order.
I am thinking of something like this:
from sklearn.utils._indexing import _safe_indexing
def _safe_unique(X):
axis=None
if isinstance(X, np.ndarray):
groups = X.copy()
if len(X.shape) > 1 and X.shape[1] > 1:
axis = 0
# if is polars: do some crazy stuff
if hasattr(X, "iloc"):
groups = X.groupby(X.columns.to_list(), sort=False, dropna=False).ngroup()
_, ix, ix_reconstruct = np.unique(
groups, return_index=True, return_inverse=True, axis=axis
)
return(_safe_indexing(X, indices=ix, axis=0), ix_reconstruct)
Reference Issues/PRs
Implements #22383
What does this implement/fix? Explain your changes.
@lorentzenchr
This PR implements a clean version of Friedman's H^2 statistic of pairwise interaction strength. It uses a couple of tricks to speed up the calculations. Still, one needs to be cautious when adding more than 6-8 features. The basic strategy is to select e.g. the top 5 predictors via permutation importance and then crunch the corresponding pairwise (absolute and relative) interaction strength statistics.
(My) reference implementation: https://github.com/mayer79/hstats
Any other comments?