ENH: increase transparency of background dataset sub-sampling #3461

Open · 3 of 4 tasks
jyliuu opened this issue Jan 18, 2024 · 6 comments · May be fixed by #3650
Labels
enhancement (Indicates new feature requests), good first issue (This is a fix that might be easier for someone to do as a first contribution)

Comments

jyliuu commented Jan 18, 2024

Issue Description

Given a sample $x$ that we wish to explain, we can compute its Shapley values with respect to a single background sample $x^b$. When the Explainer class is given background data, it should compute the Shapley values against each sample in the background data and then take the average, which is an approximation of the interventional SHAP values.

This averaging means that if I split my background data into two halves, A and B, I should be able to call the explainer on A and on B to obtain the averaged SHAP values a and b for each half. Taking (a + b)/2 should then equal the result of calling SHAP on the entire background dataset.
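In symbols, writing $\phi_j(x; D)$ for the interventional SHAP value of feature $j$ for sample $x$ computed against background dataset $D$, and splitting $D$ into two disjoint, equally sized halves $A$ and $B$, the identity I expect is

$$
\phi_j(x; D) = \frac{1}{|D|}\sum_{x^b \in D} \phi_j(x; x^b) = \frac{1}{2}\bigl(\phi_j(x; A) + \phi_j(x; B)\bigr)
$$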

From my experimentation, it seems that once the background dataset exceeds 100 samples this becomes inconsistent, i.e. (a + b)/2 no longer equals the interventional approximation computed on the entire background dataset. The identity does hold for background datasets of 100 samples or fewer.

Minimal Reproducible Example

import numpy as np
import pandas as pd
import xgboost

import shap

rng = np.random.default_rng(42)
N = 1000
M = 2

X = rng.standard_normal(size=(N, M))
X[:, 0] = 0.2*X[:, 1] + X[:, 0]
y = -2*X[:, 0] + X[:, 1] + 0.5*X[:, 0]*X[:, 1]

X = pd.DataFrame(X, columns=["X1", "X2"])


model = xgboost.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=3)
model.fit(X, y)


def get_shap_values(model, X, sample):
    explainer = shap.TreeExplainer(
        model,
        X,
        feature_perturbation="interventional",
    )
    explanation = explainer(sample)

    expected_value = explanation.base_values[0]
    shap_values = explanation.values[0]
    return shap_values, expected_value

# Consistent when the background data has 100 or fewer samples


for i in range(50, 53):  # i is the number of samples in each half
    midpoint = i
    double_mid = midpoint * 2
    # shap on two halves
    shap_values1, expected_value1 = get_shap_values(model, X.loc[1:midpoint, :], X.loc[[0], :])
    shap_values2, expected_value2 = get_shap_values(model, X.loc[(midpoint+1):double_mid, :], X.loc[[0], :])
    # Shap on full background data
    shap_values, expected_value = get_shap_values(model, X.loc[1:double_mid, :], X.loc[[0], :])

    print(len(X.loc[1:midpoint, :]), len(X.loc[(midpoint+1):double_mid, :]), len(X.loc[1:double_mid, :]))
    print(shap_values, (shap_values1 + shap_values2) / 2) # inconsistent here when i > 50

Traceback

No response

Expected Behavior

In the for loop, shap_values should equal (shap_values1 + shap_values2) / 2 for every i.

Bug report checklist

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest release of shap.
  • I have confirmed this bug exists on the master branch of shap.
  • I'd be interested in making a PR to fix this bug

Installed Versions

0.44.0

jyliuu added the bug label on Jan 18, 2024
CloseChoice (Collaborator) commented Jan 20, 2024

Thanks for the report and your effort to investigate this. Your description is absolutely accurate; the reason for this is the default max_samples of the tabular masker, which sub-samples background datasets larger than 100 rows.

Here is an issue where this problem was already discussed including workaround: #3174.

We probably should throw at least a warning if max_samples < len(X). What do you think, @connortann? This issue keeps coming up and confuses users.
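For reference, the workaround discussed in #3174 boils down to constructing the background masker yourself with a larger max_samples, so that no sub-sampling takes place. A minimal sketch, reusing the model and data from the example above and assuming your shap version accepts a maskers.Independent instance in place of the raw DataFrame:

import shap

# Sketch: set max_samples to the full background size so the masker does not
# sub-sample it (by default the background is capped, which triggers sampling).
background = X.loc[1:double_mid, :]
masker = shap.maskers.Independent(background, max_samples=len(background))

explainer = shap.TreeExplainer(model, masker, feature_perturbation="interventional")
explanation = explainer(X.loc[[0], :])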

connortann (Collaborator) commented Jan 23, 2024

I agree with your analysis; this seems to be a consequence of sampling. I'll remove the bug label, as I think this is intended behaviour.

We probably should throw at least a warning if max_samples < len(X)

I'm not sure if I agree. To me, warnings are generally used to indicate undesirable situations in which the user should probably update their code to fix the warning. In this case I think for the majority of users the subsampling is expected and desirable behaviour. Many parts of shap are sampling-based and only offer approximate results.

Would log.info() be more appropriate?

connortann added the enhancement and question labels and removed the bug label on Jan 23, 2024
CloseChoice (Collaborator) commented:
logging.info is fine with me. I would also be fine with a print, just to make sure that users do not have to spend a couple of hours investigating to find the reason for the inconsistency between the values and the theory.

connortann changed the title from "BUG: Computation of interventional SHAP is inconsistent with theory" to "ENH: increase transparency of background dataset sub-sampling" on Jan 24, 2024
connortann added this to the 0.45.0 milestone on Jan 24, 2024
connortann (Collaborator) commented:
I would much prefer logging over print statements, as prints are much harder to configure and disable. I think adding a print would risk annoying a large majority of shap users.

I've renamed the title accordingly to reflect the plan.
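For illustration, this is roughly what the configurability argument looks like from the user's side. A sketch only: the logger name "shap" is an assumption here, not shap's actual implementation.

import logging

# Assumed sketch: if shap emitted the sub-sampling notice via logging.info on a
# "shap" logger, users could surface or silence it without touching library code.
logging.basicConfig(level=logging.INFO)              # show informational messages
logging.getLogger("shap").setLevel(logging.WARNING)  # or silence shap's messages alone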

CloseChoice added the good first issue label and removed the question label on Feb 15, 2024
connortann removed this from the 0.45.0 milestone on Mar 6, 2024
jcoding2022 commented:
I am also confused about the background dataset and would like to ask a follow-up question, if I may.

Suppose I use shap.TreeExplainer to explain predictions from my LightGBM model for a classification task. I am interested in model_output="probability", so according to the documentation I need to set feature_perturbation="interventional" and specify a background dataset. Given that I have training, validation, and test data, which of them should I draw the background dataset from? The documentation says that "anywhere from 100 to 1000 random background samples are good sizes to use"; how should I pick those samples? Should I fix the random samples so that the background dataset stays the same regardless of which dataset (train, validation, test) I am explaining?

CloseChoice (Collaborator) commented:
This is not strictly on topic, so if you have follow-up questions to my answer, please open a discussion or search for one of the topics where this has already been discussed.

First, I do not believe there is a definitive answer to your question; there is no real backtesting one can do for SHAP values. So one just has to take various considerations into account:

  • Do you want deterministic SHAP values? If so, fixing the background dataset makes sense.
  • The background dataset is just used to calculate the baseline, so any size at which this average is seen to converge is sufficiently large. You can test whether this is the case by keeping the dataset to explain constant, changing the background dataset, and checking how large the differences in the SHAP values (or, even simpler, just in the expected value) are; a sketch of this check follows below. For i.i.d. sampling and a sufficiently diverse background dataset, 100 to 1000 samples should suffice, and I wouldn't expect much difference between train, test, or validation. If there is a noticeable difference, I would rather check whether your split is chosen correctly.
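A minimal sketch of that convergence check, assuming a fitted model, a training frame X_train to draw backgrounds from, and a fixed frame X_explain to explain (these names come from your setup, not from shap's API):

import numpy as np
import shap

# Sketch: keep the samples to explain fixed, resample the background several
# times, and look at the spread of the resulting expected values.
base_values = []
for seed in range(5):
    bg = X_train.sample(100, random_state=seed)  # 100 rows, so no sub-sampling kicks in
    explainer = shap.TreeExplainer(model, bg, feature_perturbation="interventional")
    base_values.append(float(np.mean(explainer(X_explain).base_values)))

# A small spread across resampled backgrounds suggests the background size is
# sufficient; a large spread suggests using more (or more diverse) samples.
print(np.ptp(base_values))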

CloseChoice linked a pull request (#3650) on May 11, 2024 that will close this issue