
improve transparency concerning background samples #3650

Open
wants to merge 1 commit into master

Conversation

Collaborator
@CloseChoice CloseChoice commented May 11, 2024

Overview

Closes #3461

Description of the changes proposed in this pull request:

Checklist

  • All pre-commit checks pass.
  • Unit tests added (if fixing a bug or adding a new feature)

@CloseChoice CloseChoice marked this pull request as ready for review May 11, 2024 19:46
if masker_length is not None and hasattr(self.masker, 'max_samples') and masker_length > self.masker.max_samples:
    logging.warn(f"Your background dataset is larger than the max_samples {self.masker.max_samples}. You can hand over max_samples by explicitly "
                 "defining the masker to use: "
                 "`shap.Explainer(model, masker=maskers.Independent(X, max_samples=1000))`")
Collaborator

A few suggestions:

  1. This seems like a possibly odd place for the warning to be raised. Would it be more appropriate to raise it in the masker itself, where the subsampling actually occurs, for example in Tabular.__init__()? I think this would be a bit tidier.

https://github.com/CloseChoice/shap/blob/b23058409eb05f5cb9e7e5cc7c59ea22b38a5695/shap/maskers/_tabular.py#L57-L58

  2. It could help to adopt the common pattern of using a module-specific logger, e.g. log = logging.getLogger(__name__)

  3. I believe .warn is deprecated in favour of .warning.

  4. I think the warning message could perhaps explain the consequences of having more samples (i.e. that the dataset will just be subsampled). Otherwise, this warning might be taken as an error that something is wrong and needs to be fixed by the user. E.g.:

log.warning(
    f"The background dataset exceeds the masker's `max_samples`, "
    f"so will be subsampled from {n} rows to {max_samples} rows."
)
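
For illustration, a rough sketch of how points 2-4 might look if combined inside the masker; the helper name, the NumPy-based sampling, and the assumption that data is a NumPy array are all hypothetical, not the actual shap implementation:

    import logging

    import numpy as np

    log = logging.getLogger(__name__)  # module-specific logger (point 2)

    def subsample_background(data, max_samples=100, random_state=0):
        # Hypothetical helper: use .warning (point 3) and spell out the consequence
        # (point 4) before subsampling an oversized background dataset.
        if hasattr(data, "shape") and data.shape[0] > max_samples:
            log.warning(
                f"The background dataset exceeds the masker's `max_samples`, "
                f"so it will be subsampled from {data.shape[0]} rows to {max_samples} rows."
            )
            rng = np.random.default_rng(random_state)
            idx = rng.choice(data.shape[0], size=max_samples, replace=False)
            data = data[idx]  # assumes a NumPy array; a DataFrame would need .iloc[idx]
        return data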

Collaborator Author

I agree that it's an odd place, but with this approach we do not need to repeat code in the maskers themselves and instead have one place where this check happens. Would you rather duplicate the code in the maskers (I believe we currently have 2 maskers with the max_samples keyword)?

As for the other points I totally agree and will implement these when we've settled the point above.

Collaborator

Good point. On balance I'd probably prefer to log a message in both maskers. That way, the same warning is raised whether the old shap_values or the new Explainer interface is used.

One other thing actually, going back to the discussion on #3461: I think there is a risk that excessive logging could be annoying for most users. We had originally proposed logging.INFO, but this uses logging.WARNING, and I think the latter will display in an IPython or Jupyter output by default.

Is this desirable? IMO the subsampling is pretty usual behaviour, so "INFO" feels more appropriate. Subsampling will probably be invoked the majority of the time, so it seems excessive to have a warning message visible on most runs of the package! I would think a visible warning should only show for things that are unusual, unintended, or require user attention.
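
For context on how the two levels behave for an end user (assuming shap's modules use logging.getLogger(__name__), so their loggers sit under the "shap" namespace):

    import logging

    # A WARNING-level record is printed to stderr by default (including in Jupyter),
    # even if the user never configures logging; silencing it requires opting out:
    logging.getLogger("shap").setLevel(logging.ERROR)

    # An INFO-level record is hidden by default; a user who wants the detail opts in:
    logging.basicConfig(level=logging.INFO)  # attach a root handler and lower the threshold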

Collaborator Author

To be honest, it might be that our behaviour is standard, but I still think it's bad practice to change user behaviour without at least an info message (I would even prefer a warning, and think it would be better if the user manually configured the correct max_samples when the provided data exceeds the default).
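
For reference, the explicit configuration the proposed warning points users to would look roughly like this (model, X_background and X_test are placeholder names; 1000 is only an example cap):

    import shap

    # Explicitly choose how many background rows the masker may keep,
    # instead of silently falling back to the default of 100.
    masker = shap.maskers.Independent(X_background, max_samples=1000)
    explainer = shap.Explainer(model, masker=masker)
    shap_values = explainer(X_test)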

Collaborator

it's bad practice to change user behaviour without ...

Please would you elaborate on this point: does the subsampling code "change user behaviour"?

@rozeni89 rozeni89 May 30, 2024

How does one use a background dataset without it being sampled down? It seems that no matter what I do, the background dataset gets subsampled.

When looking at the code (_tree.py), almost the very first lines (starting at line 166) are

    masker = data
    super().__init__(model, masker, feature_names=feature_names)

    if type(self.masker) is maskers.Independent:
        data = self.masker.data
    elif masker is not None:
        raise InvalidMaskerError(f"Unsupported masker type: {str(type(self.masker))}!")

The second line subsamples the 'masker' to 100 rows, which then get assigned to the data. This later gets assigned to self.data (line 182), which is then used to generate the baseline on line 229:

     self.expected_value = self.model.predict(self.data).mean(0)

Is this the intended behavior or a bug?

Collaborator Author

A workaround is given in the warning on the very line you are commenting on. Isn't that working for you?

@rozeni89 rozeni89 May 31, 2024

No, it isn't. I am using "shap.TreeExplainer" instead of "shap.Explainer". Does it make a difference? I've tried many different things and nothing seems to work. The line with "super()." seems to downsample the background data no matter what you do. The default value of 100 for "max_samples" ends up being used every time.

I think the reason is the statement on line 56 of "_tabular.py"

    if hasattr(data, "shape") and data.shape[0] > max_samples:
        data = utils.sample(data, max_samples)

Unless "max_sample" can somehow be specified by user, which I don't see how, this statement will always execute with default which is 100.

Collaborator Author

It should work the same for the tree explainer as shown in the code snippet. You will get the warning

            f"Passing {self.data.shape[0]} background samples may lead to slow runtimes. Consider "
            "using shap.sample(data, 100) to create a smaller background data set."

but nothing should be subsampled. If not, could you please provide a minimal reproducible example and create a separate ticket for that?
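
For completeness, a minimal sketch of the same workaround with the tree explainer, assuming (as in the _tree.py snippet quoted above) that an Independent masker can be passed where the background data normally goes; model, X_background and X_test are placeholders:

    import shap

    # Keep every background row by setting max_samples to the dataset size,
    # so the masker never subsamples.
    masker = shap.maskers.Independent(X_background, max_samples=X_background.shape[0])
    explainer = shap.TreeExplainer(model, data=masker)
    shap_values = explainer.shap_values(X_test)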

@rozeni89

Please see #3680


Successfully merging this pull request may close these issues.

ENH: increase transparency of background dataset sub-sampling
3 participants