
BUG(?) Missing-values in RandomForest only during inference time shouldn't send missing-values to the child with most samples #28772

Open
adam2392 opened this issue Apr 4, 2024 · 8 comments
Labels: Needs Investigation

Comments

@adam2392 (Contributor) commented Apr 4, 2024

Currently, when missing values occur only in the test data given to a fitted RandomForest (i.e. none were seen during training), the policy is to send them to the child node with the most samples. In effect, this imputes the missing value using the data in that larger child. The issue is that this may bias the tree's predictions towards whichever class had more samples in the training data.

For example, say there are 1000 training samples of class 1 and 10 training samples of class 0, and some NaNs then appear at test time. The predictions would be biased towards class 1, whereas they should really be uninformative, because NaNs that appear only at test time are treated as missing completely at random.
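
For concreteness, here is a minimal sketch of the imbalanced setting described above (assuming scikit-learn >= 1.4, where RandomForestClassifier accepts NaN in the input at predict time; the exact probabilities in the output will depend on the data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Imbalanced training set: 1000 samples of class 1 and 10 of class 0,
# with no missing values at fit time.
X_train = np.vstack([
    rng.normal(1.0, 1.0, size=(1000, 3)),
    rng.normal(-1.0, 1.0, size=(10, 3)),
])
y_train = np.array([1] * 1000 + [0] * 10)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# A fully-missing test sample: under the current policy it is routed to the
# child with the most samples at every split, so the predicted probabilities
# are expected to lean heavily towards class 1.
X_test = np.full((1, 3), np.nan)
print(clf.predict_proba(X_test))
```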

Proposed Solution

An alternative, and arguably more sensible, strategy is that when NaNs are not encountered during training but show up in the test data, they should be sent stochastically down the tree using the weights (see the sketch after this list):

  • p_left_child = n_left_samples / (n_left_samples + n_right_samples)
  • p_right_child = n_right_samples / (n_left_samples + n_right_samples)

This ensures that there is no bias towards the class with more samples. It can be implemented by allowing `missing_go_to_left` (currently declared as `unsigned char missing_go_to_left  # Controls if missing values go to the left node.`) to take the value 2. A value of 2 would mean that no missing values were observed for this split during training, so the routing should be decided stochastically.
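
A minimal Python sketch of the proposed routing rule; the node attributes used here (left, right, n_node_samples, missing_go_to_left) are illustrative stand-ins, not the actual Cython tree fields:

```python
import numpy as np

def route_missing_left(node, rng):
    """Decide whether a sample with a missing value goes to the left child."""
    if node.missing_go_to_left == 2:
        # No missing values were seen for this split during training:
        # route stochastically, weighted by how many training samples
        # ended up in each child.
        n_left = node.left.n_node_samples
        n_right = node.right.n_node_samples
        p_left = n_left / (n_left + n_right)
        return rng.random() < p_left
    # 0 or 1: deterministic routing learned at fit time.
    return bool(node.missing_go_to_left)

# Usage at prediction time (node is a hypothetical tree node object):
# rng = np.random.default_rng(0)
# go_left = route_missing_left(node, rng)
```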

Overall, it's a very simple change, and I can also implement the relevant unit tests.

cc: @thomasjpfan, who implemented the original missing-value support in RandomForest.

Related

xref: This policy will also impact #27966 and #28268

This is also an issue in other estimators that handle NaNs: https://scikit-learn.org/stable/modules/ensemble.html#missing-values-support

@github-actions bot added the Needs Triage label on Apr 4, 2024
@thomasjpfan (Member)

For random forest, I used the same behavior as in HistGradientBoosting, which adopted the behavior from LightGBM. @NicolasHug do you have more context on the missing-value behavior for HistGradientBoosting?

@adam2392 (Contributor, Author) commented Apr 8, 2024

If it helps the discussion, I found this old issue in LightGBM, which seems to reflect their docs (I'm unsure because I can't find a specific line mentioning exactly how they treat this edge case).

To summarize, the old issue says missing values at test time (that were not seen during training) are replaced with '0'. This seems like a reasonable imputation if the feature is 0-centered, but a bit odd overall. The stochastic treatment in the proposed solution would:

  1. still behave similarly to replacing with 0's when the data is 0-centered, and
  2. (more importantly) work better than replacing with 0's when the data is not 0-centered, because of the issues outlined in the issue description.

Let me know if there's some analysis I'm missing with respect to LightGBM on why this behavior is the way it is.

@glemaitre (Member)

Is there any overhead from the stochastic process? For instance, in the case of tie-breaking in nearest-neighbors, we did not use a stochastic process because it would have been detrimental in terms of computing performance (a huge regression).

@adam2392 (Contributor, Author) commented Apr 8, 2024

I imagine there is some overhead due to potentially querying an RNG many times. Though this would itself be an edge case, since it implies there are many missing values at test time that were not seen during training. That is, the only time we would stochastically route left/right is when a split uses a feature for which no missing values were encountered at that node during training. I personally think this edge case is less worrisome than the bias point I raised, but it is of course plausible in practice.

I think the situation you describe is a good one, though, and makes the point that things may be more complex; it could presumably occur in practice. Some ideas:

  1. Treat it as an edge case, accept that it would not come up as often as tie-breaking in nearest-neighbors, and simply allow stochastic routing of the test data.
  2. Allow the user to specify this behavior at training time, so that missing_go_to_left can be set accordingly when no missing values are encountered.
  3. Allow the user to specify this behavior at prediction time. I'm unsure how best to implement this, so I would have to take a look at the code.

@betatim (Member) commented Apr 9, 2024

If it is easy to implement the "make a weighted random choice" option (what you propose), then we could run a benchmark to see what the performance hit is for different cases (many missing values in many features, many missing values in a few features, few missing values in few features, etc.). That would give us a way to decide whether performance is something to worry about when making a decision.
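
A rough benchmark sketch along these lines (illustrative only; assumes scikit-learn >= 1.4, where RandomForestClassifier accepts NaN at predict time):

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
for frac in (0.0, 0.01, 0.1, 0.5):
    # Inject a growing fraction of missing values into the test data.
    X_test = X.copy()
    mask = rng.random(X_test.shape) < frac
    X_test[mask] = np.nan
    tic = time.perf_counter()
    clf.predict(X_test)
    print(f"missing fraction {frac:.2f}: {time.perf_counter() - tic:.3f} s")
```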

@glemaitre (Member)

In terms of application use cases, I'm also wondering whether we should error/warn if a user provides missing values at test time when we did not see any during training. @adam2392 do you have a legitimate use case where this is kind of normal?

@adam2392 (Contributor, Author)

I think a warning would be nice, but an error might be overkill: when a NaN pops up in the test data (but not in training), ideally we would like to ignore its effect. If we errored out, the user would need to fill in the NaN value or drop the sample altogether.

On the other hand, the issue right now is that by sending the NaN down the child with more samples, we bias the prediction towards a class (in classification). I can also see cases where one might want the current behavior; it all boils down to the assumption about why the NaNs show up. Ideally, both settings should be supported if it's not too complex.

I can do a quick implementation and try it out on a benchmark before continuing the discussion.

@NicolasHug (Member) commented Apr 10, 2024

I don't think we should raise a warning, unless we can identify cases where missing values at test time are clearly unexpected.

There's no right or wrong strategy here, I believe, but we should note that the strategy suggested above means that predict(X) called twice on the same X may lead to different predictions. Do we have other predictors in scikit-learn that are inherently random (excluding tie-breaking, etc.)?

@betatim added the Needs Investigation label and removed the Needs Triage label on Apr 10, 2024