
BUG(?) Missing-values in RandomForest only during inference time shouldn't send missing-values to the child with most samples #28772

Open
adam2392 opened this issue Apr 4, 2024 · 8 comments
Labels: Needs Investigation

Comments

@adam2392 (Contributor) commented Apr 4, 2024

Currently, when missing values occur only in the test data given to a fitted RandomForest (i.e. none were seen during training), the policy is to send them to the child node with the most samples. In effect, this imputes the missing value using the data in that larger child. The issue is that this may bias the tree's predictions towards whichever class had more samples in the training data.

For example, say there are 1000 training samples of class 1 and 10 training samples of class 0, and some NaNs then appear at test time. The predictions would be biased towards class 1, whereas they should really be uninformative, because NaNs that appear only at test time are treated as missing completely at random.
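
For concreteness, here is a minimal sketch of the imbalanced setting described above (assuming scikit-learn >= 1.4, where RandomForestClassifier accepts NaN in the input at predict time; the exact probabilities in the output will depend on the data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Imbalanced training set: 1000 samples of class 1 and 10 of class 0,
# with no missing values at fit time.
X_train = np.vstack([
    rng.normal(1.0, 1.0, size=(1000, 3)),
    rng.normal(-1.0, 1.0, size=(10, 3)),
])
y_train = np.array([1] * 1000 + [0] * 10)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# A fully-missing test sample: under the current policy it is routed to the
# child with the most samples at every split, so the predicted probabilities
# are expected to lean heavily towards class 1.
X_test = np.full((1, 3), np.nan)
print(clf.predict_proba(X_test))
```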

Proposed Solution

An alternative, and arguably more sensible, strategy is that when NaNs are not encountered during training but show up in the test data, they should be sent stochastically down the tree using the weights (see the sketch after this list):

  • p_left_child = n_left_samples / (n_left_samples + n_right_samples)
  • p_right_child = n_right_samples / (n_left_samples + n_right_samples)

This ensures that there is no bias towards the class with more samples. It can be implemented by allowing `missing_go_to_left` (currently declared as `unsigned char missing_go_to_left  # Controls if missing values go to the left node.`) to take the value 2. A value of 2 would mean that no missing values were observed for this split during training, so the routing should be decided stochastically.
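
A minimal Python sketch of the proposed routing rule; the node attributes used here (left, right, n_node_samples, missing_go_to_left) are illustrative stand-ins, not the actual Cython tree fields:

```python
import numpy as np

def route_missing_left(node, rng):
    """Decide whether a sample with a missing value goes to the left child."""
    if node.missing_go_to_left == 2:
        # No missing values were seen for this split during training:
        # route stochastically, weighted by how many training samples
        # ended up in each child.
        n_left = node.left.n_node_samples
        n_right = node.right.n_node_samples
        p_left = n_left / (n_left + n_right)
        return rng.random() < p_left
    # 0 or 1: deterministic routing learned at fit time.
    return bool(node.missing_go_to_left)

# Usage at prediction time (node is a hypothetical tree node object):
# rng = np.random.default_rng(0)
# go_left = route_missing_left(node, rng)
```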

Overall, it's a very simple change, and I can also implement the relevant unit tests.

cc: @thomasjpfan, who implemented the original missing-value support in RandomForest.

Related

xref: This policy will also impact #27966 and #28268

This is also an issue in other estimators that handle NaNs: https://scikit-learn.org/stable/modules/ensemble.html#missing-values-support

@github-actions bot added the Needs Triage label on Apr 4, 2024
@thomasjpfan (Member)

For random forest, I used the same behavior as in HistGradientBoosting, which adopted the behavior from LightGBM. @NicolasHug do you have more context on the missing-value behavior for HistGradientBoosting?

@adam2392 (Contributor, Author) commented Apr 8, 2024

If it helps the discussion, I found this old issue in LightGBM, which seems to reflect their docs (I'm unsure because I can't find a specific line mentioning exactly how they treat this edge case).

To summarize, the old issue says missing values at test time (that were not seen during training) are replaced with '0'. This seems like a reasonable imputation if the feature is 0-centered, but a bit odd overall. The stochastic treatment in the proposed solution would:

  1. still behave similarly to replacing with 0's when the data is 0-centered, and
  2. (more importantly) work better than replacing with 0's when the data is not 0-centered, because of the issues outlined in the issue description.

Let me know if there's some analysis I'm missing with respect to LightGBM on why this behavior is the way it is.

@glemaitre (Member)

Is there any overhead from the stochastic process? For instance, in the case of tie-breaking in nearest-neighbors, we did not use a stochastic process because it would have been detrimental in terms of computing performance (a huge regression).

@adam2392 (Contributor, Author) commented Apr 8, 2024

I imagine there is some overhead due to potentially querying an RNG many times. Though this would itself be an edge case, since it implies there are many missing values at test time that were not seen during training. That is, the only time we would stochastically route left/right is when a split uses a feature for which no missing values were encountered at that node during training. I personally think this edge case is less worrisome than the bias point I raised, but it is of course plausible in practice.

I think the situation you describe is a good one, though, and makes the point that things may be more complex; it could presumably occur in practice. Some ideas:

  1. Treat it as an edge case, accept that it would not come up as often as tie-breaking in nearest-neighbors, and simply allow stochastic routing of the test data.
  2. Allow the user to specify this behavior at training time, so that missing_go_to_left can be set accordingly when no missing values are encountered.
  3. Allow the user to specify this behavior at prediction time. I'm unsure how best to implement this, so I would have to take a look at the code.

@betatim (Member) commented Apr 9, 2024

If it is easy to implement the "make a weighted random choice" option (what you propose), then we could run a benchmark to see what the performance hit is for different cases (many missing values in many features, many missing values in a few features, few missing values in few features, etc.). That would give us a way to decide whether performance is something to worry about when making a decision.
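
A rough benchmark sketch along these lines (illustrative only; assumes scikit-learn >= 1.4, where RandomForestClassifier accepts NaN at predict time):

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
for frac in (0.0, 0.01, 0.1, 0.5):
    # Inject a growing fraction of missing values into the test data.
    X_test = X.copy()
    mask = rng.random(X_test.shape) < frac
    X_test[mask] = np.nan
    tic = time.perf_counter()
    clf.predict(X_test)
    print(f"missing fraction {frac:.2f}: {time.perf_counter() - tic:.3f} s")
```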

@glemaitre (Member)

In terms of application use cases, I'm also wondering whether we should error/warn if a user provides missing values at test time when we did not see any during training. @adam2392 do you have a legitimate use case where this is kind of normal?

@adam2392 (Contributor, Author)

I think a warning would be nice, but an error might be overkill: when a NaN pops up in the test data (but not in training), ideally we would like to ignore its effect. If we errored out, the user would need to fill in the NaN value or drop the sample altogether.

On the other hand, the issue right now is that by sending the NaN down the child with more samples, we bias the prediction towards a class (in classification). I can also see cases where one might want the current behavior; it all boils down to the assumption about why the NaNs show up. Ideally, both settings should be supported if it's not too complex.

I can do a quick implementation and try it out on a benchmark before continuing the discussion.

@NicolasHug (Member) commented Apr 10, 2024

I don't think we should raise a warning, unless we can identify cases where missing values at test time are clearly unexpected.

There's no right or wrong strategy here, I believe, but we should note that the strategy suggested above means that predict(X) called twice on the same X may lead to different predictions. Do we have other predictors in scikit-learn that are inherently random (excluding tie-breaking, etc.)?

@betatim added the Needs Investigation label and removed the Needs Triage label on Apr 10, 2024