BUG(?) Missing-values in RandomForest only during inference time shouldn't send missing-values to the child with most samples #28772
Comments
For random forest, I used the same behavior as in HistGradientBoosting, which adopted the behavior from LightGBM. @NicolasHug Do you have more context on the missing-value behavior for HistGradientBoosting?
If it helps the discussion, I found this old issue in LightGBM, which seems to reflect their docs (I'm unsure because I can't find a specific line mentioning how exactly they treat this edge case). To summarize, it seems the old issue stated they replaced
Lmk if there's some analysis I'm missing wrt LightGBM on perhaps why this behavior is the way it is.
Is there any overhead from having the stochastic process? For instance, in the case of tie-breaking in nearest-neighbors, we did not use a stochastic process because it would have been detrimental to computing performance on large regression problems.
I imagine there is some overhead due to potentially querying an RNG many times. Though this would also be an edge case, since it implies there are many missing values seen at test time that were not seen during training. I.e. the only time we would stochastically route left/right is when a specific feature has no missing values encountered at that tree node during training. I personally think this edge case is less worrisome than the bias point I raised, but it's plausible to occur in practice of course. I think the situation you describe is a good one though, and makes the point that the situation may be more complex. Some ideas:
If it is easy to implement the "make a weighted random choice" option (what you propose), then we could run a benchmark to see what the performance hit is for different cases (many missing values in many features, many missing values in a few features, few missing values in few features, etc.). That would give us a way to decide whether performance is something to worry about when making a decision.
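Such a benchmark could start from a micro-measurement of the RNG cost alone. The sketch below is illustrative only (the function names are hypothetical, not scikit-learn internals); it compares a fixed-direction routing against a vectorized stochastic one:

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

def deterministic_routing():
    # Current policy: every missing value goes to the same child.
    return np.zeros(n_samples, dtype=np.int8)

def stochastic_routing(p_left=0.9):
    # Proposed policy: draw one uniform variate per missing value and
    # route left with probability p_left, right otherwise.
    return (rng.random(n_samples) >= p_left).astype(np.int8)

t_det = timeit.timeit(deterministic_routing, number=50)
t_sto = timeit.timeit(stochastic_routing, number=50)
print(f"deterministic: {t_det:.4f}s, stochastic: {t_sto:.4f}s")
```

In a real benchmark one would vary the fraction of missing values and the number of affected features, as suggested above.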
In terms of an application use case, I'm also wondering whether we should error/warn if a user starts to provide missing values at test time when we did not see any during training. @adam2392 do you have a legitimate use case where this is kind of normal?
I think a warning would be nice, but an error might be overkill, because when a NaN pops up in the testing dataset (but not training), ideally we would like to ignore its effect. If we errored out, the user would need to fill in the NaN value or drop the sample altogether. OTOH, the issue right now is that sending it down the child with more samples biases the NaN towards a class (in classification). I also see a case where maybe one would like the current behavior. It all boils down to the assumption about why the NaNs show up. Ideally, both settings should be supported if it's not too complex. I can do a quick implementation and try it out on a benchmark before continuing the discussion?
I don't think we should raise a warning, unless we can identify clear cases where missing values at test time are clearly unexpected. There's no right or wrong strategy here, I believe, but we should note that the strategy suggested above means that
Currently, when missing values occur only in the testing dataset of a RandomForest, the policy is to send them to the child with the most samples. This amounts, in some sense, to imputing the missing data using the data in the child with the most samples. The issue is that this may bias the tree prediction towards, say, the class with more samples in the training dataset.
For example, say there are 1000 training samples of class 1 and 10 training samples of class 0, and then during test time there are some NaNs. The predictions would then bias towards class 1, whereas it should really be uninformative because the NaNs during test time are treated as missing completely at random.
Proposed Solution
An alternative and more sensible strategy is that when NaNs are not encountered during training but show up in the testing data, they should be sent stochastically down the tree using weights:
p_left_child = n_left_samples / (n_left_samples + n_right_samples)
p_right_child = n_right_samples / (n_left_samples + n_right_samples)
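A minimal sketch of that weighted random choice (pure Python/NumPy; `route_missing` is a hypothetical helper for illustration, not scikit-learn API):

```python
import numpy as np

def route_missing(n_left_samples, n_right_samples, rng):
    """Route a missing value left (True) or right (False), with
    probability proportional to each child's training-sample count."""
    p_left = n_left_samples / (n_left_samples + n_right_samples)
    return rng.random() < p_left

rng = np.random.default_rng(42)
# With 900 training samples on the left child and 100 on the right,
# roughly 90% of missing values should be routed left.
went_left = sum(route_missing(900, 100, rng) for _ in range(10_000))
print(went_left / 10_000)  # ≈ 0.9
```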
This ensures that there is no bias towards the class "with more samples". It can be implemented by allowing the value of `missing_go_to_left` (scikit-learn/sklearn/tree/_splitter.pxd, line 28 in 6bf0ba5) to take the value 2. If the value is 2, it implies that missing values were not observed during training time, and the direction should be set stochastically. Overall, it's a very simple change, and I can also implement relevant unit tests.
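As a sketch of the predict-time branching this implies (Python pseudocode mirroring the idea; the sentinel value 2 follows the proposal above, while the function and argument names are illustrative, not the actual Cython code):

```python
import numpy as np

def choose_child(missing_go_to_left, n_left, n_right, rng):
    # missing_go_to_left: 1 -> left, 0 -> right (learned during fit);
    # 2 -> missing values were never seen at this split during
    # training, so route stochastically by child sample counts.
    if missing_go_to_left == 1:
        return "left"
    if missing_go_to_left == 0:
        return "right"
    p_left = n_left / (n_left + n_right)
    return "left" if rng.random() < p_left else "right"

rng = np.random.default_rng(0)
print(choose_child(2, 990, 10, rng))
```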
cc: @thomasjpfan who implemented the original missing-value support in RandomForest.
Related
xref: This policy will also impact #27966 and #28268
This is also an issue in other estimators that handle NaNs: https://scikit-learn.org/stable/modules/ensemble.html#missing-values-support