XGBoost provides three colsample_by* parameters: colsample_bytree, colsample_bylevel, and colsample_bynode. The documentation says that "colsample_by* parameters work cumulatively. For instance, the combination {'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5} with 64 features will leave 8 features to choose from at each split."
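The cumulative effect quoted from the documentation can be checked with simple arithmetic (a sketch only; the values are the ones from the quoted example, and XGBoost's exact rounding behavior may differ slightly):

```python
# The three rates multiply: each level samples from the subset
# chosen at the level above it (tree -> depth level -> node).
n_features = 64
colsample_bytree = 0.5
colsample_bylevel = 0.5
colsample_bynode = 0.5

per_tree = int(n_features * colsample_bytree)    # 32 columns available per tree
per_level = int(per_tree * colsample_bylevel)    # 16 columns per depth level
per_node = int(per_level * colsample_bynode)     # 8 columns to choose from at each split
print(per_node)  # 8
```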
Thus, it seems that users should tune only colsample_bynode instead of all three, in order to reduce the tuning search space. What is the drawback of doing this?
Also, the random forest tutorial seems to support my argument, because it only uses colsample_bynode. Isn't colsample_by* borrowed from random forests?
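For reference, a minimal sketch of random-forest-style parameters in the spirit of the XGBoost random forest tutorial the question mentions (the parameter names are real XGBoost keys, but the specific values here are illustrative assumptions, not from the tutorial verbatim):

```python
# Random-forest mode in XGBoost: grow many parallel trees in one
# boosting round, with per-split column sampling only.
rf_params = {
    "colsample_bynode": 0.8,   # per-split column sampling, as in random forests
    "subsample": 0.8,          # row sampling per tree
    "num_parallel_tree": 100,  # grow a whole forest in a single round
    "learning_rate": 1.0,      # no shrinkage: trees are averaged, not boosted
    "objective": "binary:logistic",
    "tree_method": "hist",
}
# xgboost.train(rf_params, dtrain, num_boost_round=1) would then fit the forest;
# note that colsample_bytree and colsample_bylevel are left at their defaults.
print(sorted(rf_params))
```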
Thanks for any help.
Thus, it seems that users should tune only colsample_bynode instead of all three, in order to reduce the tuning search space. What is the drawback of doing this?
These are really just using randomness to manage the bias-variance trade-off. In the end, it's about how much randomness you want. If you use bytree, then each tree will have a fixed set of splitting features. On the other hand, if you use bynode, the model might generalize better, but it would also be more difficult to interpret. For instance, if we change the default random seed or sampling algorithm for some reason (I tried not to do that in #5962, but it may happen in the future), you will get a completely different model explanation from SHAP the next time you train the model on exactly the same dataset. So it's a trade-off you have to decide.
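The difference in randomness can be sketched with a toy simulation (assumptions: `sample_cols` is a hypothetical helper, not part of XGBoost, and the subsets stand in for the candidate features at three nodes of one tree):

```python
import random

rng = random.Random(0)
features = list(range(8))

def sample_cols(cols, rate, rng):
    """Draw a random subset of columns at the given rate (hypothetical helper)."""
    k = max(1, int(len(cols) * rate))
    return sorted(rng.sample(cols, k))

# bytree: one subset is drawn per tree, so every node sees the same columns.
tree_cols = sample_cols(features, 0.5, rng)
bytree_nodes = [tree_cols for _ in range(3)]

# bynode: a fresh subset is drawn at every node, injecting more randomness
# per split (which is why the resulting explanation is less stable).
bynode_nodes = [sample_cols(features, 0.5, rng) for _ in range(3)]

print(bytree_nodes)   # three identical subsets
print(bynode_nodes)   # subsets typically differ from node to node
```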
Thanks so much for your reply. It seems that colsample_bynode is better at preventing overfitting, while colsample_bytree provides more stable SHAP explanations.