
Should we tune only colsample_bynode instead of multiple colsample_by* #5988

Closed · QuantHao opened this issue Aug 6, 2020 · 2 comments

QuantHao commented Aug 6, 2020

Hi all,

XGBoost provides three colsample_by* parameters: colsample_bytree, colsample_bylevel, and colsample_bynode. The documentation says that "colsample_by* parameters work cumulatively. For instance, the combination {'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5} with 64 features will leave 8 features to choose from at each split."

Thus, it seems that users should tune only colsample_bynode instead of all three, in order to reduce the tuning search space. What's the drawback of doing this?
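
For concreteness, here is a minimal sketch of the two tuning strategies with the native xgboost API (the synthetic dataset and parameter values are only illustrative, not from the documentation):

```python
import numpy as np
import xgboost as xgb

# Synthetic data with 64 features, to mirror the documentation example.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

# Strategy A: tune all three colsample_by* parameters.
# They multiply: 64 * 0.5 * 0.5 * 0.5 = 8 candidate features per split.
params_all = {
    "objective": "binary:logistic",
    "colsample_bytree": 0.5,
    "colsample_bylevel": 0.5,
    "colsample_bynode": 0.5,
}

# Strategy B: tune only colsample_bynode.
# 64 * 0.125 = 8 candidate features per split, the same expected count
# per node, though not an identical sampling scheme.
params_node = {
    "objective": "binary:logistic",
    "colsample_bynode": 0.125,
}

booster_all = xgb.train(params_all, dtrain, num_boost_round=10)
booster_node = xgb.train(params_node, dtrain, num_boost_round=10)
```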

Also, the random forest tutorial seems to support my argument above, since it uses only colsample_bynode. Aren't the colsample_by* parameters borrowed from random forests?
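
For reference, the random-forest-style setup I have in mind looks roughly like the one in the XGBoost random forest tutorial (the exact values here are illustrative, and `dtrain` is assumed to be the DMatrix from the previous snippet):

```python
import xgboost as xgb

# Random-forest-style configuration: one boosting round that grows many
# parallel trees, with row subsampling per tree and column subsampling per node.
rf_params = {
    "objective": "binary:logistic",
    "learning_rate": 1.0,       # no shrinkage, as in a plain random forest
    "num_parallel_tree": 100,   # number of trees in the forest
    "subsample": 0.8,           # row subsampling per tree
    "colsample_bynode": 0.8,    # column subsampling per split
    "max_depth": 6,
}

forest = xgb.train(rf_params, dtrain, num_boost_round=1)
```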

Thanks for any help.

trivialfis (Member) commented Aug 6, 2020

Thus, it seems that users should tune only colsample_bynode instead of all three, in order to reduce the tuning search space. What's the drawback of doing this?

These parameters are really just using randomness to trade off bias against variance. In the end, it's about how much randomness you want. If you use bytree, then each tree will have a fixed set of splitting features. On the other hand, if you use bynode, the model might generalize better, but it will also be much harder to interpret. For instance, if we want to change the default random seed or the sampling algorithm for some reason (I tried not to do that in #5962, but it may happen in the future), you will get a completely different model explanation from SHAP the next time you train on exactly the same dataset. So it's a trade-off you have to decide on.
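
To make the interpretability point concrete, here is a rough sketch (not from the original discussion; the synthetic dataset, seeds, and parameter values are made up) of how one might check how much a SHAP explanation moves when only the random seed changes under bynode sampling:

```python
import numpy as np
import xgboost as xgb
import shap  # assumes the shap package is installed

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

def mean_abs_shap(seed):
    """Train with per-node column sampling and return mean |SHAP| per feature."""
    params = {
        "objective": "binary:logistic",
        "colsample_bynode": 0.5,
        "seed": seed,
    }
    booster = xgb.train(params, dtrain, num_boost_round=50)
    shap_values = shap.TreeExplainer(booster).shap_values(X)
    return np.abs(shap_values).mean(axis=0)

# Same data, different seeds: with bynode the per-feature attributions
# (and hence the apparent feature ranking) can shift noticeably.
imp_a = mean_abs_shap(seed=1)
imp_b = mean_abs_shap(seed=2)
print(np.corrcoef(imp_a, imp_b)[0, 1])
```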

QuantHao (Author) commented Aug 7, 2020

Thanks so much for your reply. It seems that colsample_bynode may prevent overfitting better, while colsample_bytree provides better stability for SHAP explanations.
