
Should we tune only colsample_bynode instead of multiple colsample_by* #5988

Closed · QuantHao opened this issue Aug 6, 2020 · 2 comments

QuantHao commented Aug 6, 2020

Hi all,

XGBoost provides three colsample_by* parameters: colsample_bytree, colsample_bylevel, and colsample_bynode. The documentation says that "colsample_by* parameters work cumulatively. For instance, the combination {'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5} with 64 features will leave 8 features to choose from at each split."

Thus, it seems that users should tune only colsample_bynode instead of all three, in order to reduce the tuning search space. What's the drawback of doing this?
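
For concreteness, here is a minimal sketch of the two tuning strategies with the native xgboost API (the synthetic dataset and parameter values are only illustrative, not from the documentation):

```python
import numpy as np
import xgboost as xgb

# Synthetic data with 64 features, to mirror the documentation example.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

# Strategy A: tune all three colsample_by* parameters.
# They multiply: 64 * 0.5 * 0.5 * 0.5 = 8 candidate features per split.
params_all = {
    "objective": "binary:logistic",
    "colsample_bytree": 0.5,
    "colsample_bylevel": 0.5,
    "colsample_bynode": 0.5,
}

# Strategy B: tune only colsample_bynode.
# 64 * 0.125 = 8 candidate features per split, the same expected count
# per node, though not an identical sampling scheme.
params_node = {
    "objective": "binary:logistic",
    "colsample_bynode": 0.125,
}

booster_all = xgb.train(params_all, dtrain, num_boost_round=10)
booster_node = xgb.train(params_node, dtrain, num_boost_round=10)
```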

Also, the random forest tutorial seems to support my argument above, since it uses only colsample_bynode. Aren't the colsample_by* parameters borrowed from random forests?
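
For reference, the random-forest-style setup I have in mind looks roughly like the one in the XGBoost random forest tutorial (the exact values here are illustrative, and `dtrain` is assumed to be the DMatrix from the previous snippet):

```python
import xgboost as xgb

# Random-forest-style configuration: one boosting round that grows many
# parallel trees, with row subsampling per tree and column subsampling per node.
rf_params = {
    "objective": "binary:logistic",
    "learning_rate": 1.0,       # no shrinkage, as in a plain random forest
    "num_parallel_tree": 100,   # number of trees in the forest
    "subsample": 0.8,           # row subsampling per tree
    "colsample_bynode": 0.8,    # column subsampling per split
    "max_depth": 6,
}

forest = xgb.train(rf_params, dtrain, num_boost_round=1)
```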

Thanks for any help.

trivialfis (Member) commented Aug 6, 2020

Thus, it seems that users should tune only colsample_bynode instead of all three, in order to reduce the tuning search space. What's the drawback of doing this?

These parameters are really just using randomness to trade off bias against variance. In the end, it's about how much randomness you want. If you use bytree, then each tree will have a fixed set of splitting features. On the other hand, if you use bynode, the model might generalize better, but it will also be much harder to interpret. For instance, if we want to change the default random seed or the sampling algorithm for some reason (I tried not to do that in #5962, but it may happen in the future), you will get a completely different model explanation from SHAP the next time you train on exactly the same dataset. So it's a trade-off you have to decide on.
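
To make the interpretability point concrete, here is a rough sketch (not from the original discussion; the synthetic dataset, seeds, and parameter values are made up) of how one might check how much a SHAP explanation moves when only the random seed changes under bynode sampling:

```python
import numpy as np
import xgboost as xgb
import shap  # assumes the shap package is installed

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

def mean_abs_shap(seed):
    """Train with per-node column sampling and return mean |SHAP| per feature."""
    params = {
        "objective": "binary:logistic",
        "colsample_bynode": 0.5,
        "seed": seed,
    }
    booster = xgb.train(params, dtrain, num_boost_round=50)
    shap_values = shap.TreeExplainer(booster).shap_values(X)
    return np.abs(shap_values).mean(axis=0)

# Same data, different seeds: with bynode the per-feature attributions
# (and hence the apparent feature ranking) can shift noticeably.
imp_a = mean_abs_shap(seed=1)
imp_b = mean_abs_shap(seed=2)
print(np.corrcoef(imp_a, imp_b)[0, 1])
```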

QuantHao (Author) commented Aug 7, 2020

Thanks so much for your reply. It seems that colsample_bynode may prevent overfitting better, while colsample_bytree provides better stability for SHAP explanations.
