Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support slicing tree model #6302

Merged
merged 17 commits into from Nov 3, 2020
Merged

Support slicing tree model #6302

merged 17 commits into from Nov 3, 2020

Conversation

trivialfis
Copy link
Member

This PR is meant the end the confusion around best_ntree_limit and unify model slicing. We have multi-class and random forests, asking users to understand how to set ntree_limit is difficult and error prone.

Close #5531 Close #4052.

  • Implement the save_best option in early stopping.
  • Negative index is not supported.

@trivialfis
Copy link
Member Author

trivialfis commented Oct 28, 2020

Currently I treat out of bound error specially and raise IndexError but ValueError for other issues. Should I always raise IndexError instead? Feel free to provide suggestions . ;-)

@trivialfis trivialfis mentioned this pull request Oct 28, 2020
14 tasks
@hcho3 hcho3 added the Blocking label Oct 30, 2020
@codecov-io
Copy link

codecov-io commented Oct 30, 2020

Codecov Report

Merging #6302 into master will increase coverage by 0.56%.
The diff coverage is 96.36%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6302      +/-   ##
==========================================
+ Coverage   80.75%   81.32%   +0.56%     
==========================================
  Files          12       12              
  Lines        3372     3421      +49     
==========================================
+ Hits         2723     2782      +59     
+ Misses        649      639      -10     
Impacted Files Coverage Δ
python-package/xgboost/training.py 96.49% <33.33%> (+0.03%) ⬆️
python-package/xgboost/callback.py 92.98% <100.00%> (+0.37%) ⬆️
python-package/xgboost/core.py 80.00% <100.00%> (+1.65%) ⬆️
python-package/xgboost/tracker.py 95.18% <0.00%> (+1.20%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 608bda7...60368c8. Read the comment docs.

@trivialfis
Copy link
Member Author

Pasting offline conversion with @hcho3 here. The trees in xgboost can be considered as a 3 dim tensor. First dim is the number of boosting round, second is number of classes, and the last is size of forest. This PR supports only slicing the first dim (number of boosted rounds), it's possible to support slicing other dimensions. but to us that seems to be over engineering.

Copy link
Member

@RAMitchell RAMitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great new feature. Any plans to deprecate ntree_limit throughout the code base in favour of the new terminology?

@trivialfis
Copy link
Member Author

There are other language bindings out there. I need to go over them to deprecate the parameter.

dtrain = xgb.DMatrix(data=X, label=y)
num_parallel_tree = 4
num_boost_round = 16
total_trees = num_parallel_tree * num_classes * num_boost_round
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is not used anywhere in the code snippet.

Suggested change
total_trees = num_parallel_tree * num_classes * num_boost_round

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted into a comment.

Comment on lines 584 to 590
* \brief Slice a model according to layers.
*
* \param handle Booster to be sliced.
* \param begin_layer start of the slice
* \param end_layer end of the slice
* \param step step size of the slice
* \param out Sliced booster.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* \brief Slice a model according to layers.
*
* \param handle Booster to be sliced.
* \param begin_layer start of the slice
* \param end_layer end of the slice
* \param step step size of the slice
* \param out Sliced booster.
* \brief Slice a model using boosting index. The slice m:n indicates taking all trees
* that were fit during the boosting rounds m, (m+1), (m+2), ..., (n-1).
*
* \param handle Booster to be sliced.
* \param begin_layer start of the slice
* \param end_layer end of the slice; end_layer=0 is equivalent to
* end_layer=num_boost_round
* \param step step size of the slice
* \param out Sliced booster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments.

Comment on lines 63 to 68
/*!
* \brief Slice the model.
* \param layer_begin Begining of boosted tree layer used for prediction.
* \param layer_end End of booster layer. 0 means do not limit trees.
* \param out Output gradient booster
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/*!
* \brief Slice the model.
* \param layer_begin Begining of boosted tree layer used for prediction.
* \param layer_end End of booster layer. 0 means do not limit trees.
* \param out Output gradient booster
*/
/*!
* \brief Slice a model using boosting index. The slice m:n indicates taking all trees
* that were fit during the boosting rounds m, (m+1), (m+2), ..., (n-1).
* \param layer_begin Begining of boosted tree layer used for prediction.
* \param layer_end End of booster layer. 0 means do not limit trees.
* \param out Output gradient booster
*/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments.

Comment on lines 391 to 393
def test_slice(self):
self.run_slice('gbtree')
self.run_slice('dart')
Copy link
Collaborator

@hcho3 hcho3 Nov 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use @pytest.mark.parameterize instead?

@pytest.mark.parameterize(booster, ['gbtree', 'dart'])
def test_slice(self, booster):
    # Body of test

See examples at Parameterizing tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems not being compatible with class method:

TypeError: test_slice() missing 1 required positional argument: 'booster'

Copy link
Collaborator

@hcho3 hcho3 Nov 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try @pytest.mark.parameterize('booster', ['gbtree', 'dart']) (note the quotes around booster

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trivialfis Also, TestModels should not be a subclass of unittest.TestCase. https://stackoverflow.com/a/35562401. Try making the class a subclass of object:

class TestModels(object):

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, done.

layer_begin, layer_end, step, this->model_, tparam_, layer_trees,
[&](auto const &in_it, auto const &out_it) {
auto new_tree =
std::make_unique<RegTree>(*this->model_.trees.at(in_it));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have assurance that the implicitly generated copy constructor RegTree(const RegTree&) behaves correctly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests with prediction.

@hcho3
Copy link
Collaborator

hcho3 commented Nov 3, 2020

The parametrization looks quite nice. The benefits are as follows:

  • If a parametrized test fails for only one of the multiple parameters, the test result will clearly indicate which parameter led to failure.
  • It allows for more compact code, as we no longer have to repeat the same code to test multiple scenarios.

I'd like to submit a follow-up PR to introduce more test parametrization where it's appropriate. For example, the following snippet can be made more compact using a test parametrization:

def test_empty_dmatrix_hist():
with LocalCluster(n_workers=kWorkers) as cluster:
with Client(cluster) as client:
parameters = {'tree_method': 'hist'}
run_empty_dmatrix_reg(client, parameters)
run_empty_dmatrix_cls(client, parameters)
def test_empty_dmatrix_approx():
with LocalCluster(n_workers=kWorkers) as cluster:
with Client(cluster) as client:
parameters = {'tree_method': 'approx'}
run_empty_dmatrix_reg(client, parameters)
run_empty_dmatrix_cls(client, parameters)

@hcho3 hcho3 merged commit 2cc9662 into dmlc:master Nov 3, 2020
@trivialfis trivialfis deleted the slice-model branch November 3, 2020 07:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Removing trees from a model [feature proposal] Load best model after early stopping
4 participants