Partition optimization #7208
base: master
Conversation
(force-pushed from bb7253a to 137183e)
Codecov Report

@@           Coverage Diff           @@
##           master   #7208   +/-  ##
=====================================
  Coverage   82.67%   82.67%
=====================================
  Files          13       13
  Lines        4017     4017
=====================================
  Hits         3321     3321
  Misses        696      696
=====================================

Continue to review the full report at Codecov.
@trivialfis Can we start a detailed review of this PR, since this part covers the partition optimizations only?
check_hist(parent_hist, this_hist, sibling_hist, 0, total_bins);
++node_id;
}
// const common::OptPartitionBuilder* p_opt_partition_builder;
I think this is a sign of work in progress?
my mistake, sorry!
tests were aligned to optimized partition kernel usage
(force-pushed from 137183e to e932247)
@trivialfis failed steps
Yup, I'm aware of the failure. It's blocking the release.
@trivialfis - are we considering this PR to be integrated prior to the release?
No, I don't think it's a good idea to integrate this amount of changes. I looked briefly before and I think it could use some cleanups. I'm on holiday right now; I will provide a detailed review and possible changes once I come back.
@trivialfis - ok, thanks. Have a good holiday!
(force-pushed from a2a6ccd to fd8244c)
similar failures in the master branch here:
Restarted. @hcho3 The issue seems to be lingering.
Hi, I appreciate your excellent optimization, but I do think there's a way to write more readable code.
This review is not detailed yet. A new partitioner that implements std::partition, spans 550 LOC without any comments, and has 29 public variables is a bit too much for me. Also, the old one still seems to be here for some reason. I can't merge a PR with this level of complexity without seriously thinking about a redesign/rewrite. The C++ Core Guidelines have some good reading material and are highly recommended:
https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#main
https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#S-performance
@@ -115,7 +115,7 @@ class XGBoostGeneralSuite extends FunSuite with TmpFolderPerSuite with PerTest {
   val eval = new EvalError()
   val training = buildDataFrame(Classification.train)
   val testDM = new DMatrix(Classification.test.iterator)
-  val paramMap = Map("eta" -> "1", "gamma" -> "0.5", "max_depth" -> "0",
+  val paramMap = Map("eta" -> "1", "gamma" -> "0.5", "max_depth" -> "6",
What's changed?
reverted
src/common/opt_partition_builder.h
Outdated
@@ -0,0 +1,552 @@
Is the original one deleted?
yes, it is deleted now; the partition test was updated accordingly
src/common/opt_partition_builder.h
Outdated
dynamic_cast<const SparseColumn<BinIdxType>*>(GetColumnsRef<BinIdxType>()[fid].get());
}
for (size_t tid = 0; tid < nthreads; ++tid) {
  states[tid].resize(1 << (max_depth + 1), 0);
There's a MaxNodes function in TreeParam.
src/common/opt_partition_builder.h
Outdated
if (is_loss_guided) {
  const int cleft = compleate_trees_depth_wise[0];
  const int cright = compleate_trees_depth_wise[1];
  // template!
?
typo, fixed
src/tree/updater_quantile_hist.h
Outdated
std::vector<uint16_t> curr_level_nodes_;
std::vector<int32_t> split_conditions_;
std::vector<uint64_t> split_ind_;
std::vector<uint16_t> compleate_trees_depth_wise_;
compleate?
fixed, thanks!
src/common/opt_partition_builder.h
Outdated
std::vector<std::vector<Slice>> threads_addr;
std::vector<std::vector<uint16_t>> threads_id_for_nodes;
std::vector<std::vector<uint16_t>> node_id_for_threads;
std::vector<std::vector<uint32_t>> threads_rows_nodes_wise;
std::vector<std::vector<uint32_t>> threads_nodes_count;
std::vector<std::vector<int>> nodes_count;
std::vector<Slice> partitions;
std::vector<std::vector<uint32_t>> vec_rows;
std::vector<std::vector<uint32_t>> vec_rows_remain;
std::vector<std::unique_ptr<const Column<uint8_t> >> columns8;
std::vector<const DenseColumn<uint8_t, true>*> dcolumns8;
std::vector<const SparseColumn<uint8_t>*> scolumns8;
std::vector<std::unique_ptr<const Column<uint16_t> >> columns16;
std::vector<const DenseColumn<uint16_t, true>*> dcolumns16;
std::vector<const SparseColumn<uint16_t>*> scolumns16;
std::vector<std::unique_ptr<const Column<uint32_t> >> columns32;
std::vector<const DenseColumn<uint32_t, true>*> dcolumns32;
std::vector<const SparseColumn<uint32_t>*> scolumns32;
std::vector<std::vector<size_t> > states;
const RegTree* p_tree;
// can be common for all threads!
std::vector<std::vector<bool>> default_flags;
const uint8_t* data_hash;
std::vector<uint32_t> row_set_collection_vec;
uint32_t gmat_n_rows;
uint32_t* row_indices_ptr;
size_t n_threads = 0;
uint32_t summ_size = 0;
uint32_t summ_size_remain = 0;
uint32_t max_depth = 0;
29 public variables ... how do you keep track of that ...
src/common/opt_partition_builder.h
Outdated
const auto& dense_columns = GetDenseColumnsRef<BinIdxType>();
const auto& sparse_columns = GetSparseColumnsRef<BinIdxType>();
uint32_t count = 0;
uint32_t count2 = 0;
temp_0, temp_1 are not a good way to name variables.
changed, thanks!
const RowSetCollection::Elem e = row_set_collection_[nid];
for (const size_t *it = e.begin; it < e.end; ++it) {
  grad_stat.Add(gpair_h[*it].GetGrad(), gpair_h[*it].GetHess());
for (const GradientPair& gh : gpair_h) {
Maybe std::accumulate?
it would affect the precision; otherwise, we would need to support a new operator+(xgboost::detail::GradientPairInternal<double>, const xgboost::detail::GradientPairInternal<float>)
src/tree/updater_quantile_hist.cc
Outdated
for (auto const& entry : expand) {
  if (entry.IsValid(param_, *num_leaves)) {
    nodes_for_apply_split->push_back(entry);
    evaluator_->ApplyTreeSplit(entry, p_tree);
    (*num_leaves)++;
    curr_level_nodes_[2*entry.nid] = (*p_tree)[entry.nid].LeftChild();
    curr_level_nodes_[2*entry.nid + 1] = (*p_tree)[entry.nid].RightChild();
    compleate_node_ids.push_back((*p_tree)[entry.nid].LeftChild());
compleate?
fixed, thanks!
src/common/opt_partition_builder.h
Outdated
}

template<typename BinIdxType, bool is_loss_guided, bool all_dense = true>
void CommonPartition(size_t tid, const size_t row_indices_begin,
src/common/opt_partition_builder.h
Outdated
std::vector<std::unique_ptr<const Column<uint32_t> >> columns32;
std::vector<const DenseColumn<uint32_t, true>*> dcolumns32;
std::vector<const SparseColumn<uint32_t>*> scolumns32;
std::vector<std::vector<size_t> > states;
src/tree/updater_quantile_hist.cc
Outdated
    has_neg_hess = true;
  }
}
const size_t block_size = info.num_row_ / this->nthread_ + !!(info.num_row_ % this->nthread_);
This block size calculation is used in multiple places, including prediction, I think. Any plan to create some abstraction for it?
https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rp-library
std::swap is a small function, yet it's a worthy function to create.
Added a new function GetBlockSize and tried to use it in all similar cases. Thanks!
src/common/opt_partition_builder.h
Outdated
  vec_rows_remain.resize(nthreads);
} else {
  threads_nodes_count.clear();
  threads_nodes_count.resize(n_threads);
Why clear then resize?
We need to clear the full threads_nodes_count buffers. If we keep only resize(), the information in threads_nodes_count is not valid.
(force-pushed from fd8244c to 2e08a58)
@trivialfis thanks a lot for the review, and sorry for the long response time!
I have applied some of the comments, but the ones related to code complexity, the number of input arguments, and reworking in general are still in progress.
(force-pushed from eeb7a88 to 8ed8a91)
Got it. For some reason, I can't reply inline. I will take a deeper dive into the optimization and see if we can work on smaller subsets first. Thanks for the hard work!
Hi, I took another look, but only at the histogram building kernel this time. Can we start by working on this function first? Merging smaller functions makes review and testing easier, and it also has the benefit of decoupling different code sections. What do you think?
const size_t nrows = row_end - row_begin;
const size_t no_prefetch_size = Prefetch::NoPrefetchSize(nrows);

if (is_root) {
May I ask what's the difference between is_root and the existing contiguousBlock?
A contiguous block is mostly possible on the root node; it's a really rare case for nodes with depth > 0 to have contiguous blocks, so there is no need to handle such cases specifically.
Also, the is_root parameter handles resetting the node_ids buffer to zero. As described above, we need to track the node id for each row (rows are not grouped per node), so we have to reset the node_ids buffer to zero at the start of each tree build.
const uint32_t row_end,
const GHistIndexMatrix& gmat,
const BinIdxType* numa_data,
uint16_t* nodes_ids,
Could you please share why we need to pass nodes_ids for building the histogram, instead of passing the histogram buffer for this particular node? That way this function would have a smaller scope, which makes it easier to test.
Also, I think the node id should have the 32-bit type bst_node_t.
Sure.
The changes in hist_builder.h were introduced due to changes in the partition builder.
Before the optimizations, the "full" bin matrix went through memory for each node on the same level during histogram building: potentially, for each node on a specific level, we had to read the bin matrix almost from beginning to end with random strides. Memory bandwidth drops significantly under such conditions.
So, to mitigate this problem, the optimized code keeps "sorted" row indices for each specific level (though no actual sorting is done! :) ), and as a side effect (rows are not grouped by nodes) we have to know which node_id each specific row belongs to. node_id is required to know which node's histogram should be updated with a specific bin matrix row.
Sorry if the description is not clear; please let me know and I'll try to add more details.
As for the 32-bit type usage: right now it works perfectly for max_depth <= 16. I have to check the performance impact before changing it from 16 to 32 bits, and I hope there won't be any reason to add a template parameter for it.
Thank you for the detailed explanation! That is an insightful optimization.
For future reference with these types of optimizations, I think you should put that on the front page of the PR before anything else and in a code comment, and potentially open an RFC like #7308 before making the change, so we can be clear about the design choice. (I write design docs from time to time just to keep myself clear about the changes ...) For this PR, I asked myself again: there's no way I could guess what you are trying to do by looking at the code. ;-)
So, let me try to summarize your summary ;-) Feel free to correct me if I'm wrong. Instead of asking which row should we process given a node index, you ask which node should we process given a row index. This optimization is only applicable when the depth-wise grow policy is used, so the code is split and specialized.
Can you design an intuitive interface for the partitioner and the histogram builder so that the code can be easily reused? I can give it some thought if you can confirm my summary.
As for 32-bit type usage, now it works for max_depth <= 16 perfectly.
I don't think this optimization is necessary.
Instead of asking which row should we process given a node index, you ask which node should we process given a row index. This optimization is only applicable when the depth-wise grow policy is used, so the code is split and specialized.
It's clear and right summary :)
Yes. I had to specialize the partition-builder-related code for the depth-wise and loss-guided grow policies, as for the loss-guided policy we have to know exactly which set of rows is split to the left child node and which to the right.
Can you design an intuitive interface for the partitioner and the histogram builder so that the code can be easily reused? I can give it some thoughts if you can confirm my summary.
Would it be acceptable if the OptPartitionBuilder class supported the depthwise and lossguided policies via the strategy pattern, instead of the is_loss_guided template parameter?
As for the histogram builder, there are no significant changes in BuildHistKernel, only the node_id usage, which is applicable to both growing policies.
const size_t icol_end = any_missing ? row_ptr[ri+1] : icol_start + n_features;
const size_t row_size = icol_end - icol_start;
const size_t idx_gh = two * ri;
const uint32_t nid = is_root ? 0 : mapping_ids[nodes_ids[ri]];
It's unclear to me what this mapping is doing.
This is a simple mapping of the real node id (in the tree currently being built) to the node number on the current depth (0 <= nid < 2^current_depth). This mapping allows re-using the allocated histograms.
template<typename FPType, bool do_prefetch,
         typename BinIdxType, bool is_root,
         bool any_missing>
void BuildHistKernel(const std::vector<GradientPair>& gpair,
Excellent that it's now in a header so we don't have to define those specializations!
yes, it seems some simplification was achieved :)
(force-pushed from a5ec1d6 to fecf89d)
@trivialfis Thanks a lot for the review!
src/common/opt_partition_builder.h
Outdated
@@ -204,6 +204,13 @@ class OptPartitionBuilder {
  if (!all_dense) {
    std::vector<size_t>& local_states = states[tid];
    std::vector<bool>& local_default_flags = default_flags[tid];
    if (row_indices_begin >= row_indices_end) {
Hey, I saw that you pushed a fix, presumably for something found during local testing. But really, there's no way I can guess what you are trying to fix. These few lines look like a hacky workaround for some edge cases. That's fine; I do that all the time during development. But after the trial-and-error period, can we think of a more general design that doesn't have these conditions at all? Or do some cleanup so that the code can explain itself?
I changed this fix, thanks for the comment!
It doesn't look like a hacky workaround anymore ;)
And I agree that a clearer and more general design should be proposed for OptPartitionBuilder, possibly by applying the strategy pattern.
I'm happy to schedule a call if needed. I think we have some different points of view, and it would be great to share more.
(force-pushed from 906d5a9 to d155edc)
(force-pushed from bf0705f to b327799)
…ation_part_applysplit
(force-pushed from b327799 to 3b08089)
All changes in this PR are related to each other: the new partition builder required introducing changes into the histogram building kernels, and neither part works separately.
…ation_part_applysplit
(force-pushed from 5307902 to efb4f50)
…ation_part_applysplit
(force-pushed from d41893c to 2525030)
…Matrix are implemented.
Hello, everyone.
(force-pushed from 3ccb329 to 7780172)
…ation_part_applysplit
(force-pushed from 32aef0d to 9227502)
it's part of #7192
The current implementation of the partition allows reading bin matrix data with less random access