-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimization/apply split/column matrix #7991
base: master
Are you sure you want to change the base?
Optimization/apply split/column matrix #7991
Conversation
We are trying to unblock the CI, will keep you posted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the long wait. I have restored the CI, please rebase to master branch. It seems the PR is reverting some of the latest refactorings. Could you please provide a brief description on what the PR does and why do you need to make those changes?
@@ -66,10 +65,12 @@ class PartitionBuilder { | |||
auto p_row_indices = row_indices.data(); | |||
auto n_samples = row_indices.size(); | |||
|
|||
const uint32_t first_row_id = n_samples > 0 ? p_row_indices[0] : 0; | |||
size_t state = column->GetInitialState(first_row_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hid the state inside the column and made the column an object similar to an iterator. Can we keep that small abstraction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is a part of big optimization. In another parts partition_builder.h will be fully rewritten (see here for a reference).Thus I think we can not to pay much attention to changes in the current realization in partition_builder.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I saw some new changes in the original PR, will look into it Monday.
@@ -60,16 +61,17 @@ GHistIndexMatrix::GHistIndexMatrix(DMatrix *p_fmat, bst_bin_t max_bins_per_feat, | |||
GHistIndexMatrix::~GHistIndexMatrix() = default; | |||
|
|||
void GHistIndexMatrix::PushBatch(SparsePage const &batch, common::Span<FeatureType const> ft, | |||
bst_bin_t n_total_bins, int32_t n_threads) { | |||
size_t rbegin, bst_bin_t n_total_bins, int32_t n_threads) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rbegin
is no longer necessary
} | ||
|
||
// construct column matrix from GHistIndexMatrix | ||
inline void Init(SparsePage const& page, const GHistIndexMatrix& gmat, double sparse_threshold, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Batch
is for preparing the column matrix to use adapters https://github.com/dmlc/xgboost/blob/master/src/data/adapter.h . We want to avoid SparsePage as it might consume 3 times more memory than the original data.
Dear @trivialfis , This block of changes is the part of the optimization proposed in #7208. @ShvetsKS is busy in other projects, so I will help to adapt the code to the acceptable view. One of the main problem with the original PR #7208 is high amount of changing code in one PR. For this reason I separated one significant block of logic in this PR for simplification of reviewing and more easy maintaining. After this PR I plan to refactor the optimized partition builder from the original PR #7208. |
Apologies for the confusion and delay. I appreciate the effort in making smaller PRs. Would you like to have a chat over zoom or any other platforms? I'm a little bit confused by the changes. |
Dear @trivialfis , |
@razdoburdin Thank you, please add @RAMitchell as well. |
I sent the invitation to the email from your github profiles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't really look like changes related to anything, as the new methods are not used.
If I was looking to break up the original PR into parts, I would start by modifying interfaces. For example the current partition builder has an interface designed to perform partitioning for a single node - it needs to change to accept multiple nodes. Along with this you could start moving some of the code here
xgboost/src/tree/updater_quantile_hist.h
Line 154 in 2365f82
common::ParallelFor2d(space, ctx->Threads(), [&](size_t node_in_set, common::Range1d r) { |
Once the interfaces are changed it becomes much easier to change the internals.
const size_t* p = std::lower_bound(row_data, row_data + column_size, first_row_idx); | ||
// column_size if all missing | ||
idx_ = p - row_data; | ||
if (!((*state) < column_size)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps if (*state >= column_size) is easier to read?
* TODO: Delete after the optimizied partition builder will be implemented | ||
*/ | ||
template <typename CastType = Column::BinCmpType> | ||
CastType GetBinId(size_t idx, size_t* state) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the only difference here is having one less template parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method only needs for maintaining the compatibility with the current partition builder. When the optimized partition builder from the main PR is implemented, this method will be deleted.
// get number of features | ||
bst_feature_t GetNumFeature() const { return static_cast<bst_feature_t>(type_.size()); } | ||
// get index data ptr | ||
template <typename Data> | ||
const Data* GetIndexData() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this function used anywhere. Ideally we would not touch the internals of the class in any case.
auto batch = data::SparsePageAdapterBatch{page.GetView()}; | ||
this->Init(batch, std::numeric_limits<float>::quiet_NaN(), gmat, sparse_threshold, n_threads); | ||
} | ||
const ColumnListType& GetColumnList() const { return column_list_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to return the whole vector? Why not just GetColumn(feature_idx)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in opt_partition_builder.h (from main PR) we need to access many columns many times -> therefore a reference to the whole vector is returned to avoid additional perf cost (increment, decrement) caused by returning shared_ptr for each column.
auto const nfeature = static_cast<bst_feature_t>(gmat.cut.Ptrs().size() - 1); | ||
const size_t nrow = gmat.row_ptr.size() - 1; | ||
// get index data ptr | ||
const ByteType* GetIndexData() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
// first_row_id is the first row in the leaf partition | ||
const size_t* row_data = RowIndices(); | ||
template <typename BinIdxType, typename CastType = Column::BinCmpType> | ||
CastType GetBinIdx(size_t rid, size_t* state) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this method used anywhere in this PR?
I think we really need more information on what state
is and why it's necessary e.g. It seems to be used to speed up the search for rows (only in the case of sparse data) by keeping track of rose already seen. Also, the function GetBinIdx() would not work if the rows were not accessed in sorted order - this would be a trap for someone trying to use this code in future, so a comment would help. typename BinIdxType
looks unused.
In my opinion this PR should not be merged. It looks like unrelated pieces of code copy pasted from the original PR. It's definitely possible to implement the desired improvements by gradual refactor, where each part improves existing code and makes sense. Other contributors have been able to do this for very complicated changes, even with most PRs under 200 lines. |
This is a part of #7208 .
I tried to separate changes into the independent pull-requests for simplification of the reviewing. In the current one only the changes in ColumnMatrix are implemented.