Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization/apply split/column matrix #7991

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

razdoburdin
Copy link
Contributor

This is a part of #7208 .
I tried to separate changes into the independent pull-requests for simplification of the reviewing. In the current one only the changes in ColumnMatrix are implemented.

@trivialfis
Copy link
Member

We are trying to unblock the CI, will keep you posted.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the long wait. I have restored the CI, please rebase to master branch. It seems the PR is reverting some of the latest refactorings. Could you please provide a brief description on what the PR does and why do you need to make those changes?

@@ -66,10 +65,12 @@ class PartitionBuilder {
auto p_row_indices = row_indices.data();
auto n_samples = row_indices.size();

const uint32_t first_row_id = n_samples > 0 ? p_row_indices[0] : 0;
size_t state = column->GetInitialState(first_row_id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hid the state inside the column and made the column an object similar to an iterator. Can we keep that small abstraction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is a part of big optimization. In another parts partition_builder.h will be fully rewritten (see here for a reference).Thus I think we can not to pay much attention to changes in the current realization in partition_builder.h

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I saw some new changes in the original PR, will look into it Monday.

@@ -60,16 +61,17 @@ GHistIndexMatrix::GHistIndexMatrix(DMatrix *p_fmat, bst_bin_t max_bins_per_feat,
GHistIndexMatrix::~GHistIndexMatrix() = default;

void GHistIndexMatrix::PushBatch(SparsePage const &batch, common::Span<FeatureType const> ft,
bst_bin_t n_total_bins, int32_t n_threads) {
size_t rbegin, bst_bin_t n_total_bins, int32_t n_threads) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rbegin is no longer necessary

}

// construct column matrix from GHistIndexMatrix
inline void Init(SparsePage const& page, const GHistIndexMatrix& gmat, double sparse_threshold,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Batch is for preparing the column matrix to use adapters https://github.com/dmlc/xgboost/blob/master/src/data/adapter.h . We want to avoid SparsePage as it might consume 3 times more memory than the original data.

@razdoburdin
Copy link
Contributor Author

Dear @trivialfis ,
thanks for the review and comments.

This block of changes is the part of the optimization proposed in #7208. @ShvetsKS is busy in other projects, so I will help to adapt the code to the acceptable view.

One of the main problem with the original PR #7208 is high amount of changing code in one PR. For this reason I separated one significant block of logic in this PR for simplification of reviewing and more easy maintaining.

After this PR I plan to refactor the optimized partition builder from the original PR #7208.
So the presented PR should be considered only as a part of a much more complicated optimization.

@trivialfis
Copy link
Member

Apologies for the confusion and delay. I appreciate the effort in making smaller PRs. Would you like to have a chat over zoom or any other platforms? I'm a little bit confused by the changes.

@razdoburdin
Copy link
Contributor Author

Dear @trivialfis ,
it is a good idea. We use Telegram for conversations. I can add you to the chat if it is suitable for you, if not we can use zoom.

@trivialfis
Copy link
Member

@razdoburdin Thank you, please add @RAMitchell as well.

@razdoburdin
Copy link
Contributor Author

@razdoburdin Thank you, please add @RAMitchell as well.

I sent the invitation to the email from your github profiles

Copy link
Member

@RAMitchell RAMitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really look like changes related to anything, as the new methods are not used.

If I was looking to break up the original PR into parts, I would start by modifying interfaces. For example the current partition builder has an interface designed to perform partitioning for a single node - it needs to change to accept multiple nodes. Along with this you could start moving some of the code here

common::ParallelFor2d(space, ctx->Threads(), [&](size_t node_in_set, common::Range1d r) {
that concerns parallelism from outside the PartitionBuilder class to inside.

Once the interfaces are changed it becomes much easier to change the internals.

const size_t* p = std::lower_bound(row_data, row_data + column_size, first_row_idx);
// column_size if all missing
idx_ = p - row_data;
if (!((*state) < column_size)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps if (*state >= column_size) is easier to read?

* TODO: Delete after the optimizied partition builder will be implemented
*/
template <typename CastType = Column::BinCmpType>
CastType GetBinId(size_t idx, size_t* state) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the only difference here is having one less template parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method only needs for maintaining the compatibility with the current partition builder. When the optimized partition builder from the main PR is implemented, this method will be deleted.

// get number of features
bst_feature_t GetNumFeature() const { return static_cast<bst_feature_t>(type_.size()); }
// get index data ptr
template <typename Data>
const Data* GetIndexData() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this function used anywhere. Ideally we would not touch the internals of the class in any case.

auto batch = data::SparsePageAdapterBatch{page.GetView()};
this->Init(batch, std::numeric_limits<float>::quiet_NaN(), gmat, sparse_threshold, n_threads);
}
const ColumnListType& GetColumnList() const { return column_list_; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to return the whole vector? Why not just GetColumn(feature_idx)?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in opt_partition_builder.h (from main PR) we need to access many columns many times -> therefore a reference to the whole vector is returned to avoid additional perf cost (increment, decrement) caused by returning shared_ptr for each column.

auto const nfeature = static_cast<bst_feature_t>(gmat.cut.Ptrs().size() - 1);
const size_t nrow = gmat.row_ptr.size() - 1;
// get index data ptr
const ByteType* GetIndexData() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

// first_row_id is the first row in the leaf partition
const size_t* row_data = RowIndices();
template <typename BinIdxType, typename CastType = Column::BinCmpType>
CastType GetBinIdx(size_t rid, size_t* state) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method used anywhere in this PR?

I think we really need more information on what state is and why it's necessary e.g. It seems to be used to speed up the search for rows (only in the case of sparse data) by keeping track of rose already seen. Also, the function GetBinIdx() would not work if the rows were not accessed in sorted order - this would be a trap for someone trying to use this code in future, so a comment would help. typename BinIdxType looks unused.

@RAMitchell
Copy link
Member

In my opinion this PR should not be merged. It looks like unrelated pieces of code copy pasted from the original PR.

It's definitely possible to implement the desired improvements by gradual refactor, where each part improves existing code and makes sense. Other contributors have been able to do this for very complicated changes, even with most PRs under 200 lines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants