Optimization/apply split/column matrix #7991

razdoburdin · 2022-06-14T15:08:07Z

This is a part of #7208 .
I tried to separate changes into the independent pull-requests for simplification of the reviewing. In the current one only the changes in ColumnMatrix are implemented.

…Matrix are implemented.

trivialfis · 2022-06-17T10:45:20Z

We are trying to unblock the CI, will keep you posted.

trivialfis

Apologies for the long wait. I have restored the CI, please rebase to master branch. It seems the PR is reverting some of the latest refactorings. Could you please provide a brief description on what the PR does and why do you need to make those changes?

trivialfis · 2022-06-21T17:48:57Z

src/common/partition_builder.h

@@ -66,10 +65,12 @@ class PartitionBuilder {
    auto p_row_indices = row_indices.data();
    auto n_samples = row_indices.size();

+    const uint32_t first_row_id = n_samples > 0 ? p_row_indices[0] : 0;
+    size_t state = column->GetInitialState(first_row_id);


I hid the state inside the column and made the column an object similar to an iterator. Can we keep that small abstraction?

This PR is a part of big optimization. In another parts partition_builder.h will be fully rewritten (see here for a reference).Thus I think we can not to pay much attention to changes in the current realization in partition_builder.h

Got it. I saw some new changes in the original PR, will look into it Monday.

trivialfis · 2022-06-21T17:49:29Z

src/data/gradient_index.cc

@@ -60,16 +61,17 @@ GHistIndexMatrix::GHistIndexMatrix(DMatrix *p_fmat, bst_bin_t max_bins_per_feat,
 GHistIndexMatrix::~GHistIndexMatrix() = default;

 void GHistIndexMatrix::PushBatch(SparsePage const &batch, common::Span<FeatureType const> ft,
-                                 bst_bin_t n_total_bins, int32_t n_threads) {
+                                 size_t rbegin, bst_bin_t n_total_bins, int32_t n_threads) {


The rbegin is no longer necessary

trivialfis · 2022-06-21T17:51:44Z

src/common/column_matrix.h

+  }
+
+  // construct column matrix from GHistIndexMatrix
+  inline void Init(SparsePage const& page, const GHistIndexMatrix& gmat, double sparse_threshold,


The Batch is for preparing the column matrix to use adapters https://github.com/dmlc/xgboost/blob/master/src/data/adapter.h . We want to avoid SparsePage as it might consume 3 times more memory than the original data.

razdoburdin · 2022-06-23T07:44:20Z

Dear @trivialfis ,
thanks for the review and comments.

This block of changes is the part of the optimization proposed in #7208. @ShvetsKS is busy in other projects, so I will help to adapt the code to the acceptable view.

One of the main problem with the original PR #7208 is high amount of changing code in one PR. For this reason I separated one significant block of logic in this PR for simplification of reviewing and more easy maintaining.

After this PR I plan to refactor the optimized partition builder from the original PR #7208.
So the presented PR should be considered only as a part of a much more complicated optimization.

trivialfis · 2022-07-02T18:36:08Z

Apologies for the confusion and delay. I appreciate the effort in making smaller PRs. Would you like to have a chat over zoom or any other platforms? I'm a little bit confused by the changes.

razdoburdin · 2022-07-04T14:34:49Z

Dear @trivialfis ,
it is a good idea. We use Telegram for conversations. I can add you to the chat if it is suitable for you, if not we can use zoom.

trivialfis · 2022-07-05T09:35:33Z

@razdoburdin Thank you, please add @RAMitchell as well.

razdoburdin · 2022-07-05T09:48:41Z

@razdoburdin Thank you, please add @RAMitchell as well.

I sent the invitation to the email from your github profiles

RAMitchell

This doesn't really look like changes related to anything, as the new methods are not used.

If I was looking to break up the original PR into parts, I would start by modifying interfaces. For example the current partition builder has an interface designed to perform partitioning for a single node - it needs to change to accept multiple nodes. Along with this you could start moving some of the code here

xgboost/src/tree/updater_quantile_hist.h

Line 154 in 2365f82

    
           common::ParallelFor2d(space, ctx->Threads(), [&](size_t node_in_set, common::Range1d r) {

that concerns parallelism from outside the PartitionBuilder class to inside.

Once the interfaces are changed it becomes much easier to change the internals.

RAMitchell · 2022-07-19T08:56:47Z

src/common/column_matrix.h

-    const size_t* p = std::lower_bound(row_data, row_data + column_size, first_row_idx);
-    // column_size if all missing
-    idx_ = p - row_data;
+    if (!((*state) < column_size)) {


Perhaps if (*state >= column_size) is easier to read?

RAMitchell · 2022-07-19T09:07:26Z

src/common/column_matrix.h

+   * TODO: Delete after the optimizied partition builder will be implemented
+   */
+  template <typename CastType = Column::BinCmpType>
+  CastType GetBinId(size_t idx, size_t* state) const {


Looks like the only difference here is having one less template parameter?

This method only needs for maintaining the compatibility with the current partition builder. When the optimized partition builder from the main PR is implemented, this method will be deleted.

RAMitchell · 2022-07-19T09:44:58Z

src/common/column_matrix.h

  // get number of features
  bst_feature_t GetNumFeature() const { return static_cast<bst_feature_t>(type_.size()); }
+  // get index data ptr
+  template <typename Data>
+  const Data* GetIndexData() const {


I don't see this function used anywhere. Ideally we would not touch the internals of the class in any case.

RAMitchell · 2022-07-19T09:47:24Z

src/common/column_matrix.h

-    auto batch = data::SparsePageAdapterBatch{page.GetView()};
-    this->Init(batch, std::numeric_limits<float>::quiet_NaN(), gmat, sparse_threshold, n_threads);
-  }
+  const ColumnListType& GetColumnList() const { return column_list_; }


Any reason to return the whole vector? Why not just GetColumn(feature_idx)?

Yes, in opt_partition_builder.h (from main PR) we need to access many columns many times -> therefore a reference to the whole vector is returned to avoid additional perf cost (increment, decrement) caused by returning shared_ptr for each column.

RAMitchell · 2022-07-19T09:48:14Z

src/common/column_matrix.h

-    auto const nfeature = static_cast<bst_feature_t>(gmat.cut.Ptrs().size() - 1);
-    const size_t nrow = gmat.row_ptr.size() - 1;
+  // get index data ptr
+  const ByteType* GetIndexData() const {


Same as above.

RAMitchell · 2022-07-19T09:58:24Z

src/common/column_matrix.h

-    // first_row_id is the first row in the leaf partition
-    const size_t* row_data = RowIndices();
+  template <typename BinIdxType, typename CastType = Column::BinCmpType>
+  CastType GetBinIdx(size_t rid, size_t* state) const {


Is this method used anywhere in this PR?

I think we really need more information on what state is and why it's necessary e.g. It seems to be used to speed up the search for rows (only in the case of sparse data) by keeping track of rose already seen. Also, the function GetBinIdx() would not work if the rows were not accessed in sorted order - this would be a trap for someone trying to use this code in future, so a comment would help. typename BinIdxType looks unused.

RAMitchell · 2022-07-22T12:27:12Z

In my opinion this PR should not be merged. It looks like unrelated pieces of code copy pasted from the original PR.

It's definitely possible to implement the desired improvements by gradual refactor, where each part improves existing code and makes sense. Other contributors have been able to do this for very complicated changes, even with most PRs under 200 lines.

dmitry.razdoburdin added 5 commits June 13, 2022 07:27

Implementation and tests of column_matrix from dmlc#7208

c30367c

This is a part of large PR dmlc#7208. Here only the changes in Column…

3d7ba63

…Matrix are implemented.

Cleaning the code

1641f91

Fixing the linting

74e22dd

Fixing a bug that has lead to some fails in unit tests

dcc9eb6

razdoburdin mentioned this pull request Jun 17, 2022

Partition optimization #7208

Open

trivialfis reviewed Jun 21, 2022

View reviewed changes

RAMitchell reviewed Jul 19, 2022

View reviewed changes

razdoburdin mentioned this pull request Aug 26, 2022

Optimization/applysplit/common row partitioner #8204

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization/apply split/column matrix #7991

Optimization/apply split/column matrix #7991

razdoburdin commented Jun 14, 2022

trivialfis commented Jun 17, 2022

trivialfis left a comment

trivialfis Jun 21, 2022

razdoburdin Jun 23, 2022

trivialfis Jul 2, 2022

trivialfis Jun 21, 2022

trivialfis Jun 21, 2022

razdoburdin commented Jun 23, 2022

trivialfis commented Jul 2, 2022

razdoburdin commented Jul 4, 2022

trivialfis commented Jul 5, 2022

razdoburdin commented Jul 5, 2022

RAMitchell left a comment

RAMitchell Jul 19, 2022

RAMitchell Jul 19, 2022

razdoburdin Jul 22, 2022

RAMitchell Jul 19, 2022

RAMitchell Jul 19, 2022

makart19 Jul 21, 2022

RAMitchell Jul 19, 2022

RAMitchell Jul 19, 2022

RAMitchell commented Jul 22, 2022

Optimization/apply split/column matrix #7991

Are you sure you want to change the base?

Optimization/apply split/column matrix #7991

Conversation

razdoburdin commented Jun 14, 2022

trivialfis commented Jun 17, 2022

trivialfis left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

razdoburdin commented Jun 23, 2022

trivialfis commented Jul 2, 2022

razdoburdin commented Jul 4, 2022

trivialfis commented Jul 5, 2022

razdoburdin commented Jul 5, 2022

RAMitchell left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RAMitchell commented Jul 22, 2022