Using column_sampler for optimization of ColWiseBuildHist #8319

razdoburdin · 2022-10-07T14:20:09Z

Hi,
this PR is part of #7192.
It continues the optimization of BuildHistKernel that was started in #8233.
In #8233 I removed the column_sampler-related optimization to simplify the review. This PR brings it back.

The benchmarking methodology is the same as in #8218. The only dataset affected by the optimization is santander, where training is about twice as fast with this change. You can find the exact numbers in #8233.

I look forward to your review and comments!

hcho3 · 2022-10-07T18:08:03Z

Please ignore the failing macOS tests from GitHub Actions. We're working to fix them.

razdoburdin · 2022-10-12T14:34:48Z

Hi @trivialfis,
what is your opinion of this optimization?

trivialfis · 2022-10-14T14:09:51Z

Apologies for the slow response, I will get to this tomorrow.

razdoburdin · 2022-10-21T09:28:54Z

> Apologies for the slow response, I will get to this tomorrow.

Hi,
have you reached a decision about this PR?

trivialfis

Apologies for the delay. May I ask what the specific objective of this PR is? Are you trying to optimize the case where the histogram is built for fewer columns when sampling is enabled?

I'm concerned about optimizations that don't generalize and require edge cases. I haven't been able to track down which test confirms the correctness of the code.

trivialfis · 2022-10-21T11:02:25Z

src/common/hist_util.cc

@@ -284,13 +300,14 @@ void ColsWiseBuildHistKernel(const std::vector<GradientPair> &gpair,
  };

  const size_t n_features = gmat.cut.Ptrs().size() - 1;
-  const size_t n_columns = n_features;
+  const size_t n_columns = kColumnSampling ? fids.size() : n_features;


Is fids empty if there's no sampling?

fids is empty when the condition here is false.
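To make the role of fids in the hunk above concrete, here is a minimal sketch of a column-wise histogram kernel templated on kColumnSampling. It is not the XGBoost implementation: GradPair, BuildHistColWise, and the dense row-major bin_index layout are hypothetical names and assumptions introduced only for illustration. When kColumnSampling is false, fids is never read, so it can be left empty.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a gradient pair; not the XGBoost type.
struct GradPair {
  double grad;
  double hess;
};

// Assumed toy layout: bin_index[row * n_features + fid] holds the global
// histogram bin of feature `fid` for that row.
template <bool kColumnSampling>
void BuildHistColWise(const std::vector<GradPair>& gpair,
                      const std::vector<std::uint32_t>& bin_index,
                      std::size_t n_features,
                      const std::vector<std::size_t>& fids,
                      std::vector<GradPair>* hist) {
  assert(n_features > 0 && bin_index.size() % n_features == 0);
  const std::size_t n_rows = bin_index.size() / n_features;
  assert(gpair.size() == n_rows);

  // Same selection as in the diff: visit only the sampled columns when
  // sampling is enabled, otherwise visit every feature.
  const std::size_t n_columns = kColumnSampling ? fids.size() : n_features;
  for (std::size_t cid = 0; cid < n_columns; ++cid) {
    // fids is read only in the sampling instantiation.
    const std::size_t fid = kColumnSampling ? fids[cid] : cid;
    for (std::size_t row = 0; row < n_rows; ++row) {
      const std::uint32_t bin = bin_index[row * n_features + fid];
      (*hist)[bin].grad += gpair[row].grad;
      (*hist)[bin].hess += gpair[row].hess;
    }
  }
}
```

With sampling enabled, the outer loop shrinks from n_features to fids.size() iterations, which is where the saving would come from when only a small fraction of columns is sampled.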

trivialfis · 2022-10-21T11:03:51Z

src/tree/hist/histogram.h

@@ -76,6 +84,20 @@ class HistogramBuilder {
      buffer_.Reset(this->n_threads_, n_nodes, space, target_hists);
    }

+    constexpr float kColsampleTh = 0.1;
+    bool column_sampling = (column_sampler_ != nullptr) &&


When is the column sampler nullptr?

I set it to nullptr in the histogram builder tests that do not use this optimization.

trivialfis · 2022-10-21T11:04:29Z

src/tree/hist/histogram.h

@@ -76,6 +84,20 @@ class HistogramBuilder {
      buffer_.Reset(this->n_threads_, n_nodes, space, target_hists);
    }

+    constexpr float kColsampleTh = 0.1;


It is an ad-hoc threshold value.

trivialfis · 2022-10-21T11:05:00Z

src/tree/hist/histogram.h

+    constexpr float kColsampleTh = 0.1;
+    bool column_sampling = (column_sampler_ != nullptr) &&
+                           (train_param_.colsample_bytree < kColsampleTh ||
+                            train_param_.colsample_bylevel < kColsampleTh);


How about bynode? Is it used?

Not now. These options could be investigated more deeply later.
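The condition in the hunk above can be read as a small predicate. The sketch below restates it in isolation; ToyTrainParam and UseColumnSampling are hypothetical names, and only the 0.1 threshold and the bytree/bylevel checks come from the diff. As discussed, colsample_bynode is deliberately not consulted.

```cpp
// Hypothetical parameter holder; the real training parameters live elsewhere.
struct ToyTrainParam {
  float colsample_bytree = 1.0f;
  float colsample_bylevel = 1.0f;
  float colsample_bynode = 1.0f;  // present, but not part of the check
};

// Ad-hoc threshold taken from the diff.
constexpr float kColsampleTh = 0.1f;

// Mirrors the condition in the diff: the sampled path is taken only when a
// sampler exists and either bytree or bylevel is below the threshold.
inline bool UseColumnSampling(const void* column_sampler,
                              const ToyTrainParam& param) {
  return column_sampler != nullptr &&
         (param.colsample_bytree < kColsampleTh ||
          param.colsample_bylevel < kColsampleTh);
}
```

A caller could use the returned flag to pick between the kColumnSampling = true and kColumnSampling = false instantiations of the kernel.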

razdoburdin · 2022-10-21T11:35:21Z

> Apologies for the delay. May I ask the specific objective of this PR? Are you trying to optimize the case to build for less columns when sampling is enabled?

Yes, that was the idea.

> I'm concerned with optimizations that don't generalize and requires edge cases. I haven't been able to track which test can confirm the correctness of the code.

That is reasonable. I added a subcase to TestEvaluateSplits that forces the code path with kColumnSampling = true. The idea is to fill the fids vector manually, which is enough to push execution into the kColumnSampling = true branch.
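One way such a test could assert correctness is to compare the histogram from the sampled path against the one from the full path. The helper below is a sketch of that comparison, not the check used in TestEvaluateSplits: SampledHistMatches is a hypothetical name, histograms are reduced to one double per bin, cut_ptrs plays the role of gmat.cut.Ptrs(), and the expectation that unsampled bins stay zero is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when `sampled` equals `full` on every bin owned by a feature
// listed in `fids` and is zero on all other bins.
// cut_ptrs[f] .. cut_ptrs[f + 1] is the bin range of feature f.
inline bool SampledHistMatches(const std::vector<double>& full,
                               const std::vector<double>& sampled,
                               const std::vector<std::uint32_t>& cut_ptrs,
                               const std::vector<std::size_t>& fids) {
  if (full.size() != sampled.size() || cut_ptrs.empty() ||
      cut_ptrs.back() != full.size()) {
    return false;
  }
  // Mark the bins that belong to the manually chosen features.
  std::vector<bool> selected(full.size(), false);
  for (std::size_t fid : fids) {
    if (fid + 1 >= cut_ptrs.size()) return false;  // unknown feature id
    for (std::uint32_t b = cut_ptrs[fid]; b < cut_ptrs[fid + 1]; ++b) {
      selected[b] = true;
    }
  }
  // Exact comparison keeps the sketch short; a real test would more likely
  // compare floating-point sums with a tolerance.
  for (std::size_t b = 0; b < full.size(); ++b) {
    const double expected = selected[b] ? full[b] : 0.0;
    if (sampled[b] != expected) return false;
  }
  return true;
}
```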

trivialfis · 2022-10-27T16:25:42Z

Let me try to look deeper into this PR. The optimization is definitely important, but I would like it to have fewer conditional branches than it currently does.

dmitry.razdoburdin added 3 commits October 7, 2022 03:54

Introducing the optimization

1806034

Remove unused changes

8eca929

Fix

d6f2977

trivialfis reviewed Oct 21, 2022
