
[Breaking] Let's make histogram method the default #7049

Closed
wants to merge 15 commits

Conversation

RukhovichIV
Contributor

@RukhovichIV commented Jun 18, 2021

We suggest changing the default tree method for large datasets from approx to hist, as it's much faster. This way, XGBoost will perform better for those users who don't choose a tree method themselves.
An attempt was already made in #5178, but the PR wasn't merged.
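For illustration, here is a minimal sketch (mine, not code from this PR) of pinning the tree method explicitly, which is what "auto" would effectively resolve to under this proposal; the toy dataset and parameter values are placeholders:

```python
# Hedged illustration (not from the PR): pinning tree_method explicitly
# makes behavior independent of what "auto" resolves to. The toy data
# below is a stand-in, not one of the benchmark datasets.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "tree_method": "hist"}
booster = xgb.train(params, dtrain, num_boost_round=100)
```

With "auto", current releases pick approx for sufficiently large data; the proposal is to pick hist instead.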

Here are the perf measurements:

Intel(R) Xeon(R) Platinum 8280L CPU
2 sockets, 28 cores per socket, HT:on

| dataset | hist training, s | approx training, s | hist training speedup | test metric name | metric difference (hist - approx) |
|---|---|---|---|---|---|
| airline | 120.38 | 5588.1 | 46.4 | accuracy | 0 |
| bosch | 21.26 | 56.39 | 2.7 | accuracy | 0 |
| covtype | 7.23 | 248.71 | 34.4 | accuracy | 0.00533 |
| epsilon | 319.06 | 226.47 | 0.7 | accuracy | 0 |
| fraud | 0.92 | 6.21 | 6.7 | accuracy | 0 |
| higgs | 18.7 | 448.81 | 24 | accuracy | 0 |
| airline-ohe | 53.04 | 1678.80 | 31.7 | accuracy | 0 |
| higgs1m | 23.43 | 352.28 | 15 | accuracy | 0 |
| letters | 68.68 | 157.31 | 2.3 | accuracy | 0.00100 |
| mlsr | 82.03 | 1428.85 | 17.4 | accuracy | 0.00112 |
| plasticc | 3.16 | 4.39 | 1.4 | accuracy | 0.22073 |
| santander | 220.01 | 281.5 | 1.3 | accuracy | 0 |
| year_prediction_msd | 4.44 | 16.7 | 3.8 | RMSE | -0.00039 |
| abalone | 2.56 | 4.68 | 1.8 | RMSE | -0.11244 |
| mortgage1Q | 16.35 | 360.9 | 22.1 | RMSE | -0.00012 |
| url | 34.09 | 92.63 | 2.7 | train RMSE | -0.00179 |

The geometric mean of the training-time speedups is 5.667. We still have a slowdown on epsilon, and we're working on that case right now.
Metrics are equal to or better than approx's in all cases.

@trivialfis
Member

Restarted the CI.

There are some issues with hist, like its incomplete support for external memory.

@SmirnovEgorRu
Contributor

@trivialfis, do you prefer to have auto keep approx when external memory is used? Here:

  } else if (!fmat->SingleColBlock()) {
    LOG(INFO) << "Tree method is automatically set to 'hist' "
                 "since external-memory data matrix is used.";
    tparam_.tree_method = TreeMethod::kHist;
  }

@trivialfis
Member

I made a similar attempt before; I will look into this again. Thanks for running the comprehensive benchmark.

@trivialfis self-requested a review Jun 28, 2021
@trivialfis self-assigned this Jun 28, 2021
@trivialfis
Member

Hi, could you please take a look at the failing JVM tests?

@RukhovichIV
Contributor Author

RukhovichIV commented Jun 29, 2021

> There are some issues with hist, like its incomplete support for external memory.

It seems like the problem still appears even if we use approx for external memory (`if (!fmat->SingleColBlock()) tparam_.tree_method = TreeMethod::kApprox;`). Do you have any other ideas?

Most of the tests (5 of 7) are failing at a place like this:
https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifierSuite.scala#L222

Here we perform training with ScalaXGBoost and with the usual C++ interface, then compare their results and see a difference.
The error only occurs when auto or hist is passed as tree_method; everything works fine when approx is used (as in the last commit 715ca2b).

@trivialfis
Member

Sorry for the late reply.

Hey, @hcho3 @CodingCat @RAMitchell could you please join the discussion?

@hcho3
Collaborator

hcho3 commented Jul 6, 2021

I am in favor of making hist the default method, except when external memory is used.

@trivialfis
Member

I don't have a strong feeling. But I'm migrating the approx tree method to hist's code base (for example, #7079) to get uniform categorical data support. After the migration I expect the approx tree method to become much faster.

@hcho3
Collaborator

hcho3 commented Jul 6, 2021

@trivialfis Do you have a tracking issue for merging approx with hist?

@trivialfis
Member

trivialfis commented Jul 6, 2021

@hcho3 I don't. It's a mix of different refactors that I need to work through.

@trivialfis
Member

I can merge these items into the categorical data support tracker.

@hcho3
Collaborator

hcho3 commented Jul 6, 2021

@trivialfis So is it fair to say that there are some useful utility functions in approx that you'd like to see merged into hist? So far, our approach has been to direct all development effort to the hist method.

@trivialfis
Member

> So is it fair to say that there are some useful utility functions in approx that you'd like to see merged into hist?

There are two features in approx that I'm not willing to remove:

  • External memory support.
  • Using the hessian as weights.

> ...direct all development effort to the hist method.

That's the reason I need to migrate approx to hist's code base and make things as reusable as possible: whatever improvement goes into hist will then go into approx for free.

@trivialfis
Member

> I can merge these items into the categorical data support tracker.

Done.

@SmirnovEgorRu marked this pull request as draft Jul 6, 2021
@SmirnovEgorRu
Contributor

@trivialfis, @hcho3,
If we look at the table above, we can observe two things:

  • hist is always at least as good as approx in accuracy/MSE, at least for the datasets we benchmarked. We could probably tune approx's parameters to improve its metrics, but out of the box it's worse.
  • hist is better in terms of performance (5.7x on average), and we see opportunities to make it even better in the future.

Another point is alignment:

  • The GPU version of XGBoost has only the hist method, while the CPU default is approx for large data. hist on CPU and GPU is expected to give pretty similar accuracy results (the math is the same; mostly FP error can affect the results). But right now, when a user switches the device CPU <-> GPU, they see different results, because different methods are used by default.
  • LightGBM uses hist by default.

Based on the above, I prefer to make hist the default. The only exception is external memory; in that case we can use approx and think about how to support it fully in hist.

Unifying the code of approx and hist is of course a good idea, but it's not closely related to the topic of this PR; I think hist will be faster anyway.

@trivialfis
Member

trivialfis commented Jul 7, 2021

Thanks for the detailed explanation! First of all, I also prefer changing the tree method, but I think we need more work than setting the parameter alone. A few reasons I haven't merged this PR yet:

  1. The comparison carried out here compares the implementations, not the algorithms. In theory:
  • Is approx inherently slower than hist? Yes and no. Yes, because it needs to run sketching at the beginning of every iteration; no, because if I added a condition to skip sketching for constant-hessian objectives, it would be exactly the same as hist.
  • Is it inherently less accurate than hist? No: with constant-hessian objectives like reg:squarederror, they should produce identical results.

But we know from the accuracy results here that neither holds in practice. For the differing outputs, my guess is the difference in parameters: for hist the tuning parameter for the number of split candidates is max_bin, but for approx it's sketch_ratio + sketch_eps, and the default is much lower than 256 if you translate it back to max_bin (see the back-of-the-envelope sketch after this list). I will unify them during the refactor.

  2. After "Export Python Interface for external memory" (#7070) is merged (I'm splitting it up for review, and a few of the smaller parts are already merged), the external memory implementation should be fairly easy. After that, we can have one algorithm that works out of the box for most scenarios, including most of the training parameters. I'm trying to avoid adding more auto-configurations that somehow change results and performance dramatically (hence "consistent"). Throw an error if something is not implemented; don't configure around it.
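As a back-of-the-envelope sketch of the parameter gap from point 1 (my illustration, not from the PR; it assumes the documented defaults of the time, sketch_eps = 0.03 for approx and max_bin = 256 for hist, and the docs' rough O(1/sketch_eps) translation):

```python
# Rough comparison of default split-candidate counts per feature.
# Assumed defaults (from the XGBoost docs of the era, not this PR):
#   approx: sketch_eps = 0.03, giving roughly O(1/sketch_eps) candidates
#   hist:   max_bin = 256 histogram bins
sketch_eps = 0.03
max_bin = 256

approx_candidates = 1 / sketch_eps  # ~33 candidates per feature
print(f"approx: ~{approx_candidates:.0f} candidates, hist: {max_bin} bins")
```

So with default settings, approx enumerates roughly an order of magnitude fewer split candidates than hist, which is consistent with the small metric gaps in the table above.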

> But right now, when a user switches the device CPU <-> GPU, they see different results, because different methods are used by default.

My personal preference is to remove the name gpu_hist completely, as discussed in #6212 (comment). We can continue the discussion there; I linked another PR there with detailed notes on the caveats of the current parameter set.

Having said that, I'm looking forward to changing the default tree method to hist, but we need to handle external memory properly (without auto-configuration) and get comparison results from a more unified implementation.

These are my personal preferences. I'm looking forward to your replies. ;-)

@trivialfis changed the title from Let's make histogram method the default to [Breaking] Let's make histogram method the default Jul 7, 2021
@trivialfis
Member

trivialfis commented Jul 13, 2021

Please ignore the R failure for now. It's caused by a stalled R cache on GitHub Actions.

#7102

@codecov-commenter

codecov-commenter commented Jul 16, 2021

Codecov Report

Merging #7049 (59b40a6) into master (d7c1449) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #7049   +/-   ##
=======================================
  Coverage   81.60%   81.60%           
=======================================
  Files          13       13           
  Lines        3903     3903           
=======================================
  Hits         3185     3185           
  Misses        718      718           


@trivialfis
Member

Running some tests with the latest implementation of external memory. Hopefully I can narrow down the failures.

@RukhovichIV
Contributor Author

Let's try to take another look here.

As we know, the hist method works much faster on large datasets. At the moment, the threshold for choosing between hist and exact in the auto heuristic is too high (2^22, or ~4M rows). We compared the performance and metrics of hist and exact on many workloads and concluded that 2^18 (~260k rows) would be the optimal threshold. Below are brief tables with the best thresholds for different workloads.

We chose the best threshold based on the training time and two test metrics for each case. The threshold was grid-searched over powers of 2, starting from 256. We used accuracy + log_loss for classification and RMSE + R2 for regression. "Optimal threshold" means the minimum data size at which hist starts performing at least as well as exact.

Before training started, each dataset was randomly shuffled. Then the first N rows of the training dataset were selected for training, while the full testing dataset was used for evaluation. The procedure was repeated for hist and exact; a sketch of the search is shown below.
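A minimal sketch of this search procedure (hypothetical code, not from the PR: the helper name, the use of accuracy as the metric, and NumPy arrays as inputs are all assumptions):

```python
# Hypothetical sketch of the threshold grid search described above:
# shuffle once, train on the first n rows with each method, evaluate
# on the full test set, and double n until hist catches up with exact.
import numpy as np
import xgboost as xgb
from sklearn.metrics import accuracy_score

def optimal_threshold(X_train, y_train, X_test, y_test, max_n):
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X_train))       # shuffle the training set once
    X_train, y_train = X_train[idx], y_train[idx]
    n = 256                                   # grid search starts at 256
    while n <= max_n:
        scores = {}
        for method in ("hist", "exact"):
            clf = xgb.XGBClassifier(tree_method=method)
            clf.fit(X_train[:n], y_train[:n])  # first n shuffled rows
            scores[method] = accuracy_score(y_test, clf.predict(X_test))
        if scores["hist"] >= scores["exact"]:  # hist at least as good
            return n                           # minimum such data size
        n *= 2                                 # next power of 2
    return None
```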

Classification task:

| dataset | train size | optimal train threshold | optimal accuracy threshold | optimal cross-entropy threshold |
|---|---|---|---|---|
| airline-ohe | 1M | 4096 | 256 | 262144 |
| higgs1m | 1M | 512 | 256 | 262144 |
| letters | 16k | 4096 | 256 | 2048 |
| plasticc | 7k | 2048 | 256 | 256 |
| santander | 190k | 32768 | 256 | 8192 |
| airline | 92M | 256 | 256 | 262144 |
| bosch | 1.184M | 131072 | 256 | 131072 |
| epsilon | 400k | 131072 | 256 | 400000 |
| fraud | 228k | 4096 | 256 | 65536 |
| higgs | 8.8M | 512 | 256 | 65536 |
| mlsr | 3.02M | 16384 | 16384 | 8192 |

Regression task:

| dataset | train size | optimal train threshold | optimal RMSE threshold | optimal R2 threshold |
|---|---|---|---|---|
| abalone | ~3.3k | 256 | 4096 | 4096 |
| year | 464k | 16384 | 262144 | 262144 |
| mortgage1q | 9.01M | 1024 | 65536 | 65536 |

HW:
CPU: Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
Socket(s): 2
Core(s) per socket: 28
Thread(s) per core: 2
RAM: 24*16G

The full table with all numbers can be found here

@trivialfis
Member

> the threshold for choosing between hist and exact in the auto heuristic is too high (2^22, or ~4M rows).

If we can proceed with this change, let's remove the selection altogether: just use one algorithm (hist) as the default instead of "auto".

@trivialfis added this to 2.0 in 2.0 Roadmap Oct 21, 2021
@trivialfis moved this from 2.0 to 1.6 in 2.0 Roadmap Oct 22, 2021
@trivialfis moved this from 1.6 TO DO to 2.0 in 2.0 Roadmap Mar 31, 2022
@trivialfis moved this from 2.0 TODO to 2.0 Done in 2.0 Roadmap Jun 27, 2023