Call for contribution: improve multi-core CPU performance of 'hist' #3810

Closed
hcho3 opened this issue Oct 18, 2018 · 44 comments

hcho3 (Collaborator) commented Oct 18, 2018

It is about time to tackle the elephant in the room: performance on multi-core CPUs.

Description of Problem

Currently, the hist tree-growing algorithm (tree_method=hist) scales poorly on multi-core CPUs: for some datasets, performance deteriorates as the number of threads is increased. This issue was discovered by @Laurae2's Gradient Boosting Benchmark (GitHub repo).

The scaling behavior for the Bosch dataset is as follows:
[Chart: performance scaling on C5.9xlarge]

Call for Contribution

I have identified the performance bottleneck of the 'hist' algorithm and extracted it into a small repository: hcho3/xgboost-fast-hist-perf-lab. You can try to improve performance by revising src/build_hist.cc.

Some ideas

  • Change the data matrix layout from CSR to other layouts, such as ELLPACK (see the rough sketch below)
  • Try to distribute work more evenly among worker threads. Work imbalance is caused by irregular sparsity patterns of the data matrix.
  • Group complementary features together, a common scenario for one-hot-encoded data.
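To make the first idea concrete, here is a minimal sketch of building a gradient histogram over an ELLPACK-style padded layout. The types and names are illustrative only (not xgboost's actual data structures); the point is that every row costs exactly row_stride iterations, so a static OpenMP schedule distributes work evenly without runtime scheduling overhead:

#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>
#include <omp.h>

// ELLPACK-style layout: every row is padded to the same width (row_stride).
struct EllpackMatrix {
  std::size_t n_rows;
  std::size_t row_stride;                 // maximum number of non-zeros per row
  std::vector<std::uint32_t> bin_index;   // n_rows * row_stride bin ids, padded
  static constexpr std::uint32_t kPad = std::numeric_limits<std::uint32_t>::max();
};

void BuildHistEllpack(const EllpackMatrix& m,
                      const std::vector<double>& grad,   // per-row gradient
                      std::vector<double>* hist) {       // one slot per feature bin
  // Per-thread partial histograms avoid atomics on the shared output.
  std::vector<std::vector<double>> partial(omp_get_max_threads(),
                                           std::vector<double>(hist->size(), 0.0));
  #pragma omp parallel for schedule(static)   // uniform work per row, so static chunking is balanced
  for (std::size_t row = 0; row < m.n_rows; ++row) {
    std::vector<double>& local = partial[omp_get_thread_num()];
    for (std::size_t k = 0; k < m.row_stride; ++k) {
      const std::uint32_t bin = m.bin_index[row * m.row_stride + k];
      if (bin != EllpackMatrix::kPad) local[bin] += grad[row];
    }
  }
  for (const auto& local : partial)           // reduce per-thread histograms into the output
    for (std::size_t b = 0; b < hist->size(); ++b) (*hist)[b] += local[b];
}

The real updater accumulates gradient/hessian pairs and works on the row set of one tree node rather than all rows, but the layout/scheduling trade-off is the same.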
hcho3 (Collaborator, Author) commented Oct 18, 2018

@Laurae2 Thank you for preparing the GBT benchmark. It has been helpful in identifying the problem spot.

trivialfis (Member) commented:

@hcho3 Does OpenMP's guided schedule help with load balancing? If so, the ELLPACK layout won't be very useful.

hcho3 (Collaborator, Author) commented Oct 24, 2018

My guess is that static allocation of work using ELLPACK would achieve a balanced workload with lower overhead than OpenMP's guided or dynamic modes. With dynamic scheduling, you incur the runtime overhead of maintaining a work-stealing queue.
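To illustrate the trade-off being discussed (illustrative names only, not the actual build_hist.cc code): with a CSR layout, rows have different numbers of entries, so the schedule clause decides between load imbalance and scheduling overhead.

#include <cstddef>
#include <cstdint>
#include <vector>

void BuildHistCSR(const std::vector<std::size_t>& row_ptr,   // CSR row offsets (size n_rows + 1)
                  const std::vector<std::uint32_t>& bin,     // CSR entries = feature bin ids
                  const std::vector<double>& grad,           // per-row gradient
                  std::vector<double>* hist) {
  const std::size_t n_rows = row_ptr.size() - 1;
  // schedule(static): near-zero overhead, but a thread that gets the dense rows finishes last.
  // schedule(guided) / schedule(dynamic, chunk): better balance, but the OpenMP runtime has to
  // hand out chunks at run time (the work-queue overhead mentioned above).
  #pragma omp parallel for schedule(guided)
  for (std::size_t row = 0; row < n_rows; ++row) {
    for (std::size_t j = row_ptr[row]; j < row_ptr[row + 1]; ++j) {
      #pragma omp atomic                       // simplified; real code would use per-thread histograms
      (*hist)[bin[j]] += grad[row];
    }
  }
}

With an ELLPACK-style padded layout (see the earlier sketch), the per-row cost is uniform, so schedule(static) alone gives a balanced partition.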

CodingCat (Member) commented:

This might be a bit off-topic, but do we have benchmark results for approx?

We found sub-optimal speedup with multi-threading in our internal environment and want to compare against others' data.

hcho3 (Collaborator, Author) commented Oct 24, 2018

@CodingCat The linked benchmark suite only uses hist. Does approx show performance degradation like hist (e.g. 36 threads slower than 3 threads)?

CodingCat (Member) commented:

@hcho3 Due to limitations in our cluster, we can only test with up to 8 threads, but we see very limited speedup going from 4 to 8.

hcho3 (Collaborator, Author) commented Oct 24, 2018

@CodingCat You mean 8 threads run slower than 4?

Laurae2 (Contributor) commented Oct 24, 2018

@CodingCat approx scales so poorly that I didn't even want to benchmark it. It doesn't scale properly even on my 4-core laptop (3.6 GHz), so I can't imagine what happens with 64 or 72 threads.

@hcho3 I'll take a look at it using your repository with VTune later.


For those who want detailed performance data in VTune, add the following include to the source file:

#include <ittnotify.h> // Intel Instrumentation and Tracing Technology

Add the following before the region you want to track, outside of any loop (rename the strings/variables as needed):

__itt_resume();                                                     // resume collection
__itt_domain* domain = __itt_domain_create("MyDomain");             // profiling domain
__itt_string_handle* task = __itt_string_handle_create("MyTask");   // label for the tracked region
__itt_task_begin(domain, __itt_null, __itt_null, task);             // start of tracked region

Add the following after the region you want to track, outside of any loop:

__itt_task_end(domain);   // end of tracked region
__itt_pause();            // pause collection

Then create a VTune project with the correct parameters for the number of threads, and launch the executable with collection paused so that only the instrumented region is analyzed.
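A minimal, self-contained sketch of the above, assuming VTune's ITT headers and libittnotify are on the include/link path (the exact paths and flags vary by VTune version):

#include <ittnotify.h>   // Intel Instrumentation and Tracing Technology API

volatile double sink = 0.0;   // keeps the dummy work from being optimized away

void BuildHist() {            // hypothetical stand-in for the hot loop under investigation
  for (int i = 0; i < 100000000; ++i) sink += i * 1e-9;
}

int main() {
  __itt_domain* domain = __itt_domain_create("PerfLab");
  __itt_string_handle* task = __itt_string_handle_create("BuildHist");

  __itt_resume();                                           // start collection (launch VTune paused)
  __itt_task_begin(domain, __itt_null, __itt_null, task);
  BuildHist();                                              // region of interest
  __itt_task_end(domain);
  __itt_pause();                                            // stop collection again
  return 0;
}

Link against libittnotify (e.g. -littnotify, plus the VTune include/lib directories) so these calls resolve.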

CodingCat (Member) commented:

@hcho3 It's not slower, but there is maybe only a ~15% speedup with 4 more threads. (If I ran more experiments, I suspect the results would converge.)

CodingCat (Member) commented:

@Laurae2 looks like I am not the only one

Laurae2 (Contributor) commented Oct 24, 2018

@hcho3 If no one else does it first, I'll try to get you scaling results for exact, approx, and hist before the end of this week, all with depth=6, on commit e26b5d6.

I recently migrated my compute server, and this week I'm re-running benchmarks on Bosch on a new machine with 3.7 GHz all-core turbo / 36 cores / 72 threads / 80 GB/s RAM bandwidth.

RAMitchell (Member) commented:

The fast_hist updater should be much faster for distributed xgboost. @CodingCat I am surprised no one has tried to add AllReduce calls so it works in distributed mode.
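For context, a conceptual sketch of what such a sync could look like, assuming rabit's Allreduce interface (illustrative only, not the actual updater code): each worker builds its local histogram from its data shard, then an element-wise sum across workers yields the global histogram, after which split finding proceeds as in the single-node case.

#include <vector>
#include <rabit/rabit.h>

// Sum the local per-node histogram across all distributed workers, in place.
void SyncHistogram(std::vector<double>* hist) {
  rabit::Allreduce<rabit::op::Sum>(hist->data(), hist->size());
}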

hcho3 (Collaborator, Author) commented Oct 24, 2018

@RAMitchell I was pretty new when I wrote the fast_hist updater, so it lacks distributed mode support. I'd like to get to it after 0.81 release.

hcho3 (Collaborator, Author) commented Oct 24, 2018

@Laurae2 FYI, I ran your benchmark suite on a C5.9xlarge machine and the results for XGBoost hist seem to be consistent with your previous results. I can put up the numbers if you'd like.

hcho3 (Collaborator, Author) commented Oct 24, 2018

@Laurae2 Also, I have access to EC2 machines. If you have a script you'd like to run on an EC2 instance, let me know.

CodingCat (Member) commented:

> @RAMitchell I was pretty new when I wrote the fast_hist updater, so it lacks distributed mode support. I'd like to get to it after 0.81 release.

@hcho3 If you don't mind, I can take on the challenge of getting the distributed fast histogram algorithm working. I'm currently spending half my time on it in my Uber job, and next year I may have more time for xgboost.

hcho3 (Collaborator, Author) commented Oct 29, 2018

@CodingCat That would be great, thanks! Let me know if you have any questions about the 'hist' code.

hcho3 (Collaborator, Author) commented Oct 29, 2018

@CodingCat FYI, I plan to add unit tests for the 'hist' updater soon after the 0.81 release. That should help when it comes to adding distributed support.

Laurae2 (Contributor) commented Oct 29, 2018

@hcho3 @CodingCat approx seems to have been removed in the last month; is this expected behavior?

70d208d#diff-53a3a623be5ce5a351a89012c7b03a31 (did PR #3395 remove tree_method=approx?) => I am getting identical results between approx and non-approx...

hcho3 (Collaborator, Author) commented Oct 29, 2018

@Laurae2 Looks like the refactor removed an INFO message about approx being selected. Otherwise, approx should still be available.

hcho3 (Collaborator, Author) commented Oct 29, 2018

@Laurae2 Actually, you are right. Even though approx is still in the codebase, for some reason it is not being invoked even when tree_method=approx is set. I will investigate this bug ASAP.

hcho3 (Collaborator, Author) commented Oct 29, 2018

Issue #3840 has been filed. 0.81 won't be released until this is fixed.

Laurae2 (Contributor) commented Nov 1, 2018

@hcho3 I'm seeing something very strange on my server with fast histogram. I'll share the results once the benchmark finishes, hopefully tomorrow (we're talking about a huge negative efficiency for fast histogram; it's so large I'm still trying to measure it, and I hope it doesn't take too long).

For approx, the efficiency, while poor, is much better than expected, but I don't expect this to hold on every computer (maybe it improves with newer Intel CPU generations, i.e. higher RAM frequency?). I'll post the data once fast histogram finishes on my server.

For reference, I'm using the Bosch dataset with 477 features (the features with less than 5% missing values).

I've reached over 3,000 hours of CPU time (at least my server is being put to good use for a while). Next I'll look at https://github.com/hcho3/xgboost-fast-hist-perf-lab/blob/master/src/build_hist.cc with Intel VTune.

@hcho3 If you want, I can provide my benchmark R script once my server finishes computing. I ran with depth=8 and nrounds=50 for each of tree_method=exact, tree_method=approx (with the updater=grow_histmaker,prune workaround, before #3849), and tree_method=hist, from 1 to 72 threads. It may uncover more interesting things to work on (and you would also be able to test it on AWS).

Laurae2 (Contributor) commented Nov 3, 2018

Please see the preliminary results below, averaged over 7 runs. A summary table is provided. Unlike what the plots suggest, the CPUs were not pinned.

The charts look very different from what I expected (because the behavior is so strange, I'm re-running this with UMA on, i.e. NUMA off). Later I'll check with Intel VTune.

Hardware and Software:

  • CPU: Dual Xeon Gold 6154, 3.7 GHz all turbo, 2x 18 cores / 36 threads
  • RAM: 4x 64GB RAM DDR4 2666 MHz (dual-channel, approx 80 GBps bandwidth)
  • BIOS: NUMA on, Sub NUMA clustering off
  • Operating System: Pop!_OS 18.10
  • Governor: performance
  • Kernel: 4.18.0-10
  • Kernel flags: pti=off spectre_v2=off spec_store_bypass_disable=off l1tf=off noibrs noibpb nopti no_stf_barrier
  • Compiler: gcc 8.2.0
  • R: 3.5.1 compiled with gcc 8.2.0 and with Intel MKL
  • additional compilation flags in R: -O3 -mtune=native

Meltdown / Spectre protections:

laurae@laurae-compute:~$ head /sys/devices/system/cpu/vulnerabilities/*
==> /sys/devices/system/cpu/vulnerabilities/l1tf <==
Mitigation: PTE Inversion; VMX: vulnerable

==> /sys/devices/system/cpu/vulnerabilities/meltdown <==
Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/spec_store_bypass <==
Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/spectre_v1 <==
Mitigation: __user pointer sanitization

==> /sys/devices/system/cpu/vulnerabilities/spectre_v2 <==
Vulnerable

| Threads | Exact (efficiency) | Approx (efficiency) | Hist (efficiency) |
|--------:|-------------------:|--------------------:|------------------:|
| 1  | 1367s (100%)  | 1702s (100%)   | 69.9s (100%)  |
| 2  | 758.7s (180%) | 881.0s (193%)  | 52.5s (133%)  |
| 4  | 368.6s (371%) | 445.6s (382%)  | 31.7s (221%)  |
| 6  | 241.5s (566%) | 219.6s (582%)  | 24.1s (290%)  |
| 9  | 160.4s (852%) | 194.4s (875%)  | 23.1s (303%)  |
| 18 | 86.3s (1583%) | 106.3s (1601%) | 24.2s (289%)  |
| 27 | 66.4s (2059%) | 80.2s (2122%)  | 63.6s (110%)  |
| 36 | 52.9s (2586%) | 60.0s (2837%)  | 55.2s (127%)  |
| 54 | 215.4s (635%) | 289.5s (588%)  | 343.0s (20%)  |
| 72 | 218.9s (624%) | 295.6s (576%)  | 1237.2s (6%)  |

[Charts: xgboost Exact speed, Exact efficiency, Approx speed, Approx efficiency, Histogram speed, and Histogram efficiency]

RAMitchell (Member) commented:

Looks like a problem with multiple sockets.

Laurae2 (Contributor) commented Jan 15, 2019

As commented in #3957 (comment), I tested the commits a2dc929 (pre CPU improvement) and 5f151c5 (post CPU improvement).

I tested on my dual Xeon 6154 server (gcc compiler, not Intel), using Bosch for 500 iterations, eta 0.10, and depth 8, with 3 runs each for 1 to 72 threads. We see a performance increase of up to about 50% (roughly 1/3 less time) for multithreaded workloads at peak performance.

Here are the results for before #3957 (commit a2dc929):

[chart]

Here are the results for #3957 (commit 5f151c5):

[chart]

Using the efficiency curves, we see the ~50% scalability increase (this does not mean the issue is solved: we still have to improve it if we can; ideally, reaching the 1000-2000% range would be insanely great).

Efficiency curve of a2dc929:

[chart]

Efficiency curve of 5f151c5:

[chart]

hcho3 (Collaborator, Author) commented Jan 16, 2019

Thanks @Laurae2, I'll go ahead and pin this issue, so that it's always on top of the issue tracker. There is indeed more work to do.

hcho3 pinned this issue Jan 16, 2019
Laurae2 (Contributor) commented Jan 26, 2019

@hcho3 @SmirnovEgorRu I am seeing a small single-threaded CPU performance regression on 100% dense data with commit 5f151c5, which incurs a 10%-15% penalty overall when doing hyperparameter tuning with X cores x 1 xgboost thread.

Here is an example with 50M rows x 100 columns of random dense data (gcc 8); it requires at least 256 GB RAM to train properly from Python / R, and was run 3 times (6 days).

Commit a2dc929:

[chart]

Commit 5f151c5:

[chart]

Although the two commits have very similar multithreaded performance, single-threaded training is slower (@SmirnovEgorRu's improvements still scale faster, reaching 500% efficiency at 11 threads in this 50M x 100 case, vs 13 threads before).

Breaking out the gmat creation time, we have the following single-threaded times on 50M x 100:

| Commit | Total | gmat time | Train time |
|--------|-------|-----------|------------|
| a2dc929 | 2926s | 816s | 2109s |
| 5f151c5 | 3316s (+13%) | 817s (~%) | 2499s (+18%) |

SmirnovEgorRu (Contributor) commented:

@hcho3 @Laurae2 Generally, Hyper-Threading helps only for core-bound algorithms, not memory-bound ones.
HT helps keep the CPU pipeline loaded with more instructions. If most instructions are waiting on previous instructions (latency-bound), HT can really help; in some specific workloads I have observed speedups of up to 1.5x.
However, if your application spends most of its time working with memory (memory-bound), HT makes things worse: two hyper-threads share one CPU cache and evict each other's useful data. As a result, we see performance degradation.
Gradient boosting is a memory-bound algorithm. Using HT shouldn't bring a performance improvement in any case, and the maximum speedup from threading over the 1-thread version is limited by the number of hardware cores. So, in my opinion, it's better to measure performance on a CPU without HT.

As for NUMA, I observed the same problems in the DAAL implementation. It requires controlling which memory each core uses. I will look at it in the future.

As for the small slowdown on 1 thread, I will investigate it; I think the fix is easy.

@hcho3 At the moment I'm working on the next batch of optimizations. I hope to have a new pull request ready in the near future.

hcho3 (Collaborator, Author) commented Jan 30, 2019

@SmirnovEgorRu Thank you again for your effort. FYI, there was a recent discussion about increasing the amount of parallelism by performing level-wise node expansion: #4077.

hcho3 (Collaborator, Author) commented Jul 4, 2019

@Laurae2 Now that we merged in #3957, #4310, and #4529, can we assume that the scaling issue has been solved? Effects of NUMA may still be problematic.

hcho3 unpinned this issue Jul 4, 2019
Laurae2 (Contributor) commented Jul 5, 2019

@hcho3 I will re-benchmark later to check, but from what I have noticed there were performance regressions in production environments (especially #3957, which caused a more than 30x slowdown).

I'll check performance results with @szilard also.

Open example: szilard/GBM-perf#9

szilard (Contributor) commented Sep 17, 2020

The multi-core scaling, and in fact also the NUMA issue, have indeed been largely improved:

Multicore:

[chart]

The improvement on smaller data (0.1M rows) is very notable:

[charts]

More details here:

https://github.com/szilard/GBM-perf#multi-core-scaling-cpu
szilard/GBM-perf#29 (comment)

Also, the NUMA issue has been largely mitigated:

[charts]

hcho3 (Collaborator, Author) commented Sep 17, 2020

@szilard Thank you so much for taking the time to do the benchmark! And it's great news that XGBoost has improved its CPU performance scaling.

szilard (Contributor) commented Sep 17, 2020

Yeah, great job to everyone on this thread for accomplishing this.

szilard (Contributor) commented Sep 17, 2020

FYI, here are the training times on 1M rows on an EC2 r4.16xlarge (2 sockets with 16 cores + 16 HT each) using 1, 16 (one socket, no HT), and 64 (all) cores for different versions of xgboost:

[chart]

szilard/GBM-perf#40

SmirnovEgorRu (Contributor) commented:

@szilard, thank you very much for the analysis! Good to hear that the optimizations work.

P.S. Above I see that XGB 1.2 has some regression relative to version 1.1. That's very interesting information; let me look into it, as it is not expected.

SmirnovEgorRu (Contributor) commented:

@szilard, if this topic interests you, some background and results on the CPU optimizations are available in this blog post:
https://medium.com/intel-analytics-software/new-optimizations-for-cpu-in-xgboost-1-1-81144ea21115

szilard (Contributor) commented Sep 18, 2020

Thanks @SmirnovEgorRu for your optimization work and for the link to the blog post (I had not seen this post before).

To make it easier to reproduce my numbers and to get new ones in the future or on other hardware, I made a separate Dockerfile for this:

https://github.com/szilard/GBM-perf/tree/master/analysis/xgboost_cpu_by_version

You'll need to set the CPU core IDs for the first socket, excluding hyper-threaded cores (e.g. 0-15 on r4.16xlarge, which has 2 sockets with 16 cores + 16 HT each), and the xgboost version:

VER=v1.2.0
CORES_1SO_NOHT=0-15    ## set physical core ids on first socket, no hyperthreading
sudo docker build --build-arg CACHE_DATE=$(date +%Y-%m-%d) --build-arg VER=$VER -t gbmperf_xgboost_cpu_ver .
sudo docker run --rm -e CORES_1SO_NOHT=$CORES_1SO_NOHT gbmperf_xgboost_cpu_ver

It might be worth running the script several times; the training times on all cores usually show somewhat higher variability, and I'm not sure whether that's because of the virtualization environment (EC2) or because of NUMA.

szilard (Contributor) commented Sep 21, 2020

Results on c5.metal, which has a higher frequency and more cores than the r4.16xlarge I have been using in the benchmark:

szilard/GBM-perf#41

TL;DR: xgboost takes better advantage of faster and more numerous cores than the other libs. 👍

szilard (Contributor) commented Sep 21, 2020

I wonder, though, about this:

[chart]

The speedup from 1 to 24 cores for xgboost is smaller for the larger data (10M rows, right-hand panels) than for the smaller data (1M rows, middle column of panels). Is this some kind of cache effect (increased cache hits) that the other libs don't have?

szilard (Contributor) commented Sep 21, 2020

Here are some results on AMD:

szilard/GBM-perf#42

Looks like the xgboost optimizations are working great on AMD as well.
