Call for contribution: improve multi-core CPU performance of 'hist' #3810

Closed
hcho3 opened this issue Oct 18, 2018 · 44 comments

hcho3 (Collaborator) commented Oct 18, 2018

It is about time to tackle the elephant in the room: performance on multi-core CPUs.

Description of Problem

Currently, the hist tree-growing algorithm (tree_method=hist) scales poorly on multi-core CPUs: for some datasets, performance deteriorates as the number of threads is increased. This issue was discovered by @Laurae2's Gradient Boosting Benchmark (GitHub repo).

The scaling behavior for the Bosch dataset is as follows:
[Chart: performance scaling on C5.9xlarge]

Call for Contribution

I have identified the performance bottleneck of the 'hist' algorithm and extracted it into a small repository: hcho3/xgboost-fast-hist-perf-lab. You can try to improve performance by revising src/build_hist.cc.

Some ideas

  • Change the data matrix layout from CSR to other layouts, such as ELLPACK (see the rough sketch below)
  • Try to distribute work more evenly among worker threads. Work imbalance is caused by irregular sparsity patterns of the data matrix.
  • Group complementary features together, a common scenario for one-hot-encoded data.
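To make the first idea concrete, here is a minimal sketch of building a gradient histogram over an ELLPACK-style padded layout. The types and names are illustrative only (not xgboost's actual data structures); the point is that every row costs exactly row_stride iterations, so a static OpenMP schedule distributes work evenly without runtime scheduling overhead:

#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>
#include <omp.h>

// ELLPACK-style layout: every row is padded to the same width (row_stride).
struct EllpackMatrix {
  std::size_t n_rows;
  std::size_t row_stride;                 // maximum number of non-zeros per row
  std::vector<std::uint32_t> bin_index;   // n_rows * row_stride bin ids, padded
  static constexpr std::uint32_t kPad = std::numeric_limits<std::uint32_t>::max();
};

void BuildHistEllpack(const EllpackMatrix& m,
                      const std::vector<double>& grad,   // per-row gradient
                      std::vector<double>* hist) {       // one slot per feature bin
  // Per-thread partial histograms avoid atomics on the shared output.
  std::vector<std::vector<double>> partial(omp_get_max_threads(),
                                           std::vector<double>(hist->size(), 0.0));
  #pragma omp parallel for schedule(static)   // uniform work per row, so static chunking is balanced
  for (std::size_t row = 0; row < m.n_rows; ++row) {
    std::vector<double>& local = partial[omp_get_thread_num()];
    for (std::size_t k = 0; k < m.row_stride; ++k) {
      const std::uint32_t bin = m.bin_index[row * m.row_stride + k];
      if (bin != EllpackMatrix::kPad) local[bin] += grad[row];
    }
  }
  for (const auto& local : partial)           // reduce per-thread histograms into the output
    for (std::size_t b = 0; b < hist->size(); ++b) (*hist)[b] += local[b];
}

The real updater accumulates gradient/hessian pairs and works on the row set of one tree node rather than all rows, but the layout/scheduling trade-off is the same.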
hcho3 (Collaborator, Author) commented Oct 18, 2018

@Laurae2 Thank you for preparing the GBT benchmark. It has been helpful in identifying the problem spot.

trivialfis (Member) commented:

@hcho3 Does OpenMP's guided schedule help with load balancing? If so, the ELLPACK layout won't be very useful.

hcho3 (Collaborator, Author) commented Oct 24, 2018

My guess is that static allocation of work using ELLPACK would achieve a balanced workload with lower overhead than OpenMP's guided or dynamic modes. With dynamic scheduling, you incur the runtime overhead of maintaining a work-stealing queue.
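To illustrate the trade-off being discussed (illustrative names only, not the actual build_hist.cc code): with a CSR layout, rows have different numbers of entries, so the schedule clause decides between load imbalance and scheduling overhead.

#include <cstddef>
#include <cstdint>
#include <vector>

void BuildHistCSR(const std::vector<std::size_t>& row_ptr,   // CSR row offsets (size n_rows + 1)
                  const std::vector<std::uint32_t>& bin,     // CSR entries = feature bin ids
                  const std::vector<double>& grad,           // per-row gradient
                  std::vector<double>* hist) {
  const std::size_t n_rows = row_ptr.size() - 1;
  // schedule(static): near-zero overhead, but a thread that gets the dense rows finishes last.
  // schedule(guided) / schedule(dynamic, chunk): better balance, but the OpenMP runtime has to
  // hand out chunks at run time (the work-queue overhead mentioned above).
  #pragma omp parallel for schedule(guided)
  for (std::size_t row = 0; row < n_rows; ++row) {
    for (std::size_t j = row_ptr[row]; j < row_ptr[row + 1]; ++j) {
      #pragma omp atomic                       // simplified; real code would use per-thread histograms
      (*hist)[bin[j]] += grad[row];
    }
  }
}

With an ELLPACK-style padded layout (see the earlier sketch), the per-row cost is uniform, so schedule(static) alone gives a balanced partition.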

CodingCat (Member) commented:

This might be a bit off-topic, but do we have benchmark results for approx?

We found sub-optimal speedup with multi-threading in our internal environment and want to compare against others' data.

hcho3 (Collaborator, Author) commented Oct 24, 2018

@CodingCat The linked benchmark suite only uses hist. Does approx show performance degradation like hist (e.g. 36 threads slower than 3 threads)?

CodingCat (Member) commented:

@hcho3 Due to limitations in our cluster, we can only test with up to 8 threads, but we see very limited speedup going from 4 to 8.

hcho3 (Collaborator, Author) commented Oct 24, 2018

@CodingCat You mean 8 threads run slower than 4?

Laurae2 (Contributor) commented Oct 24, 2018

@CodingCat approx scales so poorly that I didn't even want to benchmark it. It doesn't scale properly even on my 4-core laptop (3.6 GHz), so I can't imagine what happens with 64 or 72 threads.

@hcho3 I'll take a look at it using your repository with VTune later.


For those who want detailed performance data in VTune, add the following include to the source file:

#include <ittnotify.h> // Intel Instrumentation and Tracing Technology

Add the following before the region you want to track, outside of any loop (rename the strings/variables as needed):

__itt_resume();                                                     // resume collection
__itt_domain* domain = __itt_domain_create("MyDomain");             // profiling domain
__itt_string_handle* task = __itt_string_handle_create("MyTask");   // label for the tracked region
__itt_task_begin(domain, __itt_null, __itt_null, task);             // start of tracked region

Add the following after the region you want to track, outside of any loop:

__itt_task_end(domain);   // end of tracked region
__itt_pause();            // pause collection

Then create a VTune project with the correct parameters for the number of threads, and launch the executable with collection paused so that only the instrumented region is analyzed.
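A minimal, self-contained sketch of the above, assuming VTune's ITT headers and libittnotify are on the include/link path (the exact paths and flags vary by VTune version):

#include <ittnotify.h>   // Intel Instrumentation and Tracing Technology API

volatile double sink = 0.0;   // keeps the dummy work from being optimized away

void BuildHist() {            // hypothetical stand-in for the hot loop under investigation
  for (int i = 0; i < 100000000; ++i) sink += i * 1e-9;
}

int main() {
  __itt_domain* domain = __itt_domain_create("PerfLab");
  __itt_string_handle* task = __itt_string_handle_create("BuildHist");

  __itt_resume();                                           // start collection (launch VTune paused)
  __itt_task_begin(domain, __itt_null, __itt_null, task);
  BuildHist();                                              // region of interest
  __itt_task_end(domain);
  __itt_pause();                                            // stop collection again
  return 0;
}

Link against libittnotify (e.g. -littnotify, plus the VTune include/lib directories) so these calls resolve.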

CodingCat (Member) commented:

@hcho3 It's not slower, but there is maybe only a ~15% speedup with 4 more threads. (If I ran more experiments, I suspect the results would converge.)

CodingCat (Member) commented:

@Laurae2 looks like I am not the only one

Laurae2 (Contributor) commented Oct 24, 2018

@hcho3 If no one else does it first, I'll try to get you scaling results for exact, approx, and hist before the end of this week, all with depth=6, on commit e26b5d6.

I recently migrated my compute server, and this week I'm re-running benchmarks on Bosch on a new machine with 3.7 GHz all-core turbo / 36 cores / 72 threads / 80 GB/s RAM bandwidth.

RAMitchell (Member) commented:

The fast_hist updater should be much faster for distributed xgboost. @CodingCat I am surprised no one has tried to add AllReduce calls so it works in distributed mode.
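For context, a conceptual sketch of what such a sync could look like, assuming rabit's Allreduce interface (illustrative only, not the actual updater code): each worker builds its local histogram from its data shard, then an element-wise sum across workers yields the global histogram, after which split finding proceeds as in the single-node case.

#include <vector>
#include <rabit/rabit.h>

// Sum the local per-node histogram across all distributed workers, in place.
void SyncHistogram(std::vector<double>* hist) {
  rabit::Allreduce<rabit::op::Sum>(hist->data(), hist->size());
}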

hcho3 (Collaborator, Author) commented Oct 24, 2018

@RAMitchell I was pretty new when I wrote the fast_hist updater, so it lacks distributed mode support. I'd like to get to it after 0.81 release.

hcho3 (Collaborator, Author) commented Oct 24, 2018

@Laurae2 FYI, I ran your benchmark suite on a C5.9xlarge machine and the results for XGBoost hist seem to be consistent with your previous results. I can put up the numbers if you'd like.

hcho3 (Collaborator, Author) commented Oct 24, 2018

@Laurae2 Also, I have access to EC2 machines. If you have a script you'd like to run on an EC2 instance, let me know.

CodingCat (Member) commented:

> @RAMitchell I was pretty new when I wrote the fast_hist updater, so it lacks distributed mode support. I'd like to get to it after 0.81 release.

@hcho3 If you don't mind, I can take on the challenge of getting the distributed fast histogram algorithm working. I'm currently spending half my time on it in my Uber job, and next year I may have more time for xgboost.

hcho3 (Collaborator, Author) commented Oct 29, 2018

@CodingCat That would be great, thanks! Let me know if you have any questions about the 'hist' code.

hcho3 (Collaborator, Author) commented Oct 29, 2018

@CodingCat FYI, I plan to add unit tests for the 'hist' updater soon after the 0.81 release. That should help when it comes to adding distributed support.

Laurae2 (Contributor) commented Oct 29, 2018

@hcho3 @CodingCat approx seems to have been removed in the last month; is this expected behavior?

70d208d#diff-53a3a623be5ce5a351a89012c7b03a31 (did PR #3395 remove tree_method=approx?) => I am getting identical results between approx and non-approx...

hcho3 (Collaborator, Author) commented Oct 29, 2018

@Laurae2 Looks like the refactor removed an INFO message about approx being selected. Otherwise, approx should still be available.

hcho3 (Collaborator, Author) commented Oct 29, 2018

@Laurae2 Actually, you are right. Even though approx is still in the codebase, for some reason it is not being invoked even when tree_method=approx is set. I will investigate this bug ASAP.

hcho3 (Collaborator, Author) commented Oct 29, 2018

Issue #3840 has been filed. 0.81 won't be released until this is fixed.

Laurae2 (Contributor) commented Nov 1, 2018

@hcho3 I'm seeing something very strange on my server with fast histogram. I'll share the results once the benchmark finishes, hopefully tomorrow (we're talking about a huge negative efficiency for fast histogram; it's so large I'm still trying to measure it, and I hope it doesn't take too long).

For approx, the efficiency, while poor, is much better than expected, but I don't expect this to hold on every computer (maybe it improves with newer Intel CPU generations, i.e. higher RAM frequency?). I'll post the data once fast histogram finishes on my server.

For reference, I'm using the Bosch dataset with 477 features (the features with less than 5% missing values).

I've reached over 3,000 hours of CPU time (at least my server is being put to good use for a while). Next I'll look at https://github.com/hcho3/xgboost-fast-hist-perf-lab/blob/master/src/build_hist.cc with Intel VTune.

@hcho3 If you want, I can provide my benchmark R script once my server finishes computing. I ran with depth=8 and nrounds=50 for each of tree_method=exact, tree_method=approx (with the updater=grow_histmaker,prune workaround, before #3849), and tree_method=hist, from 1 to 72 threads. It may uncover more interesting things to work on (and you would also be able to test it on AWS).

Laurae2 (Contributor) commented Nov 3, 2018

Please see the preliminary results below, averaged over 7 runs. A summary table is provided. Unlike what the plots suggest, the CPUs were not pinned.

The charts look very different from what I expected (because the behavior is so strange, I'm re-running this with UMA on, i.e. NUMA off). Later I'll check with Intel VTune.

Hardware and Software:

  • CPU: Dual Xeon Gold 6154, 3.7 GHz all turbo, 2x 18 cores / 36 threads
  • RAM: 4x 64GB RAM DDR4 2666 MHz (dual-channel, approx 80 GBps bandwidth)
  • BIOS: NUMA on, Sub NUMA clustering off
  • Operating System: Pop!_OS 18.10
  • Governor: performance
  • Kernel: 4.18.0-10
  • Kernel flags: pti=off spectre_v2=off spec_store_bypass_disable=off l1tf=off noibrs noibpb nopti no_stf_barrier
  • Compiler: gcc 8.2.0
  • R: 3.5.1 compiled with gcc 8.2.0 and with Intel MKL
  • additional compilation flags in R: -O3 -mtune=native

Meltdown / Spectre protections:

laurae@laurae-compute:~$ head /sys/devices/system/cpu/vulnerabilities/*
==> /sys/devices/system/cpu/vulnerabilities/l1tf <==
Mitigation: PTE Inversion; VMX: vulnerable

==> /sys/devices/system/cpu/vulnerabilities/meltdown <==
Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/spec_store_bypass <==
Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/spectre_v1 <==
Mitigation: __user pointer sanitization

==> /sys/devices/system/cpu/vulnerabilities/spectre_v2 <==
Vulnerable

| Threads | Exact (efficiency) | Approx (efficiency) | Hist (efficiency) |
|--------:|-------------------:|--------------------:|------------------:|
| 1  | 1367s (100%)  | 1702s (100%)   | 69.9s (100%)  |
| 2  | 758.7s (180%) | 881.0s (193%)  | 52.5s (133%)  |
| 4  | 368.6s (371%) | 445.6s (382%)  | 31.7s (221%)  |
| 6  | 241.5s (566%) | 219.6s (582%)  | 24.1s (290%)  |
| 9  | 160.4s (852%) | 194.4s (875%)  | 23.1s (303%)  |
| 18 | 86.3s (1583%) | 106.3s (1601%) | 24.2s (289%)  |
| 27 | 66.4s (2059%) | 80.2s (2122%)  | 63.6s (110%)  |
| 36 | 52.9s (2586%) | 60.0s (2837%)  | 55.2s (127%)  |
| 54 | 215.4s (635%) | 289.5s (588%)  | 343.0s (20%)  |
| 72 | 218.9s (624%) | 295.6s (576%)  | 1237.2s (6%)  |

[Charts: xgboost Exact speed, Exact efficiency, Approx speed, Approx efficiency, Histogram speed, and Histogram efficiency]

RAMitchell (Member) commented:

Looks like a problem with multiple sockets.

Laurae2 (Contributor) commented Jan 15, 2019

As commented in #3957 (comment), I tested the commits a2dc929 (pre CPU improvement) and 5f151c5 (post CPU improvement).

I tested on my dual Xeon 6154 server (gcc compiler, not Intel), using Bosch for 500 iterations, eta 0.10, and depth 8, with 3 runs each for 1 to 72 threads. We see a performance increase of up to about 50% (roughly 1/3 less time) for multithreaded workloads at peak performance.

Here are the results for before #3957 (commit a2dc929):

[chart]

Here are the results for #3957 (commit 5f151c5):

[chart]

Using the efficiency curves, we see the ~50% scalability increase (this does not mean the issue is solved: we still have to improve it if we can; ideally, reaching the 1000-2000% range would be insanely great).

Efficiency curve of a2dc929:

[chart]

Efficiency curve of 5f151c5:

[chart]

hcho3 (Collaborator, Author) commented Jan 16, 2019

Thanks @Laurae2, I'll go ahead and pin this issue, so that it's always on top of the issue tracker. There is indeed more work to do.

hcho3 pinned this issue Jan 16, 2019
Laurae2 (Contributor) commented Jan 26, 2019

@hcho3 @SmirnovEgorRu I am seeing a small single-threaded CPU performance regression on 100% dense data with commit 5f151c5, which incurs a 10%-15% penalty overall when doing hyperparameter tuning with X cores x 1 xgboost thread.

Here is an example with 50M rows x 100 columns of random dense data (gcc 8); it requires at least 256 GB RAM to train properly from Python / R, and was run 3 times (6 days).

Commit a2dc929:

[chart]

Commit 5f151c5:

[chart]

Although the two commits have very similar multithreaded performance, single-threaded training is slower (@SmirnovEgorRu's improvements still scale faster, reaching 500% efficiency at 11 threads in this 50M x 100 case, vs 13 threads before).

Breaking out the gmat creation time, we have the following single-threaded times on 50M x 100:

| Commit | Total | gmat time | Train time |
|--------|-------|-----------|------------|
| a2dc929 | 2926s | 816s | 2109s |
| 5f151c5 | 3316s (+13%) | 817s (~%) | 2499s (+18%) |

SmirnovEgorRu (Contributor) commented:

@hcho3 @Laurae2 Generally, Hyper-Threading helps only for core-bound algorithms, not memory-bound ones.
HT helps keep the CPU pipeline loaded with more instructions. If most instructions are waiting on previous instructions (latency-bound), HT can really help; in some specific workloads I have observed speedups of up to 1.5x.
However, if your application spends most of its time working with memory (memory-bound), HT makes things worse: two hyper-threads share one CPU cache and evict each other's useful data. As a result, we see performance degradation.
Gradient boosting is a memory-bound algorithm. Using HT shouldn't bring a performance improvement in any case, and the maximum speedup from threading over the 1-thread version is limited by the number of hardware cores. So, in my opinion, it's better to measure performance on a CPU without HT.

As for NUMA, I observed the same problems in the DAAL implementation. It requires controlling which memory each core uses. I will look at it in the future.

As for the small slowdown on 1 thread, I will investigate it; I think the fix is easy.

@hcho3 At the moment I'm working on the next batch of optimizations. I hope to have a new pull request ready in the near future.

hcho3 (Collaborator, Author) commented Jan 30, 2019

@SmirnovEgorRu Thank you again for your effort. FYI, there was a recent discussion about increasing the amount of parallelism by performing level-wise node expansion: #4077.

hcho3 (Collaborator, Author) commented Jul 4, 2019

@Laurae2 Now that we merged in #3957, #4310, and #4529, can we assume that the scaling issue has been solved? Effects of NUMA may still be problematic.

hcho3 unpinned this issue Jul 4, 2019
Laurae2 (Contributor) commented Jul 5, 2019

@hcho3 I will re-benchmark later to check, but from what I have noticed there were performance regressions in production environments (especially #3957, which caused a more than 30x slowdown).

I'll check performance results with @szilard also.

Open example: szilard/GBM-perf#9

szilard (Contributor) commented Sep 17, 2020

The multi-core scaling, and in fact also the NUMA issue, have indeed been largely improved:

Multicore:

[chart]

The improvement on smaller data (0.1M rows) is very notable:

[charts]

More details here:

https://github.com/szilard/GBM-perf#multi-core-scaling-cpu
szilard/GBM-perf#29 (comment)

Also, the NUMA issue has been largely mitigated:

[charts]

hcho3 (Collaborator, Author) commented Sep 17, 2020

@szilard Thank you so much for taking the time to do the benchmark! And it's great news that XGBoost has improved its CPU performance scaling.

szilard (Contributor) commented Sep 17, 2020

Yeah, great job to everyone on this thread for accomplishing this.

szilard (Contributor) commented Sep 17, 2020

FYI, here are the training times on 1M rows on an EC2 r4.16xlarge (2 sockets with 16 cores + 16 HT each) using 1, 16 (one socket, no HT), and 64 (all) cores for different versions of xgboost:

[chart]

szilard/GBM-perf#40

SmirnovEgorRu (Contributor) commented:

@szilard, thank you very much for the analysis! Good to hear that the optimizations work.

P.S. Above I see that XGB 1.2 has some regression relative to version 1.1. That's very interesting information; let me look into it, as it is not expected.

SmirnovEgorRu (Contributor) commented:

@szilard, if this topic interests you, some background and results on the CPU optimizations are available in this blog post:
https://medium.com/intel-analytics-software/new-optimizations-for-cpu-in-xgboost-1-1-81144ea21115

szilard (Contributor) commented Sep 18, 2020

Thanks @SmirnovEgorRu for your optimization work and for the link to the blog post (I had not seen this post before).

To make it easier to reproduce my numbers and to get new ones in the future or on other hardware, I made a separate Dockerfile for this:

https://github.com/szilard/GBM-perf/tree/master/analysis/xgboost_cpu_by_version

You'll need to set the CPU core IDs for the first socket, excluding hyper-threaded cores (e.g. 0-15 on r4.16xlarge, which has 2 sockets with 16 cores + 16 HT each), and the xgboost version:

VER=v1.2.0
CORES_1SO_NOHT=0-15    ## set physical core ids on first socket, no hyperthreading
sudo docker build --build-arg CACHE_DATE=$(date +%Y-%m-%d) --build-arg VER=$VER -t gbmperf_xgboost_cpu_ver .
sudo docker run --rm -e CORES_1SO_NOHT=$CORES_1SO_NOHT gbmperf_xgboost_cpu_ver

It might be worth running the script several times; the training times on all cores usually show somewhat higher variability, and I'm not sure whether that's because of the virtualization environment (EC2) or because of NUMA.

szilard (Contributor) commented Sep 21, 2020

Results on c5.metal, which has a higher frequency and more cores than the r4.16xlarge I have been using in the benchmark:

szilard/GBM-perf#41

TL;DR: xgboost takes better advantage of faster and more numerous cores than the other libs. 👍

szilard (Contributor) commented Sep 21, 2020

I wonder, though, about this:

[chart]

The speedup from 1 to 24 cores for xgboost is smaller for the larger data (10M rows, right-hand panels) than for the smaller data (1M rows, middle column of panels). Is this some kind of cache effect (increased cache hits) that the other libs don't have?

szilard (Contributor) commented Sep 21, 2020

Here are some results on AMD:

szilard/GBM-perf#42

Looks like the xgboost optimizations are working great on AMD as well.
