This repository has been archived by the owner on Jun 24, 2020. It is now read-only.

hcho3/xgboost-fast-hist-perf-lab


The hist method of XGBoost scales poorly on multi-core CPUs: a demo script

Currently, the hist tree-growing algorithm (tree_method=hist) of XGBoost scales poorly on multi-core CPUs: for some datasets, performance deteriorates as the number of threads is increased. This issue was discovered by @Laurae2's Gradient Boosting Benchmark.

To make things easier for contributors, I went ahead and isolated the performance bottleneck. The vast majority of the run time (> 95%) is spent in a stage known as gradient histogram construction. This repository isolates that stage so that it is easy to fix and improve.

How to compile and run

  1. Compile the script by running CMake:
mkdir build
cd build
cmake ..
make
  2. Download record.tar.bz2 in the same directory.
  3. Extract record.tar.bz2 by running tar xvf record.tar.bz2.
  4. Run the script:
# Usage: ./perflab record/ [number of threads]
./perflab record/ 36

Running with different numbers of threads should produce the following performance trend:

[Figure: Performance scaling on C5.9xlarge]

What this script does

The script reads from record.tar.bz2, which was processed from the Bosch dataset. Its job is to compute histograms of gradient pairs, where each bin of a histogram is a partial sum.

Some background:

  • A gradient pair for a given instance (X_i, y_i) is a pair of double values that quantifies the distance between the true label y_i and the predicted label yhat_i.
  • There are as many gradient pairs as there are instances in a training dataset.
  • In order to find optimal splits for decision trees, we compute a histogram of gradients. Each bin of the histogram stands for a range of feature values. The value of the bin is given by the sum of gradients corresponding to the data points lying inside the range.
  • In each boosting iteration, we have to compute multiple histograms, each histogram corresponding to a set of instances.
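The background above can be sketched in code. The following is a minimal single-threaded illustration, not the code from this repository: the type and function names (GradientPair, QuantizedMatrix, BuildHistogram) and the row-major layout of bin indices are assumptions made for the example. It also assumes the two values in a gradient pair are the first- and second-order gradients, and that feature values have already been mapped to global bin indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: one gradient pair per training instance.
// Assumed to hold the first-order (grad) and second-order (hess) gradients.
struct GradientPair {
  double grad = 0.0;
  double hess = 0.0;
};

// Hypothetical quantized feature matrix in row-major layout:
// bin_index[row * num_features + f] is the global histogram bin that the
// value of feature f in that row falls into.
struct QuantizedMatrix {
  std::size_t num_features;
  std::vector<std::uint32_t> bin_index;
};

// Builds one histogram for the given set of instances (rows).
// Each bin ends up holding the sum of the gradient pairs of the
// instances whose feature value lies inside that bin's range.
std::vector<GradientPair> BuildHistogram(
    const QuantizedMatrix& matrix,
    const std::vector<GradientPair>& gpair,
    const std::vector<std::size_t>& rows,
    std::size_t num_bins) {
  std::vector<GradientPair> hist(num_bins);
  for (std::size_t row : rows) {
    assert(row < gpair.size());
    const std::uint32_t* bins = &matrix.bin_index[row * matrix.num_features];
    for (std::size_t f = 0; f < matrix.num_features; ++f) {
      const std::uint32_t bin = bins[f];
      assert(bin < num_bins);
      hist[bin].grad += gpair[row].grad;  // partial sum of first values
      hist[bin].hess += gpair[row].hess;  // partial sum of second values
    }
  }
  return hist;
}
```

Calling BuildHistogram once per set of instances yields the multiple histograms needed in a boosting iteration. Note that many rows update the same bins, so any attempt to split the outer loop across threads has to decide how those shared updates are combined.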

Setting build types

  • By default, the 'Release' build type is used, with flags -O3 -DNDEBUG.

  • For profiling, you may want to add debug symbols by choosing the 'RelWithDebInfo' build type instead:

    cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..

    This build type uses the following flags: -O2 -g -DNDEBUG.

  • For full control over the compilation flags, specify CMAKE_CXX_FLAGS_RELEASE:

    cmake -DCMAKE_CXX_FLAGS_RELEASE="-O3 -g -DNDEBUG -march=native" ..

    This overrides the default Release flags. Here, we are compiling with the flags -O3 -g -DNDEBUG -march=native.

    You can check whether the flags were applied by running make VERBOSE=1 and looking for them in the C++ compilation lines:

    /usr/bin/c++   -I/home/ubuntu/xgboost-fast-hist-perf-lab/include  -O3 -g -DNDEBUG -march=native
        -fopenmp -std=gnu++11 -o CMakeFiles/perflab.dir/src/main.cc.o
        -c /home/ubuntu/xgboost-fast-hist-perf-lab/src/main.cc

About

Deeper look into performance of tree_method='hist' for multi-core CPUs
