
[CUDA] New CUDA version Part 1 #4630

Merged: 176 commits, Mar 23, 2022

Commits (176)
94aed50
new cuda framework
shiyu1994 Apr 20, 2021
18df6b2
add histogram construction kernel
Apr 22, 2021
9b21d2b
before removing multi-gpu
Apr 29, 2021
634a4f1
new cuda framework
Apr 29, 2021
23bcaa2
tree learner cuda kernels
May 6, 2021
6c14cd9
single tree framework ready
May 7, 2021
aa0b3de
single tree training framework
May 9, 2021
bc85ced
remove comments
May 9, 2021
18d957a
boosting with cuda
May 10, 2021
28186c0
optimize for best split find
May 13, 2021
60c7e4e
data split
May 14, 2021
57547fb
move boosting into cuda
May 17, 2021
608fd70
parallel synchronize best split point
May 27, 2021
277be8b
merge split data kernels
Jun 1, 2021
ffcf765
before code refactor
Jun 1, 2021
a58c1e1
use tasks instead of features as units for split finding
Jun 2, 2021
72d41c9
refactor cuda best split finder
Jun 2, 2021
f7a7658
fix configuration error with small leaves in data split
shiyu1994 Jun 2, 2021
b6efd10
skip histogram construction of too small leaf
shiyu1994 Jun 2, 2021
6f4e39d
skip split finding of invalid leaves
shiyu1994 Jun 3, 2021
4072bb8
support row wise with CUDA
shiyu1994 Jun 4, 2021
88ecde9
copy data for split by column
shiyu1994 Jun 4, 2021
dec7501
copy data from host to CPU by column for data partition
shiyu1994 Jun 8, 2021
2dccb7f
add synchronize best splits for one leaf from multiple blocks
shiyu1994 Jun 8, 2021
0168d2c
partition dense row data
shiyu1994 Jun 9, 2021
0570fe0
fix sync best split from task blocks
shiyu1994 Jun 9, 2021
374018c
add support for sparse row wise for CUDA
shiyu1994 Jun 9, 2021
40c49cc
remove useless code
shiyu1994 Jun 9, 2021
dc41a00
add l2 regression objective
shiyu1994 Jun 10, 2021
bd065b7
sparse multi value bin enabled for CUDA
shiyu1994 Jun 11, 2021
a5fadfb
fix cuda ranking objective
shiyu1994 Jun 16, 2021
3202b79
support for number of items <= 2048 per query
Jun 16, 2021
cd687c9
speedup histogram construction by interleaving global memory access
Jun 23, 2021
320c449
split optimization
Jun 28, 2021
eb1d7fa
add cuda tree predictor
Jul 2, 2021
dd177f5
remove comma
Jul 18, 2021
ee836d6
refactor objective and score updater
Jul 19, 2021
0467fce
before use struct
Jul 21, 2021
f05da3c
use structure for split information
Jul 21, 2021
400622a
use structure for leaf splits
Jul 21, 2021
d9d3aa9
return CUDASplitInfo directly after finding best split
Jul 21, 2021
45cf7a7
split with CUDATree directly
Jul 22, 2021
9dea18d
use cuda row data in cuda histogram constructor
Jul 26, 2021
572e2b0
clean src/treelearner/cuda
Jul 26, 2021
fe58d4c
gather shared cuda device functions
Jul 27, 2021
dc461dc
put shared CUDA functions into header file
Jul 27, 2021
ba565c1
change smaller leaf from <= back to < for consistent result with CPU
Jul 27, 2021
a781ef5
add tree predictor
Aug 3, 2021
c8a6fab
remove useless cuda_tree_predictor
Aug 3, 2021
a7504dc
predict on CUDA with pipeline
Aug 4, 2021
896d47b
add global sort algorithms
Aug 9, 2021
fe6ed74
add global argsort for queries with many items in ranking tasks
Aug 9, 2021
7808455
remove limitation of maximum number of items per query in ranking
Aug 11, 2021
7a0d218
add cuda metrics
Aug 16, 2021
ca42f3b
fix CUDA AUC
Aug 18, 2021
c681102
remove debug code
Aug 18, 2021
ea60566
add regression metrics
Aug 19, 2021
5c84788
remove useless file
Aug 19, 2021
c2c2407
don't use mask in shuffle reduce
Aug 19, 2021
b43d367
add more regression objectives
Sep 2, 2021
951aa37
fix cuda mape loss
Sep 3, 2021
b50ce5b
use template for different versions of BitonicArgSortDevice
Sep 3, 2021
f51fd70
add multiclass metrics
Sep 6, 2021
35c742d
add ndcg metric
Sep 7, 2021
510d878
fix cross entropy objectives and metrics
Sep 10, 2021
95f4612
fix cross entropy and ndcg metrics
Sep 10, 2021
bb997d0
add support for customized objective in CUDA
Sep 10, 2021
17b78d1
complete multiclass ova for CUDA
Sep 10, 2021
72aa863
merge master
Sep 13, 2021
8537b8c
separate cuda tree learner
Sep 13, 2021
8fb8562
use shuffle based prefix sum
Sep 13, 2021
883ed15
clean up cuda_algorithms.hpp
Sep 13, 2021
e7ffc3f
add copy subset on CUDA
Sep 15, 2021
d7c4bb4
add bagging for CUDA
Sep 15, 2021
d9bf3e5
clean up code
Sep 15, 2021
95fd61a
copy gradients from host to device
Sep 15, 2021
285c2d6
support bagging without using subset
Sep 15, 2021
1a09c19
add support of bagging with subset for CUDAColumnData
Sep 17, 2021
740f853
add support of bagging with subset for dense CUDARowData
Sep 18, 2021
f42e87e
refactor copy sparse subrow
Sep 24, 2021
0b9ca24
use copy subset for column subset
Sep 26, 2021
9a94240
add reset train data and reset config for CUDA tree learner
Sep 26, 2021
1f6dd90
add USE_CUDA ifdef to cuda tree learner files
Sep 26, 2021
4ca7586
check that dataset doesn't contain CUDA tree learner
Sep 26, 2021
25f57e3
remove printf debug information
Sep 26, 2021
12794b0
use full new cuda tree learner only when using single GPU
Sep 26, 2021
44e47ec
Merge branch 'master' of https://github.com/microsoft/LightGBM into c…
Sep 26, 2021
7e18687
disable all CUDA code when using CPU version
Sep 27, 2021
469e992
recover main.cpp
Sep 27, 2021
f2812c8
add cpp files for multi value bins
Sep 30, 2021
8e884b2
update LightGBM.vcxproj
Sep 30, 2021
9b9a63c
update LightGBM.vcxproj
Sep 30, 2021
e0c9f6f
fix lint errors
Sep 30, 2021
3bba6d7
fix lint errors
Sep 30, 2021
8f9f03e
update Makevars
Sep 30, 2021
01d772d
fix the case with 0 feature and 0 bin
Oct 8, 2021
e57dd15
fix lint errors
Oct 8, 2021
a5b9f7a
recover default device type to cpu
Oct 8, 2021
5f03d45
fix na_as_missing case
Oct 9, 2021
b2aaa9f
fix UpdateDataIndexToLeafIndexKernel
shiyu1994 Oct 15, 2021
0726d87
create CUDA trees when needed in CUDADataPartition::UpdateTrainScore
shiyu1994 Oct 15, 2021
1dea6bc
add refit by tree for cuda tree learner
shiyu1994 Oct 21, 2021
14b9ce9
fix test_refit in test_engine.py
Oct 21, 2021
4b936de
create set of large bin partitions in CUDARowData
Oct 21, 2021
4193768
add histogram construction for columns with a large number of bins
shiyu1994 Oct 23, 2021
0b6e79e
add find best split for categorical features on CUDA
shiyu1994 Oct 26, 2021
25f20a7
add bitvectors for categorical split
shiyu1994 Oct 27, 2021
82c33e4
cuda data partition split for categorical features
shiyu1994 Oct 28, 2021
ca16070
fix split tree with categorical feature
shiyu1994 Oct 29, 2021
c8716f1
fix categorical feature splits
shiyu1994 Nov 4, 2021
4bcaa03
refactor cuda_data_partition.cu with multi-level templates
shiyu1994 Nov 5, 2021
536f603
refactor CUDABestSplitFinder by grouping task information into struct
shiyu1994 Nov 5, 2021
015e099
pre-allocate space for vector split_find_tasks_ in CUDABestSplitFinder
shiyu1994 Nov 8, 2021
4c260d2
fix misuse of reference
shiyu1994 Nov 8, 2021
89d8214
remove useless changes
shiyu1994 Nov 8, 2021
54bc66a
add support for path smoothing
shiyu1994 Nov 9, 2021
86e208a
virtual destructor for LightGBM::Tree
shiyu1994 Nov 9, 2021
d888f1d
fix overlapped cat threshold in best split infos
shiyu1994 Nov 12, 2021
5efe0fb
reset histogram pointers in data partition and spllit finder in Reset…
shiyu1994 Nov 17, 2021
559a569
merge with LightGBM/master
shiyu1994 Nov 17, 2021
0bb88fb
comment useless parameter
shiyu1994 Nov 17, 2021
0678d9a
fix reverse case when na is missing and default bin is zero
shiyu1994 Nov 18, 2021
26130d9
fix mfb_is_na and mfb_is_zero and is_single_feature_column
shiyu1994 Nov 18, 2021
d49e92a
remove debug log
shiyu1994 Nov 19, 2021
3214d68
fix cat_l2 when one-hot
shiyu1994 Nov 23, 2021
361d2b0
merge master
shiyu1994 Nov 23, 2021
85ea408
switch shared histogram size according to CUDA version
shiyu1994 Dec 1, 2021
2af0f5d
gpu_use_dp=true when cuda test
shiyu1994 Dec 1, 2021
d0a628f
revert modification in config.h
shiyu1994 Dec 1, 2021
e0018ea
fix setting of gpu_use_dp=true in .ci/test.sh
shiyu1994 Dec 1, 2021
e54b51a
fix linter errors
shiyu1994 Dec 1, 2021
541235f
fix linter error
shiyu1994 Dec 1, 2021
a2ead3c
recover main.cpp
shiyu1994 Dec 1, 2021
2a81af6
separate cuda_exp and cuda
shiyu1994 Dec 8, 2021
9881075
fix ci bash scripts
shiyu1994 Dec 8, 2021
52b1e88
add USE_CUDA_EXP flag
shiyu1994 Dec 14, 2021
09054a1
Merge branch 'master' into cuda-tree-learner-subset
shiyu1994 Dec 14, 2021
c2a0be8
switch off USE_CUDA_EXP
shiyu1994 Dec 24, 2021
0651cca
Merge remote-tracking branch 'LightGBM/master' into cuda-tree-learner…
shiyu1994 Dec 24, 2021
fbc3760
revert changes in python-packages
shiyu1994 Dec 24, 2021
c58635b
more careful separation for USE_CUDA_EXP
shiyu1994 Dec 24, 2021
93d5950
fix CUDARowData::DivideCUDAFeatureGroups
shiyu1994 Jan 3, 2022
9f6aa8a
revert config.h
shiyu1994 Jan 3, 2022
12d8161
fix test settings for cuda experimental version
shiyu1994 Jan 3, 2022
354845e
skip some tests due to unsupported features or differences in impleme…
shiyu1994 Jan 4, 2022
cb49dd1
fix lint issue by adding a blank line
shiyu1994 Jan 4, 2022
2e2c696
fix lint errors by resorting imports
shiyu1994 Jan 4, 2022
3433674
fix lint errors by resorting imports
shiyu1994 Jan 4, 2022
0c94bdd
Merge branch 'master' into cuda-tree-learner-subset
shiyu1994 Jan 4, 2022
c72d555
fix lint errors by resorting imports
shiyu1994 Jan 4, 2022
63a9dc1
merge cuda.yml and cuda_exp.yml
shiyu1994 Jan 5, 2022
31ac33b
update python version in cuda.yml
shiyu1994 Jan 5, 2022
5f1f38d
remove cuda_exp.yml
shiyu1994 Jan 5, 2022
ba22deb
remove unrelated changes
shiyu1994 Jan 6, 2022
b008424
fix compilation warnings
shiyu1994 Feb 16, 2022
fad4b91
resolve conflicts with master
shiyu1994 Feb 21, 2022
55a94b5
Merge branch 'cuda-tree-learner-subset' of https://github.com/shiyu19…
shiyu1994 Feb 21, 2022
d77dd23
recover task
shiyu1994 Feb 22, 2022
6a9d530
use multi-level template in histogram construction
shiyu1994 Feb 22, 2022
4adca58
ignore NVCC related lines in parameter_generator.py
shiyu1994 Feb 22, 2022
8d99b2b
Merge remote-tracking branch 'LightGBM/master' into cuda-tree-learner…
shiyu1994 Feb 23, 2022
1e23342
update job name for CUDA tests
shiyu1994 Feb 23, 2022
f44b881
apply review suggestions
shiyu1994 Mar 8, 2022
d7b65c4
Update .github/workflows/cuda.yml
shiyu1994 Mar 9, 2022
a6a51fd
Update .github/workflows/cuda.yml
shiyu1994 Mar 9, 2022
9135582
update header
shiyu1994 Mar 9, 2022
cd101ae
remove useless TODOs
shiyu1994 Mar 9, 2022
9af98ac
remove [TODO(shiyu1994): constrain the split with min_data_in_group] …
shiyu1994 Mar 9, 2022
e34fcce
#include <LightGBM/utils/log.h> for USE_CUDA_EXP only
shiyu1994 Mar 9, 2022
499639d
fix include order
shiyu1994 Mar 9, 2022
6fe4874
fix include order
shiyu1994 Mar 9, 2022
3cf4c74
remove extra space
shiyu1994 Mar 9, 2022
34fdfe4
address review comments
shiyu1994 Mar 15, 2022
3bb91ae
add warning when cuda_exp is used together with deterministic
shiyu1994 Mar 15, 2022
e47d009
add comment about gpu_use_dp in .ci/test.sh
shiyu1994 Mar 21, 2022
53430dd
revert changing order of included headers
shiyu1994 Mar 21, 2022
2 changes: 1 addition & 1 deletion .ci/setup.sh
@@ -80,7 +80,7 @@ else # Linux
mv $AMDAPPSDK_PATH/lib/x86_64/sdk/* $AMDAPPSDK_PATH/lib/x86_64/
echo libamdocl64.so > $OPENCL_VENDOR_PATH/amdocl64.icd
fi
if [[ $TASK == "cuda" ]]; then
if [[ $TASK == "cuda" || $TASK == "cuda_exp" ]]; then
echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
apt-get update
apt-get install --no-install-recommends -y \
32 changes: 26 additions & 6 deletions .ci/test.sh
@@ -190,21 +190,41 @@ if [[ $TASK == "gpu" ]]; then
elif [[ $METHOD == "source" ]]; then
cmake -DUSE_GPU=ON -DOpenCL_INCLUDE_DIR=$AMDAPPSDK_PATH/include/ ..
fi
elif [[ $TASK == "cuda" ]]; then
sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' $BUILD_DIRECTORY/include/LightGBM/config.h
grep -q 'std::string device_type = "cuda"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
elif [[ $TASK == "cuda" || $TASK == "cuda_exp" ]]; then
if [[ $TASK == "cuda" ]]; then
sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' $BUILD_DIRECTORY/include/LightGBM/config.h
grep -q 'std::string device_type = "cuda"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
else
sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda_exp";/' $BUILD_DIRECTORY/include/LightGBM/config.h
grep -q 'std::string device_type = "cuda_exp"' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
# by default ``gpu_use_dp=false`` for efficiency. change to ``true`` here for exact results in ci tests
sed -i'.bak' 's/gpu_use_dp = false;/gpu_use_dp = true;/' $BUILD_DIRECTORY/include/LightGBM/config.h
grep -q 'gpu_use_dp = true' $BUILD_DIRECTORY/include/LightGBM/config.h || exit -1 # make sure that changes were really done
Collaborator:

I think it's better to handle this here:

LightGBM/src/io/config.cpp, lines 342 to 346 (at d130bb1):

// force gpu_use_dp for CUDA
if (device_type == std::string("cuda") && !gpu_use_dp) {
Log::Warning("CUDA currently requires double precision calculations.");
gpu_use_dp = true;
}

Also, we can avoid if/else complication by using the $TASK env variable as the value for the device_type config value and the Python-package installation flag.

Collaborator Author:

Sorry for the delay. I'll handle these unresolved comments today.

Collaborator Author (@shiyu1994, Mar 9, 2022):

> I think it's better to handle this here

I'm not sure whether I understand your idea correctly. The old CUDA version only supports double precision training, but the new CUDA version will support both double precision and single precision training; users can specify the mode through gpu_use_dp. We use single precision training in the new CUDA version by default because it is faster without hurting accuracy. However, to ensure that results are identical to those on the CPU (which uses double precision histograms), we need to switch to double precision training in the CI tests. That's why we need to replace the default gpu_use_dp setting here in test.sh.

Collaborator Author:

> Also, we can avoid if/else complication by using the $TASK env variable as the value for the device_type config value and the Python-package installation flag.

After trying, I found that it would further complicate the bash code, since single quotes treat everything inside them literally. Also, the Python option cuda-exp is not identical to the device type cuda_exp (we use - instead of _ for consistency with other Python build options).

Collaborator:

> We use single precision training in the new CUDA version by default because it is faster without hurting accuracy. However, to ensure that results are identical to those on the CPU (which uses double precision histograms), we need to switch to double precision training in the CI tests.

OK, I got it. Thanks for the explanation! Please add a short comment in the CI files explaining why we need to change the gpu_use_dp param here. It wasn't obvious to me, so I thought that single precision isn't supported, just like in our current CUDA implementation. However, that's not true according to your comments.

> After trying, I found that it would further complicate the bash code, since single quotes treat everything inside them literally.

OK, agree with you. Thanks for trying that!
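
To make the outcome of this thread concrete, here is a minimal, editor-added C++ sketch (not code from this PR) of how the config check quoted above could treat the two device types differently: the old cuda device keeps forcing double precision, while cuda_exp leaves gpu_use_dp to the user, which is why the CI script flips the default via sed instead.

```cpp
// Editorial sketch only. The "cuda" branch mirrors the snippet quoted from
// src/io/config.cpp above; the "cuda_exp" comment reflects the behaviour
// described in this thread, not actual code from this PR.
#include <iostream>
#include <string>

struct ConfigSketch {
  std::string device_type = "cpu";
  bool gpu_use_dp = false;  // single precision by default for speed

  void CheckPrecision() {
    // force gpu_use_dp for the old CUDA version
    if (device_type == std::string("cuda") && !gpu_use_dp) {
      std::cerr << "Warning: CUDA currently requires double precision calculations.\n";
      gpu_use_dp = true;
    }
    // "cuda_exp" supports both precisions, so gpu_use_dp is left as the user set it;
    // the CI tests switch it to true in .ci/test.sh to match CPU results exactly.
  }
};

int main() {
  ConfigSketch cfg;
  cfg.device_type = "cuda_exp";
  cfg.CheckPrecision();
  std::cout << "gpu_use_dp = " << std::boolalpha << cfg.gpu_use_dp << "\n";  // false
  return 0;
}
```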

fi
if [[ $METHOD == "pip" ]]; then
cd $BUILD_DIRECTORY/python-package && python setup.py sdist || exit -1
pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER.tar.gz -v --install-option=--cuda || exit -1
if [[ $TASK == "cuda" ]]; then
pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER.tar.gz -v --install-option=--cuda || exit -1
else
pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER.tar.gz -v --install-option=--cuda-exp || exit -1
fi
pytest $BUILD_DIRECTORY/tests/python_package_test || exit -1
exit 0
elif [[ $METHOD == "wheel" ]]; then
cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --cuda || exit -1
if [[ $TASK == "cuda" ]]; then
cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --cuda || exit -1
else
cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --cuda-exp || exit -1
fi
pip install --user $BUILD_DIRECTORY/python-package/dist/lightgbm-$LGB_VER*.whl -v || exit -1
pytest $BUILD_DIRECTORY/tests || exit -1
exit 0
elif [[ $METHOD == "source" ]]; then
cmake -DUSE_CUDA=ON ..
if [[ $TASK == "cuda" ]]; then
cmake -DUSE_CUDA=ON ..
else
cmake -DUSE_CUDA_EXP=ON ..
fi
fi
elif [[ $TASK == "mpi" ]]; then
if [[ $METHOD == "pip" ]]; then
15 changes: 14 additions & 1 deletion .github/workflows/cuda.yml
@@ -16,7 +16,7 @@ env:

jobs:
test:
name: cuda ${{ matrix.cuda_version }} ${{ matrix.method }} (linux, ${{ matrix.compiler }}, Python ${{ matrix.python_version }})
name: ${{ matrix.tree_learner }} ${{ matrix.cuda_version }} ${{ matrix.method }} (linux, ${{ matrix.compiler }}, Python ${{ matrix.python_version }})
runs-on: [self-hosted, linux]
timeout-minutes: 60
strategy:
@@ -27,14 +27,27 @@ jobs:
compiler: gcc
python_version: "3.8"
cuda_version: "11.5.1"
tree_learner: cuda
- method: pip
compiler: clang
python_version: "3.9"
cuda_version: "10.0"
tree_learner: cuda
- method: wheel
compiler: gcc
python_version: "3.10"
cuda_version: "9.0"
tree_learner: cuda
- method: source
compiler: gcc
python_version: "3.8"
cuda_version: "11.5.1"
tree_learner: cuda_exp
- method: pip
compiler: clang
python_version: "3.9"
cuda_version: "10.0"
tree_learner: cuda_exp
steps:
- name: Setup or update software on host machine
run: |
34 changes: 27 additions & 7 deletions CMakeLists.txt
@@ -5,6 +5,7 @@ option(USE_SWIG "Enable SWIG to generate Java API" OFF)
option(USE_HDFS "Enable HDFS support (EXPERIMENTAL)" OFF)
option(USE_TIMETAG "Set to ON to output time costs" OFF)
option(USE_CUDA "Enable CUDA-accelerated training (EXPERIMENTAL)" OFF)
option(USE_CUDA_EXP "Enable CUDA-accelerated training with more acceleration (EXPERIMENTAL)" OFF)
option(USE_DEBUG "Set to ON for Debug mode" OFF)
option(USE_SANITIZER "Use santizer flags" OFF)
set(
@@ -28,7 +29,7 @@ if(__INTEGRATE_OPENCL)
cmake_minimum_required(VERSION 3.11)
elseif(USE_GPU OR APPLE)
cmake_minimum_required(VERSION 3.2)
elseif(USE_CUDA)
elseif(USE_CUDA OR USE_CUDA_EXP)
cmake_minimum_required(VERSION 3.16)
else()
cmake_minimum_required(VERSION 3.0)
@@ -133,7 +134,7 @@ else()
add_definitions(-DUSE_SOCKET)
endif()

if(USE_CUDA)
if(USE_CUDA OR USE_CUDA_EXP)
set(CMAKE_CUDA_HOST_COMPILER "${CMAKE_CXX_COMPILER}")
enable_language(CUDA)
set(USE_OPENMP ON CACHE BOOL "CUDA requires OpenMP" FORCE)
@@ -171,8 +172,12 @@ if(__INTEGRATE_OPENCL)
endif()
endif()

if(USE_CUDA)
find_package(CUDA 9.0 REQUIRED)
if(USE_CUDA OR USE_CUDA_EXP)
if(USE_CUDA)
find_package(CUDA 9.0 REQUIRED)
else()
find_package(CUDA 10.0 REQUIRED)
endif()
include_directories(${CUDA_INCLUDE_DIRS})
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler=${OpenMP_CXX_FLAGS} -Xcompiler=-fPIC -Xcompiler=-Wall")

@@ -199,7 +204,12 @@ if(USE_CUDA)
endif()
message(STATUS "CMAKE_CUDA_FLAGS: ${CMAKE_CUDA_FLAGS}")

add_definitions(-DUSE_CUDA)
if(USE_CUDA)
add_definitions(-DUSE_CUDA)
elseif(USE_CUDA_EXP)
add_definitions(-DUSE_CUDA_EXP)
endif()

if(NOT DEFINED CMAKE_CUDA_STANDARD)
set(CMAKE_CUDA_STANDARD 11)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
@@ -369,9 +379,17 @@ file(
src/objective/*.cpp
src/network/*.cpp
src/treelearner/*.cpp
if(USE_CUDA)
if(USE_CUDA OR USE_CUDA_EXP)
src/treelearner/*.cu
endif()
if(USE_CUDA_EXP)
src/treelearner/cuda/*.cpp
src/treelearner/cuda/*.cu
src/io/cuda/*.cu
src/io/cuda/*.cpp
src/cuda/*.cpp
src/cuda/*.cu
endif()
)

add_library(lightgbm_objs OBJECT ${SOURCES})
@@ -493,14 +511,16 @@ if(__INTEGRATE_OPENCL)
target_link_libraries(lightgbm_objs PUBLIC ${INTEGRATED_OPENCL_LIBRARIES})
endif()

if(USE_CUDA)
if(USE_CUDA OR USE_CUDA_EXP)
# Disable cmake warning about policy CMP0104. Refer to issue #3754 and PR #4268.
# Custom target properties does not propagate, thus we need to specify for
# each target that contains or depends on cuda source.
set_target_properties(lightgbm_objs PROPERTIES CUDA_ARCHITECTURES OFF)
set_target_properties(_lightgbm PROPERTIES CUDA_ARCHITECTURES OFF)
set_target_properties(lightgbm PROPERTIES CUDA_ARCHITECTURES OFF)

set_target_properties(lightgbm_objs PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

# Device linking is not supported for object libraries.
# Thus we have to specify them on final targets.
set_target_properties(lightgbm PROPERTIES CUDA_RESOLVE_DEVICE_SYMBOLS ON)
4 changes: 4 additions & 0 deletions R-package/src/Makevars.in
@@ -37,6 +37,10 @@ OBJECTS = \
io/parser.o \
io/train_share_states.o \
io/tree.o \
io/dense_bin.o \
io/sparse_bin.o \
io/multi_val_dense_bin.o \
io/multi_val_sparse_bin.o \
metric/dcg_calculator.o \
metric/metric.o \
objective/objective_function.o \
4 changes: 4 additions & 0 deletions R-package/src/Makevars.win.in
@@ -38,6 +38,10 @@ OBJECTS = \
io/parser.o \
io/train_share_states.o \
io/tree.o \
io/dense_bin.o \
io/sparse_bin.o \
io/multi_val_dense_bin.o \
io/multi_val_sparse_bin.o \
metric/dcg_calculator.o \
metric/metric.o \
objective/objective_function.o \
2 changes: 2 additions & 0 deletions docs/Installation-Guide.rst
@@ -636,6 +636,8 @@ To build LightGBM CUDA version, run the following commands:
cmake -DUSE_CUDA=1 ..
make -j4

Recently, a new CUDA version with better efficiency has been implemented as an experimental feature. To build the new CUDA version, replace ``-DUSE_CUDA`` with ``-DUSE_CUDA_EXP`` in the above commands. Please note that the new version requires **CUDA** 10.0 or later libraries.

**Note**: glibc >= 2.14 is required.

**Note**: In some rare cases you may need to install OpenMP runtime library separately (use your package manager and search for ``lib[g|i]omp`` for doing this).
6 changes: 5 additions & 1 deletion docs/Parameters.rst
@@ -199,7 +199,7 @@ Core Parameters

- **Note**: please **don't** change this during training, especially when running multiple jobs simultaneously by external packages, otherwise it may cause undesirable errors

- ``device_type`` :raw-html:`<a id="device_type" title="Permalink to this parameter" href="#device_type">&#x1F517;&#xFE0E;</a>`, default = ``cpu``, type = enum, options: ``cpu``, ``gpu``, ``cuda``, aliases: ``device``
- ``device_type`` :raw-html:`<a id="device_type" title="Permalink to this parameter" href="#device_type">&#x1F517;&#xFE0E;</a>`, default = ``cpu``, type = enum, options: ``cpu``, ``gpu``, ``cuda``, ``cuda_exp``, aliases: ``device``

- device for the tree learning, you can use GPU to achieve the faster learning

@@ -209,6 +209,10 @@

- **Note**: refer to `Installation Guide <./Installation-Guide.rst#build-gpu-version>`__ to build LightGBM with GPU support

- **Note**: ``cuda_exp`` is an experimental CUDA version; the installation guide for ``cuda_exp`` is identical to that for ``cuda``

- **Note**: ``cuda_exp`` is faster than ``cuda`` and will replace ``cuda`` in the future

- ``seed`` :raw-html:`<a id="seed" title="Permalink to this parameter" href="#seed">&#x1F517;&#xFE0E;</a>`, default = ``None``, type = int, aliases: ``random_seed``, ``random_state``

- this seed is used to generate other seeds, e.g. ``data_random_seed``, ``feature_fraction_seed``, etc.
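
As an editor-added illustration of the new ``cuda_exp`` option documented above (not part of this diff), the following sketch selects the experimental device type through LightGBM's C API; the parameter strings and the ``train.csv`` file are placeholders, and the build is assumed to have been configured with ``-DUSE_CUDA_EXP=ON``.

```cpp
// Illustrative usage sketch, not code from this PR.
#include <LightGBM/c_api.h>
#include <cstdio>

int main() {
  DatasetHandle train_data = nullptr;
  BoosterHandle booster = nullptr;
  // device_type=cuda_exp picks the new CUDA tree learner; gpu_use_dp=false keeps
  // the faster single-precision histograms described in the review discussion.
  const char* booster_params =
      "objective=regression device_type=cuda_exp gpu_use_dp=false num_leaves=63";

  if (LGBM_DatasetCreateFromFile("train.csv", "header=true", nullptr, &train_data) != 0) {
    std::fprintf(stderr, "failed to load train.csv\n");
    return 1;
  }
  if (LGBM_BoosterCreate(train_data, booster_params, &booster) != 0) {
    std::fprintf(stderr, "failed to create booster\n");
    return 1;
  }
  int finished = 0;
  for (int iter = 0; iter < 10 && !finished; ++iter) {
    LGBM_BoosterUpdateOneIter(booster, &finished);  // one boosting round on the GPU
  }
  LGBM_BoosterFree(booster);
  LGBM_DatasetFree(train_data);
  return 0;
}
```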
2 changes: 2 additions & 0 deletions helpers/parameter_generator.py
@@ -34,6 +34,8 @@ def get_parameter_infos(
member_infos: List[List[Dict[str, List]]] = []
with open(config_hpp) as config_hpp_file:
for line in config_hpp_file:
if line.strip() in {"#ifndef __NVCC__", "#endif // __NVCC__"}:
continue
if "#pragma region Parameters" in line:
is_inparameter = True
elif "#pragma region" in line and "Parameters" in line:
29 changes: 29 additions & 0 deletions include/LightGBM/bin.h
@@ -119,6 +119,23 @@ class BinMapper {
}
}

/*!
* \brief Maximum categorical value
* \return Maximum categorical value for categorical features, 0 for numerical features
*/
inline int MaxCatValue() const {
if (bin_2_categorical_.size() == 0) {
return 0;
}
int max_cat_value = bin_2_categorical_[0];
for (size_t i = 1; i < bin_2_categorical_.size(); ++i) {
if (bin_2_categorical_[i] > max_cat_value) {
max_cat_value = bin_2_categorical_[i];
}
}
return max_cat_value;
}

/*!
* \brief Get sizes in byte of this object
*/
@@ -379,6 +396,10 @@ class Bin {
* \brief Deep copy the bin
*/
virtual Bin* Clone() = 0;

virtual const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, std::vector<BinIterator*>* bin_iterator, const int num_threads) const = 0;

virtual const void* GetColWiseData(uint8_t* bit_type, bool* is_sparse, BinIterator** bin_iterator) const = 0;
};


@@ -452,6 +473,14 @@ class MultiValBin {
static constexpr double multi_val_bin_sparse_threshold = 0.25f;

virtual MultiValBin* Clone() = 0;

#ifdef USE_CUDA_EXP
virtual const void* GetRowWiseData(uint8_t* bit_type,
size_t* total_size,
bool* is_sparse,
const void** out_data_ptr,
uint8_t* data_ptr_bit_type) const = 0;
#endif // USE_CUDA_EXP
};

inline uint32_t BinMapper::ValueToBin(double value) const {
Expand Down