
Update to CMake 3.13 for better CUDA support and to enable build concurrency #3261

Merged · 19 commits · Jan 17, 2022

Conversation

@maxhgerlach (Collaborator) commented Nov 5, 2021

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This PR replaces the build process for Horovod's CUDA kernels with one relying on features offered by recent versions of CMake. In particular, the deprecated FindCUDA module is replaced by CMake's first-class CUDA language support and the more modern FindCUDAToolkit module. This fixes the race condition of #2543 and allows us to re-enable build concurrency via -j, which will certainly be appreciated in many places.
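Roughly, as a sketch (target and source file names here are purely illustrative, not Horovod's actual CMake files), the move looks like this:

    # Old, deprecated approach (FindCUDA module):
    #   find_package(CUDA REQUIRED)
    #   cuda_add_library(horovod_cuda_kernels kernels.cu)
    #
    # New approach: first-class CUDA language support plus FindCUDAToolkit.
    cmake_minimum_required(VERSION 3.13)
    project(horovod_kernels LANGUAGES CXX)
    enable_language(CUDA)                  # CMake drives nvcc directly, no FindCUDA macros
    find_package(CUDAToolkit REQUIRED)     # modern module for CUDA include/library lookup
    add_library(horovod_cuda_kernels STATIC kernels.cu)
    target_link_libraries(horovod_cuda_kernels PRIVATE CUDA::cudart)

(On CMake older than 3.17, find_package(CUDAToolkit) is served by the bundled module mentioned in the edit below.)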

To ensure that these new features are available, I bumped the minimum required CMake version to 3.18 (since lowered to 3.13, see the edit below). I believe this should not be a big problem even on older systems, as a recent CMake can usually be obtained easily via pip install cmake.

Edit: By shipping a module based on FindCUDAToolkit from CMake 3.17.5 we can build with CMake >= 3.13.

I am not an expert with CMake by any means, so any feedback would be more than welcome!

Fixes #2543.

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@maxhgerlach changed the title from "Update to CUDA support from CMake 3.18 and enable build concurrency" to "Update to CMake 3.18 for better CUDA support and enable build concurrency" on Nov 5, 2021
@maxhgerlach marked this pull request as draft on November 5, 2021 17:17
@chongxiaoc (Collaborator):

Mentioning that this is related to #2543.

@github-actions (bot) commented Nov 5, 2021

Unit Test Results

830 files ±0 · 830 suites ±0 · 9h 28m 2s ⏱️ +30m 7s
717 tests ±0 · 672 ✔️ ±0 · 45 💤 ±0 · 0 ±0
17 988 runs ±0 · 12 644 ✔️ −14 · 5 344 💤 +14 · 0 ±0

Results for commit 9f42442. ± Comparison against base commit 31bba3b.

♻️ This comment has been updated with latest results.

@github-actions (bot) commented Nov 5, 2021

Unit Test Results (with flaky tests)

962 files +44 · 962 suites +44 · 10h 51m 56s ⏱️ +1h 11m 6s
717 tests ±0 · 666 ✔️ −5 · 45 💤 ±0 · 6 +5
20 732 runs +714 · 14 539 ✔️ +639 · 6 184 💤 +68 · 9 +7

For more details on these failures, see this check.

Results for commit 9f42442. ± Comparison against base commit 31bba3b.

♻️ This comment has been updated with latest results.

@maxhgerlach changed the title from "Update to CMake 3.18 for better CUDA support and enable build concurrency" to "Update to CMake 3.18 for better CUDA support and to enable build concurrency" on Nov 5, 2021
@nvcastet (Collaborator) commented Nov 11, 2021

@maxhgerlach: To solve the build concurrency issue: if I remember correctly, the problem was not the CMake version. It is related to the fact that we build two versions of the same library, which causes intermediate files to be overwritten when the two versions are built concurrently.
I have not had the chance to make a PR for it, but the fix is easy: just add a build dependency between horovod_cuda_kernels and compatible_horovod_cuda_kernels. That solved the problem when I tested the stability of the build in a for loop.

@nvcastet (Collaborator) commented Nov 11, 2021

I also think it is a good idea to update the CMake version too. :)
FYI, just for comparison, PyTorch has a minimum of 3.10: https://github.com/pytorch/pytorch/blob/master/CMakeLists.txt#L1
I don't have a list of the latest CMake versions supported by the different package managers (pip, conda, ...) across different architectures and OSs. But for sure we probably do not want to block someone who doesn't have the right version of CMake in their production environment, since our compilation happens at install time.

@nvcastet (Collaborator) commented Nov 11, 2021

@maxhgerlach For build concurrency, just adding

add_dependencies(compatible_horovod_cuda_kernels horovod_cuda_kernels)

will do the trick.
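For context, a minimal sketch of where that dependency would sit, assuming the two cuda_add_library() targets that master currently builds (source file names are illustrative):

    # As on current master: the deprecated FindCUDA module builds two flavors
    # of the same kernels from the same sources.
    find_package(CUDA REQUIRED)
    cuda_add_library(horovod_cuda_kernels cuda_kernels.cu)
    cuda_add_library(compatible_horovod_cuda_kernels cuda_kernels.cu)
    # Forcing an ordering between the two targets keeps their intermediate files
    # from overwriting each other under `make -j`:
    add_dependencies(compatible_horovod_cuda_kernels horovod_cuda_kernels)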

@maxhgerlach (Collaborator, Author) commented Nov 17, 2021

Hi @nvcastet, thanks for your comments!

    To solve the build concurrency issue: if I remember correctly, the problem was not the CMake version. It is related to the fact that we build two versions of the same library, which causes intermediate files to be overwritten when the two versions are built concurrently.

I was under the impression that this race condition was ultimately caused by a bug in CMake's FindCUDA module (needed for the two cuda_add_library() calls in master), which has been deprecated for a while now. After some analysis @Flamefire summarized the situation in #2543 (comment) and also reported the issue upstream at https://gitlab.kitware.com/cmake/cmake/-/issues/21623.

For Horovod 1.0 I think it would be beneficial to cut ties with this deprecated module, which may or may not work correctly in various versions of CMake, and instead move to CMake's first-class CUDA language support, fixing the race condition at the same time (as proposed by this draft PR). enable_language(CUDA) was already introduced with CMake 3.8, but the FindCUDA module could not be dropped completely back then because of some ancillary functionality. The replacement for that (finding CUDA include and library directories for non-CUDA targets etc.) is now provided by FindCUDAToolkit, which came with CMake 3.17.
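A small sketch of what that ancillary functionality looks like with the new module, assuming a hypothetical C++-only target that just needs the CUDA headers and runtime library:

    find_package(CUDAToolkit REQUIRED)       # available since CMake 3.17
    add_library(horovod_common common.cc)    # hypothetical non-CUDA target
    # Imported targets carry the include directories and link flags that FindCUDA
    # used to expose through variables:
    target_link_libraries(horovod_common PRIVATE CUDA::cudart)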

    I also think it is a good idea to update the CMake version too. :)
    FYI, just for comparison, PyTorch has a minimum of 3.10: https://github.com/pytorch/pytorch/blob/master/CMakeLists.txt#L1
    I don't have a list of the latest CMake versions supported by the different package managers (pip, conda, ...) across different architectures and OSs. But for sure we probably do not want to block someone who doesn't have the right version of CMake in their production environment, since our compilation happens at install time.

If we don't want to require a version quite as recent as 3.18 or 3.17, we may also get away with packaging just that FindCUDAToolkit module and requiring some intermediate version of CMake >= 3.8 (as suggested by @leezu in #2543 (comment)). What do you think?
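One way that packaging could look, as a rough sketch (the module directory here is hypothetical):

    cmake_minimum_required(VERSION 3.13)
    project(horovod LANGUAGES CXX)
    # find_package() consults CMAKE_MODULE_PATH before CMake's own Modules directory,
    # so a bundled FindCUDAToolkit.cmake fills the gap on CMake < 3.17.
    set(CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/Modules" ${CMAKE_MODULE_PATH})
    find_package(CUDAToolkit REQUIRED)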

@nvcastet (Collaborator):

Thanks @maxhgerlach! I missed @Flamefire's comment on cuda_add_library() being buggy when used on several versions of the same library and FindCUDAToolkit fixing the issue. In that case, I agree that upgrading CMake is better than adding the extra build dependency between the libraries.
Thanks a lot for the thorough explanation and the PR to upgrade CMake.
I found the list of CMake versions that come by default on the different OSs: https://gitlab.kitware.com/cmake/community/-/wikis/CMake-Versions-on-Linux-Distros
I agree that users can easily get a newer version via pip or conda, so jumping to 3.18 may not be an issue.
@leezu Do you know why MXNet does not move to CMake 3.17 or 3.18 to get FindCUDAToolkit directly from CMake?
@tgaddair What are your thoughts on the CMake version?

@maxhgerlach marked this pull request as ready for review on November 19, 2021 09:32
@maxhgerlach (Collaborator, Author):

The only tests that still fail are for torchhead and mxnethead, and those issues appear to have been fixed on master. Apart from that, all builds and tests now run fine in CI.

The eightfold build concurrency seems to have shaved a few minutes off the Docker build times, but I'm not sure how comparable these are between GitHub Actions workflow runs at different times.

From the table that @nvcastet linked to, requiring only (say) CMake 3.10 instead of 3.18 would enable people to build with the standard package sources of these distros:

  • Red Hat 8 (2019)
  • openSUSE 15.1 (2019)
  • Ubuntu 18.04 (2018)
  • Debian 10 (2019)

Apparently even Ubuntu 20.04 only comes with CMake 3.16 and would require users to add a more recent extra package.

Then again it's really pretty easy to get a recent CMake via pip, conda, snap, or a PPA or similar (and many C++ projects require this).

Anyway, if we find it worthwhile to lower the version requirement somewhat from 3.18, I'd be willing to look into it, but it might take some time.

@EnricoMi (Collaborator):

This is awesome! Build times in GitHub Actions sadly do not benefit much from the concurrency, as the workers have only two cores. But users definitely benefit from this. 🎉

@maxhgerlach (Collaborator, Author):

Good point regarding the number of cores available to the workers, @EnricoMi! A default MAKEFLAGS=-j8 still seems to work fine, though, even if that's more processes than the VMs can use effectively.

When I set unlimited scaling with -j, I had weird problems with hanging builds and disappearing logs (this run https://github.com/horovod/horovod/runs/4243022771), so probably some effective limit to memory or ... was exceeded. I had assumed that make -j would not schedule more processes than the number of available CPU threads, but a brief look into man make just proved me wrong.

@EnricoMi (Collaborator):

Interesting default imposed by make. You could set MAKEFLAGS=-j2 in our ci.yaml if the Horovod default of -j8 is a problem for GitHub.

@maxhgerlach (Collaborator, Author) commented Dec 16, 2021

After rebasing onto master, enable_language(CUDA) no longer works on the ppc64le Jenkins worker.

I would get an "error: identifier "__ieee128" is undefined". This appears to be a bug with GCC 8+ and CUDA 10; see LLNL/blt#341 (comment). I decided to disable quadruple precision there via -mno-float128, which shouldn't be an issue for Horovod.

libstdc++ with GCC 8.2, however, has a bug that prevents compilation with -mno-float128: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84654
We would need to upgrade to at least 8.3, but I didn't manage to do that via conda in our Jenkinsfile. Trying to downgrade to 7.3 also hangs for a while, then fails with lots of conflicts (edit: https://powerci.osuosl.org/job/Horovod_PPC64LE_GPU_PIPELINE/view/change-requests/job/PR-3261/13/console).

@nvcastet, would you know how to easily upgrade or downgrade the compiler in that Docker container?

@nvcastet (Collaborator):

Most of the packages for the OpenCE release are built with 8.2.0:
https://github.com/open-ce/horovod-feedstock/blob/main/config/conda_build_config.yaml
So it would be great to stay on a matching version; we don't want to break their infrastructure.
@maxhgerlach What settings change when building with enable_language(CUDA) (compiler flags?) versus the CMake setup we currently use (where ppc64le builds fine)?
@npanpaliya Any thoughts?

@maxhgerlach (Collaborator, Author) commented Dec 16, 2021

OK, then it makes sense to stick with GCC 8.2 and look for some other workaround. 🙂

    What settings change when building with enable_language(CUDA) (compiler flags?) versus the CMake setup we currently use (where ppc64le builds fine)?

With enable_language(CUDA) CMake appears to compile a test program CMakeCUDACompilerId.cu (likely generated from CMakeCUDACompilerId.cu.in) and this fails in the current ppc64le container (copied from https://powerci.osuosl.org/job/Horovod_PPC64LE_GPU_PIPELINE/view/change-requests/job/PR-3261/8/console):

      #$ "/opt/anaconda3/envs/wmlce/bin"/powerpc64le-conda_cos7-linux-gnu-c++
      -D__CUDA_ARCH__=300 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__
      -D__NVCC__ "-I/usr/local/cuda/bin/../targets/ppc64le-linux/include"
      -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=2
      -D__CUDACC_VER_BUILD__=89 -include "cuda_runtime.h"
      "CMakeCUDACompilerId.cu" -o "tmp/CMakeCUDACompilerId.cpp1.ii"

      #$ cicc --c++14 --gnu_version=80200 --allow_managed --unsigned_chars -arch
      compute_30 -m64 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name
      "CMakeCUDACompilerId.fatbin.c" -tused -nvvmir-library
      "/usr/local/cuda/bin/../nvvm/libdevice/libdevice.10.bc"
      --gen_module_id_file --module_id_file_name
      "tmp/CMakeCUDACompilerId.module_id" --orig_src_file_name
      "CMakeCUDACompilerId.cu" --gen_c_file_name
      "tmp/CMakeCUDACompilerId.cudafe1.c" --stub_file_name
      "tmp/CMakeCUDACompilerId.cudafe1.stub.c" --gen_device_file_name
      "tmp/CMakeCUDACompilerId.cudafe1.gpu" "tmp/CMakeCUDACompilerId.cpp1.ii" -o
      "tmp/CMakeCUDACompilerId.ptx"


      /opt/anaconda3/envs/wmlce/powerpc64le-conda_cos7-linux-gnu/include/c++/8.2.0/type_traits(335):
      error: identifier "__ieee128" is undefined

(Before the last update of the ppc64le Jenkins we were on gcc 7.3, which doesn't trigger this bug.)

I think the --c++14 in the second command line might explain why we don't see the issue when we build Horovod's CUDA kernels. Those are still compiled in C++11 mode. So sneaking in a -std=c++11 might help here!
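As a sketch, limited to the affected configuration and with the flag set before enable_language(CUDA) runs its compiler check:

    if (CMAKE_CXX_COMPILER_ID MATCHES GNU AND CMAKE_SYSTEM_PROCESSOR MATCHES ppc64le AND
        CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 8.0)
      # Keep nvcc's host pass in C++11 mode so the __ieee128 bug is not triggered.
      set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -std=c++11")
    endif()
    enable_language(CUDA)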

@maxhgerlach force-pushed the update-cmake branch 2 times, most recently from 5d758a5 to 7d55479 on December 17, 2021 08:56
@maxhgerlach (Collaborator, Author) commented Dec 17, 2021

Putting -std=c++11 into CMAKE_CUDA_FLAGS has indeed fixed the ppc64le build.

@maxhgerlach (Collaborator, Author):

The latest test failure appears to be something related to Ray on Buildkite and is probably not caused by this PR.

@nvcastet (Collaborator) left a review comment:

Looks good to me. Thanks Max for the PR!

Review thread on this snippet from the top-level CMakeLists.txt:

    if (CMAKE_CUDA_COMPILER)
      if ((CMAKE_CXX_COMPILER_ID MATCHES GNU) AND (CMAKE_SYSTEM_PROCESSOR MATCHES ppc64le))
        if (CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 8.0)
          set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -std=c++11")
@nvcastet (Collaborator):

Why is that needed if in horovod/common/ops/cuda/CMakeLists.txt you already set

    set(CMAKE_CUDA_STANDARD 11)

?

@maxhgerlach (Collaborator, Author):

Hi @nvcastet, thanks for the review and all the advice earlier!

The -std=c++11 flag here is for enable_language(CUDA) in this top-level CMakeLists.txt. CMake will apparently compile a test program at that point to gauge whether the compiler is really set up correctly, etc. That fails on ppc64le with our versions of GCC and CUDA, however, because of a float128-related bug, and one way to circumvent that is to disable C++14 support. I tried set(CMAKE_CUDA_STANDARD 11) first to achieve that, but that setting is apparently ignored at this stage, so we got the same error: Jenkins log, intermediate commit. In contrast, CMAKE_CUDA_FLAGS is not ignored there. I don't know why.
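A sketch of the difference (not the exact Horovod layout): CMAKE_CUDA_FLAGS is picked up by the compiler-identification compile that enable_language(CUDA) triggers, whereas CMAKE_CUDA_STANDARD only initializes a property on targets defined afterwards, which would explain why it has no effect at that earlier stage:

    # Top-level CMakeLists.txt
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -std=c++11")  # seen by the compiler-id test program
    enable_language(CUDA)

    # horovod/common/ops/cuda/CMakeLists.txt
    set(CMAKE_CUDA_STANDARD 11)   # applies to the targets below, not to the detection step above
    add_library(horovod_cuda_kernels cuda_kernels.cu)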

From https://cliutils.gitlab.io/modern-cmake/chapters/packages/CUDA.html:

    Unlike the older languages, CUDA support has been rapidly
    evolving, and building CUDA is hard, so I would recommend you
    require a very recent version of CMake! CMake 3.17 and 3.18 have a
    lot of improvements directly targeting CUDA.

Commit messages (each signed off by Max H. Gerlach <git@maxgerlach.de>):

  • Else build arg NCCL_VERSION does not override the env variable from the base container.
  • This appears to be a bug with GCC 8+ and CUDA 10. It's mitigated by building with C++11 instead of C++14. Alternatively we could disable quadruple precision (LLNL/blt#341 (comment)). However, libstdc++ with GCC 8.2 has a bug preventing compilation with -mno-float128: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84654
  • We achieve this by shipping a FindCUDAToolkit.cmake based on CMake 3.17.5.
  • Version 3.13 seems to be unavailable via Kitware's apt repo and the pip command line is easier anyway.
@maxhgerlach (Collaborator, Author):

Merging this now as overall feedback was positive.

I'll post a follow-up PR shortly to automatically install a recent CMake to a temporary location and use that to build Horovod.


Successfully merging this pull request may close this issue: Race condition in CMake (#2543).