Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix FindNVTX.cmake #3421

Merged
merged 4 commits into from Feb 28, 2022
Merged

Fix FindNVTX.cmake #3421

merged 4 commits into from Feb 28, 2022

Conversation

maxhgerlach
Copy link
Collaborator

This fixes a bug that I've introduced with #3261.

NVTX would not be found when building with HOROVOD_CUDA_HOME to locate CUDA.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@@ -10,7 +10,7 @@

set(HOROVOD_NVTX_INCLUDE $ENV{HOROVOD_NVTX_INCLUDE} CACHE PATH "Folder containing NVIDIA NVTX3 headers")

list(APPEND NVTX_ROOT ${CUDAToolkit_LIBRARY_ROOT})
list(APPEND NVTX_ROOT ${CUDAToolkit_INCLUDE_DIRS})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to put ${CUDAToolkit_INCLUDE_DIRS} on line 19 as an extra hint to search include?
CMAKE_PREFIX_PATH takes the root dir and append bin, lib, or include for the appropriate search: https://cmake.org/cmake/help/latest/variable/CMAKE_PREFIX_PATH.html

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my system, setting NVTX_ROOT properly seems to have already done the trick.

Would you say that adding this extra hint would help even more (on other systems)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For CMake 3.12+ (we require 3.13 now), setting <package>_ROOT correctly should be sufficient for find_path: https://cmake.org/cmake/help/latest/command/find_path.html

Copy link
Collaborator

@nvcastet nvcastet Feb 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It potentially ends up to be the same but CMAKE_PREFIX_PATH is usually used for the parent (root) directory instead of the subdir include. <PackageName>_ROOT is used in the same way. Specific subdir (bin/lib/include) hints are usually added to the corresponding find_* HINTS argument.
On my container CUDAToolkit_LIBRARY_ROOT is empty, which is weird, so relying on CUDAToolkit_INCLUDE_DIRS instead seems to be a good idea, but maybe it is set in some cases?
I would just have added CUDAToolkit_INCLUDE_DIRS to the find_path:

find_path(NVTX_INCLUDE_DIR
          NAMES nvtx3/nvToolsExt.h
          HINTS ${HOROVOD_NVTX_INCLUDE} ${CUDAToolkit_INCLUDE_DIRS})

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the end as you mentioned it works in both cases and here we seem to only need to search for just the include of nvtx. Out of curiosity, do you know why we don't need to find the nvtx lib too libnvToolsExt.so?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, understood. It would probably be clearer not to plug CUDAToolkit_INCLUDE_DIRS into NVTX_ROOT (although it works for us as we don't need the .so).

Out of curiosity, do you know why we don't need to find the nvtx lib too libnvToolsExt.so?

The Nsight Systems User Guide only speaks of the header file: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#nvtx-trace Might it be necessary to explicitly link that library for use cases where one does not attach that profiler?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact we are using the nvtx header-only library, so the shared lib is never needed:
https://gitlab.kitware.com/cmake/cmake/-/issues/21377

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with your suggestion, @nvcastet, there's no need to deal with the NVTX_ROOT variable at all here.

@EnricoMi
Copy link
Collaborator

Have you checked all the other places where CUDA_TOOLKIT_ROOT_DIR has been replaced with CUDAToolkit_LIBRARY_ROOT in #3261 while CUDAToolkit_INCLUDE_DIRS would have been the right one?

@maxhgerlach
Copy link
Collaborator Author

@EnricoMi, very good point. I will give the other replacements a look. Most likely they were false, too.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@maxhgerlach
Copy link
Collaborator Author

Improved two other places where I had thought CUDAToolkit_LIBRARY_ROOT would be a viable replacement for CUDA_TOOLKIT_ROOT_DIR.

@EnricoMi
Copy link
Collaborator

Improved two other places where I had thought CUDAToolkit_LIBRARY_ROOT would be a viable replacement for CUDA_TOOLKIT_ROOT_DIR.

Then I am happy with this, I leave this to @nvcastet to approve

@nvcastet
Copy link
Collaborator

Assuming CI passes, LGTM.

@github-actions
Copy link

Unit Test Results

     773 files  ±0       773 suites  ±0   8h 47m 36s ⏱️ - 19m 23s
     722 tests ±0       661 ✔️  - 14       47 💤 ±0  14 +14 
16 693 runs  ±0  11 761 ✔️  - 15  4 917 💤 ±0  15 +15 

For more details on these failures, see this check.

Results for commit fc3ebe0. ± Comparison against base commit 046c071.

@github-actions
Copy link

Unit Test Results (with flaky tests)

     951 files  +     76       951 suites  +76   11h 2m 7s ⏱️ + 58m 22s
     722 tests ±       0       659 ✔️  -      14       47 💤 ±    0  16 +14 
20 962 runs  +1 898  14 564 ✔️ +1 240  6 347 💤 +610  51 +48 

For more details on these failures, see this check.

Results for commit fc3ebe0. ± Comparison against base commit 046c071.

@maxhgerlach
Copy link
Collaborator Author

CI passes besides known failures with head framework versions (Spark Lightning, TF batchnorm, TF hvd.join on GPU)

@maxhgerlach maxhgerlach merged commit 79ded4b into horovod:master Feb 28, 2022
@maxhgerlach maxhgerlach deleted the fix-findnvtx branch February 28, 2022 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants