Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build Horovod with temporarily installed CMake if necessary #3371

Merged
merged 7 commits into from Mar 1, 2022

Conversation

maxhgerlach
Copy link
Collaborator

@maxhgerlach maxhgerlach commented Jan 17, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This is a follow-up to PR #3261, which would require users of many distros to install a recent CMake from an external source (like pip).

If such a situation is identified in setup.py, we can automatically install a sufficiently recent CMake to a temporary directory and use that to build Horovod. That directory is cleaned up afterwards and the user or system environment is not affected. If HOROVOD_CMAKE is set, only that CMake binary is used without any change of behavior.

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@github-actions
Copy link

github-actions bot commented Jan 17, 2022

Unit Test Results

     858 files  +     56       858 suites  +56   9h 9m 0s ⏱️ + 28m 12s
     717 tests ±       0       672 ✔️ ±    0       45 💤 ±    0  0 ±0 
18 652 runs  +1 328  13 064 ✔️ +826  5 588 💤 +502  0 ±0 

Results for commit c9f0ad0. ± Comparison against base commit a5edcd0.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Jan 17, 2022

Unit Test Results (with flaky tests)

     998 files  +   108       998 suites  +108   9h 58m 38s ⏱️ + 37m 44s
     717 tests ±       0       670 ✔️  -        1       45 💤 ±    0  2 +1 
21 490 runs  +2 060  14 735 ✔️ +1 158  6 753 💤 +901  2 +1 

For more details on these failures, see this check.

Results for commit c9f0ad0. ± Comparison against base commit a5edcd0.

♻️ This comment has been updated with latest results.

@@ -33,6 +33,7 @@ RUN add-apt-repository ppa:ubuntu-toolchain-r/test
RUN apt-get update -qq && apt-get install -y --no-install-recommends \
wget \
ca-certificates \
cmake \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could rely here on setup.py installing the latest cmake so we always test this really works

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this apt-get line currently installs cmake 3.10 on the Ubuntu 18.04 we use for testing (too old to build Horovod). So yes, as it stands, we rely on setup.py to install a temporary CMake 3.13 to build Horovod and the mechanism from this PR is tested. This would change once we update to Ubuntu 20.04 or later.

I left cmake in here for any non-Horovod builds in this Docker container, e.g., for oneCCL.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, and just for the record: The ppc64le build on Jenkins currently comes with a pretty recent CMake on its own, so using a recent system CMake to build Horovod is also still under test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, our ancient Ubuntu 18.04 means this apt-get intall cmake is not sufficient. Then it makes sense we need to add that extra code to the CodeQL yaml. Once we update to Ubuntu 20.04 we will lose testing that auto-install. We should remember when we upgrade to keep Ubuntu 18.04 for one of the test combinations, e.g. TF 1.15 to see that auto-install feature still being tested (as long as we want to support Ubuntu 18.04 in our test setup).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remember when we upgrade to keep Ubuntu 18.04 for one of the test combinations, e.g. TF 1.15 to see that auto-install feature still being tested (as long as we want to support Ubuntu 18.04 in our test setup).

That's a good plan!

https://github.com/horovod/horovod/runs/4852098212?check_suite_focus=true#step:10:4652

Yep, here we can see that it used a CMake binary from /tmp:

11.63 Could not find a recent CMake to build Horovod. Attempting to install CMake 3.13 to a temporary location via pip.
...
13.92 Running CMake in build/temp.linux-x86_64-3.7/RelWithDebInfo:
13.92 /tmp/horovod-cmake-tmpmj4n5l40/bin/run_cmake /tmp/pip-req-build-fzb3ontk -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-req-build-fzb3ontk/build/lib.linux-x86_64-3.7 -DPYTHON_EXECUTABLE:FILEPATH=/usr/bin/python

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note: In the meantime we have moved our test images to Ubuntu 20.04, but a few test images still use Ubuntu 18.04, so we now test both.

@@ -100,6 +100,9 @@ jobs:
sed -i -e "s%^# setup ssh service$%${command//$'\n'/\\n}\n\n# setup ssh service%" Dockerfile.test.?pu

command=$(cat <<EOF
# Install recent CMake
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is installing cmake through apt-get installin the Dockerfiles not sufficient here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need CMake 3.13 to compile the C++ parts of Horovod. On Ubuntu 18.04 (and without extra package sources like a PPA from Kitware) apt-get install cmake installs CMake 3.10 IIRC.

The mechanism of this PR does not work with the C++ build for CodeQL because we don't use setup.py there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can probably be removed again as we have moved to Ubuntu 20.04 for our test images in the meantime.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this and also went back to installing CMake via apt in the Dockerfiles based on Ubuntu 20.04 (docker/*/Dockerfile).

Copy link
Collaborator

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, leave it to @tgaddair and @nvcastet to approve.

@EnricoMi EnricoMi mentioned this pull request Mar 1, 2022
@@ -100,6 +100,9 @@ jobs:
sed -i -e "s%^# setup ssh service$%${command//$'\n'/\\n}\n\n# setup ssh service%" Dockerfile.test.?pu

command=$(cat <<EOF
# Install recent CMake
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can probably be removed again as we have moved to Ubuntu 20.04 for our test images in the meantime.

@@ -33,6 +33,7 @@ RUN add-apt-repository ppa:ubuntu-toolchain-r/test
RUN apt-get update -qq && apt-get install -y --no-install-recommends \
wget \
ca-certificates \
cmake \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note: In the meantime we have moved our test images to Ubuntu 20.04, but a few test images still use Ubuntu 18.04, so we now test both.

Copy link
Collaborator

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@tgaddair tgaddair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! We ran into this issue as well using master, so I agree it would be good to get this into the release.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
(accidental remainder from previous PR)

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
This tests that Horovod can be built with an automatically installed CMake from a temp dir.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@EnricoMi
Copy link
Collaborator

EnricoMi commented Mar 1, 2022

@maxhgerlach I have rebased this with master, lets see if we now get the tests terminate

@tgaddair tgaddair merged commit 2632c05 into horovod:master Mar 1, 2022
@EnricoMi
Copy link
Collaborator

EnricoMi commented Mar 2, 2022

@maxhgerlach looks like the last minute change to the horovod and horovod-cpu images broke Docker image generation. The GitHub action steps take forever (what I observed in this branch before merge, and which is the case for branches rebased to master since):

https://github.com/horovod/horovod/runs/5385819691?check_suite_focus=true
https://github.com/horovod/horovod/runs/5385981381?check_suite_focus=true

@EnricoMi
Copy link
Collaborator

EnricoMi commented Mar 2, 2022

The change triggered apt-get to ask for input:

debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
Configuring tzdata
------------------

Please select the geographic area in which you live. Subsequent configuration
questions will narrow this down by presenting a list of cities, representing
the time zones in which they are located.

  1. Africa      4. Australia  7. Atlantic  10. Pacific  13. Etc
  2. America     5. Arctic     8. Europe    11. SystemV
  3. Antarctica  6. Asia       9. Indian    12. US
Geographic area: 

I have fixed that for other images already, will apply the fix here too.

@maxhgerlach
Copy link
Collaborator Author

maxhgerlach commented Mar 2, 2022

Great find, thanks @EnricoMi!

Somehow Github Actions doesn't show me any logs for the hanging steps. Do you know of any way to get them?

Edit: Not possible and it's been a known issue for 2+ years, oof. https://github.community/t/how-to-see-the-full-log-while-a-workflow-is-in-progress/17455
Note to self: Test Docker builds locally...

@EnricoMi
Copy link
Collaborator

EnricoMi commented Mar 2, 2022

The full log becomes available after the job terminates. While a job is running, only new lines are shown.

@EnricoMi
Copy link
Collaborator

EnricoMi commented Mar 2, 2022

Note to self: add timeout to docker build jobs ;-)

@maxhgerlach
Copy link
Collaborator Author

maxhgerlach commented Mar 2, 2022

While a job is running, only new lines are shown.

And the only way to get to see those appears to be to keep a tab open and to not let your computer go to sleep...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants