Allow stderr of executed cmake python code appear in logs #3398

EnricoMi · 2022-02-03T14:29:21Z

CMake executes Python code to detect installed frameworks. If that code fails, there is no trace in the logs, which makes identifying the cause very difficult. We only see it could not find the framework:

  #44 63.09     CMake Error at /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  #44 63.09       Could NOT find Mxnet (missing: Mxnet_LIBRARIES) (Required is at least
  #44 63.09       version "1.4.0")
  #44 63.09     Call Stack (most recent call first):
  #44 63.09       /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  #44 63.09       cmake/Modules/FindMxnet.cmake:65 (find_package_handle_standard_args)
  #44 63.09       horovod/mxnet/CMakeLists.txt:12 (find_package)

This allows errors from the Python code to leak into the build log:

  #44 63.07     Traceback (most recent call last):
  #44 63.07       File "<string>", line 1, in <module>
  #44 63.07       File "/usr/local/lib/python3.8/dist-packages/mxnet/__init__.py", line 23, in <module>
  #44 63.07         from .context import Context, current_context, cpu, gpu, cpu_pinned
  #44 63.07       File "/usr/local/lib/python3.8/dist-packages/mxnet/context.py", line 20, in <module>
  #44 63.07         from .base import _LIB
  #44 63.07       File "/usr/local/lib/python3.8/dist-packages/mxnet/base.py", line 293, in <module>
  #44 63.07         _LIB = _load_lib()
  #44 63.07       File "/usr/local/lib/python3.8/dist-packages/mxnet/base.py", line 284, in _load_lib
  #44 63.07         lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)
  #44 63.07       File "/usr/lib/python3.8/ctypes/__init__.py", line 369, in __init__
  #44 63.07         self._handle = _dlopen(self._name, mode)
  #44 63.07     OSError: libcusolver.so.10: cannot open shared object file: No such file or directory
  #44 63.09     CMake Error at /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  #44 63.09       Could NOT find Mxnet (missing: Mxnet_LIBRARIES) (Required is at least
  #44 63.09       version "1.4.0")
  #44 63.09     Call Stack (most recent call first):
  #44 63.09       /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  #44 63.09       cmake/Modules/FindMxnet.cmake:65 (find_package_handle_standard_args)
  #44 63.09       horovod/mxnet/CMakeLists.txt:12 (find_package)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

maxhgerlach · 2022-02-03T15:42:04Z

There was a related PR #3203, which also wanted to change logging levels, but I didn't really understand why.

github-actions · 2022-02-03T17:44:25Z

Unit Test Results

    801 files ±    0     801 suites ±0 9h 45m 57s ⏱️ + 28m 9s
    722 tests -     8     675 ✔️ -     4     47 💤 +    4 0 ❌ ±0
17 357 runs +278 12 182 ✔️ +136 5 175 💤 +158 0 ❌ ±0

Results for commit c66b7bd. ± Comparison against base commit 19040e9.

This pull request removes 8 tests.

test.parallel.test_adasum_tensorflow
test.parallel.test_keras
test.parallel.test_process_sets_multi_comm
test.parallel.test_process_sets_static
test.parallel.test_tensorflow
test.parallel.test_tensorflow2_keras
test.parallel.test_tensorflow_keras
test.parallel.test_xla

This pull request skips 4 tests.

test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error

♻️ This comment has been updated with latest results.

github-actions · 2022-02-03T17:44:38Z

Unit Test Results (with flaky tests)

    895 files -     94     895 suites - 94 10h 26m 7s ⏱️ - 1h 17m 20s
    722 tests -       8     673 ✔️ -       3     47 💤 +    4 2 ❌ - 1
19 424 runs - 1 519 13 488 ✔️ - 1 222 5 934 💤 - 242 2 ❌ - 7

For more details on these failures, see this check.

Results for commit c66b7bd. ± Comparison against base commit 19040e9.

This pull request removes 8 tests.

test.parallel.test_adasum_tensorflow
test.parallel.test_keras
test.parallel.test_process_sets_multi_comm
test.parallel.test_process_sets_static
test.parallel.test_tensorflow
test.parallel.test_tensorflow2_keras
test.parallel.test_tensorflow_keras
test.parallel.test_xla

This pull request skips 4 tests.

test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error

♻️ This comment has been updated with latest results.

EnricoMi · 2022-02-03T19:26:10Z

There was a related PR #3203, which also wanted to change logging levels, but I didn't really understand why.

I like your proposed outline to handle stderr content. I would like to see that in all places, not just one one framework.

Lets leave that for #3203 and have ERROR_QUITE removed here once and for all.

EnricoMi · 2022-02-03T19:31:13Z

This pull request removes 8 tests.

@maxhgerlach any idea why some of the tests may have disappeared (not even skipped)?

EnricoMi · 2022-02-03T19:38:44Z

This pull request removes 8 tests.

@maxhgerlach any idea why some of the tests may have disappeared (not even skipped)?

Looks like the HEAD failures did not make it into the test results. But this would also mean that these tests only run against HEAD versions.

maxhgerlach

Hmm, there were numerous problems with tf-nightly over the last few days. I looked into them for a bit because at first there appeared to be linker failures related to the changes I introduced with #3128, but those disappeared with a newer nightly build. Seems to be somewhat in turmoil.

I don't know off hand which tests would only be run on head versions.

Allow stderr of executed cmake python code appear in logs

c66b7bd

Signed-off-by: Enrico Minack <github@enrico.minack.dev>

EnricoMi requested review from nvcastet and maxhgerlach February 3, 2022 14:29

maxhgerlach approved these changes Feb 4, 2022

View reviewed changes

EnricoMi merged commit 046c071 into master Feb 4, 2022

EnricoMi deleted the branch-cmake-stderr branch February 4, 2022 12:11

EnricoMi mentioned this pull request Apr 19, 2022

Fix for Issue 2322 #3203

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow stderr of executed cmake python code appear in logs #3398

Allow stderr of executed cmake python code appear in logs #3398

EnricoMi commented Feb 3, 2022

maxhgerlach commented Feb 3, 2022

github-actions bot commented Feb 3, 2022 •

edited

github-actions bot commented Feb 3, 2022 •

edited

EnricoMi commented Feb 3, 2022

EnricoMi commented Feb 3, 2022

EnricoMi commented Feb 3, 2022 •

edited

maxhgerlach left a comment •

edited

Allow stderr of executed cmake python code appear in logs #3398

Allow stderr of executed cmake python code appear in logs #3398

Conversation

EnricoMi commented Feb 3, 2022

maxhgerlach commented Feb 3, 2022

github-actions bot commented Feb 3, 2022 • edited

Unit Test Results

github-actions bot commented Feb 3, 2022 • edited

Unit Test Results (with flaky tests)

EnricoMi commented Feb 3, 2022

EnricoMi commented Feb 3, 2022

EnricoMi commented Feb 3, 2022 • edited

maxhgerlach left a comment • edited

Choose a reason for hiding this comment

github-actions bot commented Feb 3, 2022 •

edited

github-actions bot commented Feb 3, 2022 •

edited

EnricoMi commented Feb 3, 2022 •

edited

maxhgerlach left a comment •

edited