Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow stderr of executed cmake python code appear in logs #3398

Merged
merged 1 commit into from Feb 4, 2022

Conversation

EnricoMi
Copy link
Collaborator

@EnricoMi EnricoMi commented Feb 3, 2022

CMake executes Python code to detect installed frameworks. If that code fails, there is no trace in the logs, which makes identifying the cause very difficult. We only see it could not find the framework:

  #44 63.09     CMake Error at /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  #44 63.09       Could NOT find Mxnet (missing: Mxnet_LIBRARIES) (Required is at least
  #44 63.09       version "1.4.0")
  #44 63.09     Call Stack (most recent call first):
  #44 63.09       /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  #44 63.09       cmake/Modules/FindMxnet.cmake:65 (find_package_handle_standard_args)
  #44 63.09       horovod/mxnet/CMakeLists.txt:12 (find_package)

This allows errors from the Python code to leak into the build log:

  #44 63.07     Traceback (most recent call last):
  #44 63.07       File "<string>", line 1, in <module>
  #44 63.07       File "/usr/local/lib/python3.8/dist-packages/mxnet/__init__.py", line 23, in <module>
  #44 63.07         from .context import Context, current_context, cpu, gpu, cpu_pinned
  #44 63.07       File "/usr/local/lib/python3.8/dist-packages/mxnet/context.py", line 20, in <module>
  #44 63.07         from .base import _LIB
  #44 63.07       File "/usr/local/lib/python3.8/dist-packages/mxnet/base.py", line 293, in <module>
  #44 63.07         _LIB = _load_lib()
  #44 63.07       File "/usr/local/lib/python3.8/dist-packages/mxnet/base.py", line 284, in _load_lib
  #44 63.07         lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)
  #44 63.07       File "/usr/lib/python3.8/ctypes/__init__.py", line 369, in __init__
  #44 63.07         self._handle = _dlopen(self._name, mode)
  #44 63.07     OSError: libcusolver.so.10: cannot open shared object file: No such file or directory
  #44 63.09     CMake Error at /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  #44 63.09       Could NOT find Mxnet (missing: Mxnet_LIBRARIES) (Required is at least
  #44 63.09       version "1.4.0")
  #44 63.09     Call Stack (most recent call first):
  #44 63.09       /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
  #44 63.09       cmake/Modules/FindMxnet.cmake:65 (find_package_handle_standard_args)
  #44 63.09       horovod/mxnet/CMakeLists.txt:12 (find_package)

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@maxhgerlach
Copy link
Collaborator

There was a related PR #3203, which also wanted to change logging levels, but I didn't really understand why.

@github-actions
Copy link

github-actions bot commented Feb 3, 2022

Unit Test Results

     801 files  ±    0       801 suites  ±0   9h 45m 57s ⏱️ + 28m 9s
     722 tests  -     8       675 ✔️  -     4       47 💤 +    4  0 ±0 
17 357 runs  +278  12 182 ✔️ +136  5 175 💤 +158  0 ±0 

Results for commit c66b7bd. ± Comparison against base commit 19040e9.

This pull request removes 8 tests.
test.parallel.test_adasum_tensorflow
test.parallel.test_keras
test.parallel.test_process_sets_multi_comm
test.parallel.test_process_sets_static
test.parallel.test_tensorflow
test.parallel.test_tensorflow2_keras
test.parallel.test_tensorflow_keras
test.parallel.test_xla
This pull request skips 4 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Feb 3, 2022

Unit Test Results (with flaky tests)

     895 files   -      94       895 suites   - 94   10h 26m 7s ⏱️ - 1h 17m 20s
     722 tests  -        8       673 ✔️  -        3       47 💤 +    4  2  - 1 
19 424 runs   - 1 519  13 488 ✔️  - 1 222  5 934 💤  - 242  2  - 7 

For more details on these failures, see this check.

Results for commit c66b7bd. ± Comparison against base commit 19040e9.

This pull request removes 8 tests.
test.parallel.test_adasum_tensorflow
test.parallel.test_keras
test.parallel.test_process_sets_multi_comm
test.parallel.test_process_sets_static
test.parallel.test_tensorflow
test.parallel.test_tensorflow2_keras
test.parallel.test_tensorflow_keras
test.parallel.test_xla
This pull request skips 4 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error

♻️ This comment has been updated with latest results.

@EnricoMi
Copy link
Collaborator Author

EnricoMi commented Feb 3, 2022

There was a related PR #3203, which also wanted to change logging levels, but I didn't really understand why.

I like your proposed outline to handle stderr content. I would like to see that in all places, not just one one framework.

Lets leave that for #3203 and have ERROR_QUITE removed here once and for all.

@EnricoMi
Copy link
Collaborator Author

EnricoMi commented Feb 3, 2022

This pull request removes 8 tests.

@maxhgerlach any idea why some of the tests may have disappeared (not even skipped)?

@EnricoMi
Copy link
Collaborator Author

EnricoMi commented Feb 3, 2022

This pull request removes 8 tests.

@maxhgerlach any idea why some of the tests may have disappeared (not even skipped)?

Looks like the HEAD failures did not make it into the test results. But this would also mean that these tests only run against HEAD versions.

Copy link
Collaborator

@maxhgerlach maxhgerlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, there were numerous problems with tf-nightly over the last few days. I looked into them for a bit because at first there appeared to be linker failures related to the changes I introduced with #3128, but those disappeared with a newer nightly build. Seems to be somewhat in turmoil.

I don't know off hand which tests would only be run on head versions.

@EnricoMi EnricoMi merged commit 046c071 into master Feb 4, 2022
@EnricoMi EnricoMi deleted the branch-cmake-stderr branch February 4, 2022 12:11
@EnricoMi EnricoMi mentioned this pull request Apr 19, 2022
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants