
(torch/elastic) add fqdn hostname to error printout (#66182) #66662

Merged
1 commit merged on Oct 15, 2021

Conversation

@kiukchung (Collaborator) commented Oct 14, 2021

Summary:
Pull Request resolved: #66182

closes #63174

Does a few things:

1. adds the hostname to the error report (see the hostname-resolution sketch after this list)
2. moves the "root cause" section to the end (since the logs are typically being "tailed", we want the root cause to appear last)
3. moves redundant error-info logging to debug level
4. caps the border at 60 characters and left-justifies the header

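For item 1, a minimal sketch (not the actual PR diff) of how a fully qualified hostname can be resolved in Python; the helper name `get_failure_host` is made up for illustration and the standard-library `socket.getfqdn()` call does the work:

```python
import socket

def get_failure_host() -> str:
    # Resolve the FQDN of the node the failing worker ran on; fall back to
    # the short hostname if FQDN resolution yields nothing useful.
    fqdn = socket.getfqdn()
    return fqdn or socket.gethostname()
```
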
NOTE: you HAVE TO annotate your main function with `torch.distributed.elastic.multiprocessing.errors.record`, otherwise no traceback is printed (this is because Python exception propagation does NOT work out of the box for IPC; hence the extra `record` annotation).
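
A minimal usage sketch of the required annotation (the script name `main.py` and the raised error mirror the sample output below; the actual training logic is omitted):

```python
# main.py -- launched per rank, e.g. via torch.distributed.run
from torch.distributed.elastic.multiprocessing.errors import record

@record  # captures any uncaught exception into the error file so the agent can render the summary
def main():
    # ... set up process group, model, training loop ...
    raise RuntimeError("foobar")  # illustrative failure, matches the sample output

if __name__ == "__main__":
    main()
```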

Test Plan:
Sample output:

```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_17:37:22
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3296201)
  error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
  traceback :
  Traceback (most recent call last):
    File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
      return f(*args, **kwargs)
    File "main.py", line 28, in main
      raise RuntimeError(args.throws)
  RuntimeError: foobar

============================================================
```

Reviewed By: cbalioglu, aivanou

Differential Revision: D31416492

fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9
@pytorch-probot

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/ebfe8c96d631f1cb2b428fb2c62e729478d348d5/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

| Workflows | Labels (bold = enabled) | Status |
| --- | --- | --- |
| **Triggered Workflows** | | |
| linux-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla | ✅ triggered |
| linux-bionic-py3.8-gcc9-coverage | ciflow/all, ciflow/coverage, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/win | ✅ triggered |
| **Skipped Workflows** | | |
| libtorch-linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| parallelnative-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| puretorch-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |
| win-vs2019-cuda10.2-py3 | ciflow/all, ciflow/cuda, ciflow/win | 🚫 skipped |

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot (Contributor) commented Oct 14, 2021

💊 CI failures summary and remediations

As of commit ebfe8c9 (more details on the Dr. CI page):


  • 5/5 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (1/3)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/dimensions.py
Auto-merging .circleci/cimodel/data/dimensions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_data.py
Auto-merging .circleci/cimodel/data/binary_build_data.py
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/set-environment-variables.yml
Auto-merging .azure_pipelines/job_templates/set-environment-variables.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/prepare-build-template.yml
Auto-merging .azure_pipelines/job_templates/prepare-build-template.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (2/3)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/dimensions.py
Auto-merging .circleci/cimodel/data/dimensions.py
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/binary_build_data.py
Auto-merging .circleci/cimodel/data/binary_build_data.py
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/set-environment-variables.yml
Auto-merging .azure_pipelines/job_templates/set-environment-variables.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/prepare-build-template.yml
Auto-merging .azure_pipelines/job_templates/prepare-build-template.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (3/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-10-14T23:03:03.6567076Z RuntimeError:
2021-10-14T23:03:03.0384859Z Author: PyTorch Team
2021-10-14T23:03:03.0385351Z Author-email: packages@pytorch.org
2021-10-14T23:03:03.0385839Z License: BSD-3
2021-10-14T23:03:03.0386349Z Location: /opt/conda/lib/python3.6/site-packages
2021-10-14T23:03:03.0386998Z Requires: dataclasses, typing-extensions
2021-10-14T23:03:03.0387514Z Required-by: 
2021-10-14T23:03:03.0588034Z + python check_backward_compatibility.py --existing-schemas nightly_schemas.txt
2021-10-14T23:03:03.6565088Z Traceback (most recent call last):
2021-10-14T23:03:03.6566201Z   File "check_backward_compatibility.py", line 155, in <module>
2021-10-14T23:03:03.6566719Z     s = parse_schema(line.strip())
2021-10-14T23:03:03.6567076Z RuntimeError: 
2021-10-14T23:03:03.6567591Z Unknown custom class type cuda.Stream. Please ensure it is registered.:
2021-10-14T23:03:03.6569006Z cuda::default_stream.device(Device? device) -> (__torch__.torch.classes.cuda.Stream)
2021-10-14T23:03:03.6569773Z                                                                              ~~~~~~ <--- HERE
2021-10-14T23:03:03.6570008Z 
2021-10-14T23:03:03.7572197Z + cleanup
2021-10-14T23:03:03.7572840Z + retcode=1
2021-10-14T23:03:03.7573265Z + set +x
2021-10-14T23:03:03.7573605Z =================== sccache compilation log ===================
2021-10-14T23:03:03.7765947Z =========== If your build fails, please take a look at the log above for possible reasons ===========
2021-10-14T23:03:03.7850341Z Compile requests                      0

2 failures not recognized by patterns:

| Job | Step | Action |
| --- | --- | --- |
| GitHub Actions Lint / quick-checks | Ensure no trailing spaces | 🔁 rerun |
| GitHub Actions Lint / flake8-py3 | Fail if there were any warnings | 🔁 rerun |

This comment was automatically generated by Dr. CI.

@facebook-github-bot added the `oncall: distributed` label (Add this issue/PR to distributed oncall triage queue) on Oct 14, 2021