(torch/elastic) add fqdn hostname to error printout (#66182) #66662

Merged
merged 1 commit into from Oct 15, 2021

Commits on Oct 14, 2021

  1. (torch/elastic) add fqdn hostname to error printout (#66182)

    Summary:
    Pull Request resolved: #66182
    
    closes #63174
    
    Does a few things:
    
    1. adds hostname to the error report
    2. moves the "root cause" section to the end (presumably since the logs are being "tailed" we want the root cause to appear at the end)
    3. moves redundant error info logging to debug
    4. limits the border to a maximum of 60 characters and left-justifies the header
    
    NOTE: You MUST annotate your main function with torch.distributed.elastic.multiprocessing.errors.record, otherwise no traceback is printed. Python exception propagation does NOT work out of the box across IPC, hence the extra record annotation.
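
To illustrate why the annotation matters, here is a minimal stdlib-only sketch of the general idea: a decorator catches the exception in the worker, writes the traceback to an error file, and re-raises, so a parent process can read the file later. This is not PyTorch's implementation. The names `record_sketch`, `run_and_collect`, and the `SKETCH_ERROR_FILE` environment variable are hypothetical and exist only for this example.

```python
import json
import os
import tempfile
import traceback
from functools import wraps

# Hypothetical environment variable name, used only in this sketch.
ERROR_FILE_ENV = "SKETCH_ERROR_FILE"


def record_sketch(fn):
    """Simplified stand-in for a record-style decorator (not PyTorch's code)."""

    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            error_file = os.environ.get(ERROR_FILE_ENV)
            if error_file:
                # Persist the traceback so another process can read it later.
                with open(error_file, "w") as f:
                    json.dump({"traceback": traceback.format_exc()}, f)
            raise  # the worker still fails; only the reporting changes

    return wrapper


@record_sketch
def main():
    raise RuntimeError("foobar")


def run_and_collect():
    """Run main() and return the traceback text recovered from the error file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "error.json")
        os.environ[ERROR_FILE_ENV] = path
        try:
            main()
        except RuntimeError:
            pass  # a real launcher would observe a non-zero exit code here
        finally:
            os.environ.pop(ERROR_FILE_ENV, None)
        with open(path) as f:
            return json.load(f)["traceback"]
```

In an actual training script, the equivalent step is to import `record` from `torch.distributed.elastic.multiprocessing.errors` and place `@record` on the entry-point function, as the note above describes.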
    
    Test Plan:
    Sample output:
    
    ```
    ============================================================
    run_script_path FAILED
    ------------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2021-10-05_17:37:22
      host      : devvm4955.prn0.facebook.com
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 3296201)
      error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
      traceback :
      Traceback (most recent call last):
        File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
          return f(*args, **kwargs)
        File "main.py", line 28, in main
          raise RuntimeError(args.throws)
      RuntimeError: foobar
    
    ============================================================
    ```
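
The layout of the sample above can be approximated with a short sketch. It assumes the hostname comes from the standard library's `socket.getfqdn()` and that field names are padded to 10 characters, which matches the sample but is not confirmed to be how the PR does it. The function `format_failure` and its parameters are hypothetical.

```python
import socket

WIDTH = 60  # maximum border length described in the summary


def format_failure(script, rank, local_rank, exitcode, pid, timestamp, host=None):
    """Build a report resembling the sample: 60-char borders, root cause last."""
    # Assumption: the fully qualified hostname is looked up via socket.getfqdn().
    host = host or socket.getfqdn()
    border = "=" * WIDTH
    divider = "-" * WIDTH
    fields = [
        ("time", timestamp),
        ("host", host),
        ("rank", f"{rank} (local_rank: {local_rank})"),
        ("exitcode", f"{exitcode} (pid: {pid})"),
    ]
    lines = [
        border,
        f"{script} FAILED",  # header is left-justified, not centered
        divider,
        "Failures:",
        "  <NO_OTHER_FAILURES>",
        divider,
        "Root Cause (first observed failure):",  # placed last for tailed logs
        f"[{rank}]:",
    ]
    # Pad each field name to 10 characters so the colons line up.
    lines += [f"  {name:<10}: {value}" for name, value in fields]
    lines.append(border)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_failure("run_script_path", 0, 0, 1, 3296201, "2021-10-05_17:37:22"))
```

Putting the root cause after the other failures means that someone tailing the log sees the most relevant entry at the bottom of the screen.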
    
    Reviewed By: cbalioglu, aivanou
    
    Differential Revision: D31416492
    
    fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9
    Kiuk Chung committed Oct 14, 2021
    ebfe8c9