(torch/elastic) add fqdn hostname to error printout #66182

Closed
wants to merge 1 commit into from

Commits on Oct 7, 2021

  1. (torch/elastic) add fqdn hostname to error printout (pytorch#66182)

    Summary:
    Pull Request resolved: pytorch#66182
    
    closes pytorch#63174
    
    Does a few things:
    
    1. adds hostname to the error report
    2. moves the "root cause" section to the end (presumably since the logs are being "tailed" we want the root cause to appear at the end)
    3. moves redundant error info logging to debug
    4. caps the border at 60 characters and left-justifies the header
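
The layout these four items describe can be sketched in plain Python. This is only an illustration of the format, not the code in torch/elastic: `format_failure`, `format_report`, and `BORDER_WIDTH` are hypothetical names, and the field set is copied from the sample output in this commit message.

```python
import socket

BORDER_WIDTH = 60  # item 4: border capped at 60 characters


def format_failure(idx, rank, local_rank, exitcode, pid, timestamp, host=None):
    """Format one failure entry; the host defaults to this machine's FQDN (item 1)."""
    host = host or socket.getfqdn()
    return "\n".join([
        f"[{idx}]:",
        f"  time      : {timestamp}",
        f"  host      : {host}",
        f"  rank      : {rank} (local_rank: {local_rank})",
        f"  exitcode  : {exitcode} (pid: {pid})",
    ])


def format_report(name, root_cause, other_failures=()):
    """Assemble the report with the root cause last (item 2)."""
    border = "=" * BORDER_WIDTH
    section = "-" * BORDER_WIDTH
    others = "\n".join(other_failures) if other_failures else "  <NO_OTHER_FAILURES>"
    return "\n".join([
        border,
        f"{name} FAILED",  # left-justified header (item 4)
        section,
        "Failures:",
        others,
        section,
        "Root Cause (first observed failure):",
        root_cause,
        border,
    ])
```

Putting the root cause last means it is the final thing a reader sees when tailing the log, which matches the reasoning given in item 2.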
    
    NOTE: YOU HAVE TO annotate your main function with torch.distributed.elastic.multiprocessing.errors.record, otherwise no traceback is printed (this is because Python exception propagation does NOT work out of the box for IPC - hence the extra record annotation).
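
A minimal sketch of the annotation, assuming a script shaped like the `main.py` in the sample traceback. Only the `record` import path and the `raise RuntimeError(args.throws)` line come from this PR; the `--throws` flag, the `argv` parameter, and the no-op fallback for machines without PyTorch are illustrative assumptions.

```python
import argparse

try:
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumption for illustration only: a no-op stand-in so this sketch
    # still runs where PyTorch is not installed. It records nothing.
    def record(fn):
        return fn


@record  # without this, the worker's traceback does not reach the error report
def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--throws", default="")  # hypothetical flag
    args = parser.parse_args(argv)
    if args.throws:
        raise RuntimeError(args.throws)
    return 0
```

In a real script, `main()` would be called under `if __name__ == "__main__":`. The exception still propagates out of the decorated function, so the worker exits with a non-zero code as before.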
    
    Test Plan:
    Sample output:
    
    ```
    ============================================================
    run_script_path FAILED
    ------------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2021-10-05_17:37:22
      host      : devvm4955.prn0.facebook.com
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 3296201)
      error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
      traceback :
      Traceback (most recent call last):
        File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
          return f(*args, **kwargs)
        File "main.py", line 28, in main
          raise RuntimeError(args.throws)
      RuntimeError: foobar
    
    ============================================================
    ```
    
    Reviewed By: cbalioglu, aivanou
    
    Differential Revision: D31416492
    
    fbshipit-source-id: 7490e1a90b8083fd38329f321cc09ab8b8713b26
    Kiuk Chung authored and facebook-github-bot committed Oct 7, 2021
    7a148f6