
(torch/elastic) add fqdn hostname to error printout #66182

Closed
wants to merge 1 commit

Conversation

kiukchung
Collaborator

@kiukchung kiukchung commented Oct 6, 2021

Summary:
closes #63174

Does a few things:

  1. adds hostname to the error report
  2. moves the "root cause" section to the end (presumably since the logs are being "tailed" we want the root cause to appear at the end)
  3. moves redundant error info logging to debug
  4. makes the border at most 60 chars long and left-justifies the header

NOTE: YOU HAVE TO annotate your main function with torch.distributed.elastic.multiprocessing.errors.record, otherwise no traceback is printed (this is because python exception propagation does NOT work out of the box for IPC - hence the extra record annotation).
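
For reference, a minimal sketch of what the annotation looks like in a training script (the script layout and the deliberately raised error here are just placeholders mirroring the sample below):

```python
from torch.distributed.elastic.multiprocessing.errors import record

@record  # records any uncaught exception to this rank's error file
def main():
    # training logic goes here; raising mimics the failing sample in the Test Plan
    raise RuntimeError("foobar")

if __name__ == "__main__":
    main()
```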

Test Plan:
Sample

```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_17:37:22
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3296201)
  error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
  traceback :
  Traceback (most recent call last):
    File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
      return f(*args, **kwargs)
    File "main.py", line 28, in main
      raise RuntimeError(args.throws)
  RuntimeError: foobar

============================================================
```

Differential Revision: D31416492

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

@pytorch-probot

pytorch-probot bot commented Oct 6, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/kiukchung/pytorch/blob/7a148f63453ea11374af72f691afe231c0528975/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

| Workflow | Labels | Status |
| --- | --- | --- |
| linux-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla | ✅ triggered |
| linux-vulkan-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers | ✅ triggered |
| linux-xenial-py3.6-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx | ✅ triggered |
| linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| periodic-pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/slow, ciflow/slow-gradcheck | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/win | ✅ triggered |
| libtorch-linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| parallelnative-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| puretorch-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |

You can add a comment to the PR and tag @pytorchbot with the following commands:
```
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
```

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot
Contributor

facebook-github-bot commented Oct 6, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 7a148f6 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31416492

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 6, 2021
Contributor

@cbalioglu cbalioglu left a comment


LGTM

@stas00
Contributor

stas00 commented Oct 6, 2021

Thank you for working on that, @kiukchung - this is great!

Two requests:

  1. Could we please add a cl arg then to enable torch.distributed.elastic.multiprocessing.errors.record and not need to modify the code?

    It'd make the code backward compatible for when an older pytorch is used where torch.distributed.elastic.multiprocessing.errors.record doesn't exist.

  2. Could we please document this feature?

Thank you!

@kiukchung
Collaborator Author

Could we please add a cl arg then to enable torch.distributed.elastic.multiprocessing.errors.record and not need to modify the code?

It's not something we can enable from the command line, unfortunately. The launcher now launches copies of the training script via subprocess.Popen, so you'd have to actually do this in the training script's main function. If you don't have access to it, then the tracebacks won't be available, but it does not affect correctness. Instead it'll print something like:

```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_17:37:22
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3296201)
  error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
  traceback :
  Process failed with exitcode 1
============================================================
```
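
To illustrate the subprocess.Popen point above, here is a small standalone sketch (illustrative, not torch code): an exception raised in a child process started with Popen never reaches the parent as a Python exception; the parent only observes a nonzero exit code, which is why the error-file/@record mechanism is needed to recover the traceback.

```python
import subprocess
import sys

# The child raises, but the exception object and traceback stay in the child;
# the parent can only observe the exit code (and whatever the child printed to stderr).
proc = subprocess.Popen([sys.executable, "-c", "raise RuntimeError('foobar')"])
proc.wait()
print(f"child exited with returncode={proc.returncode}")  # prints 1; no traceback object here
```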

You could run the script with $ python -m torch.distributed.run --run_path script.py, which uses the runpy module to run the script from within the main interpreter. In that case it'll annotate the wrapper function, but the downside is that you'll get a longer stack trace:

```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_18:02:08
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3417837)
  error_file: /home/kiuk/tmp/elastic/none_mi6drml2/attempt_0/0/error.json
  traceback :
  Traceback (most recent call last):
    File "/tmp/jetter.myl31u5c/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
      return f(*args, **kwargs)
    File "/tmp/jetter.myl31u5c/torch/distributed/run.py", line 673, in run_script_path
      runpy.run_path(sys.argv[0], run_name="__main__")
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 265, in run_path
      return _run_module_code(code, init_globals, run_name,
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 97, in _run_module_code
      _run_code(code, mod_globals, init_globals,
    File "/usr/local/fbcode/platform009/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "main.py", line 34, in <module>
      main()
    File "main.py", line 28, in main
      raise RuntimeError(args.throws)
  RuntimeError: foobar
```

Could we please document this feature?

  1. https://pytorch.org/docs/stable/elastic/errors.html
  2. the launcher has been fitted with a warning message with detailed information on how to add @record to their main:
```
CHILD PROCESS FAILED WITH NO ERROR_FILE

Child process 3421311 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
```

If you have additional suggestions, please feel free to send us a PR.

@stas00
Contributor

stas00 commented Oct 6, 2021

Could we please add a cl arg then to enable torch.distributed.elastic.multiprocessing.errors.record and not need to modify the code?
[...]

Thank you for clarifying that host/rank printing has nothing to do with record decoration.

Won't this decorator be a problem if the program isn't run under distributed?

  • Should there be a way to activate it dynamically if the program knows it's being run under dist?
  • And additionally there's a backward compatibility issue: how do you add a decorator that may or may not exist depending on the version of pytorch used by the user?

Could we please document this feature?

1. https://pytorch.org/docs/stable/elastic/errors.html

It'd help users a lot if https://pytorch.org/docs/stable/distributed.html included a section linking to the various elastic docs like the above. Otherwise the switch was made, and it feels like users are expected to somehow know to search for the detailed elastic docs, without knowing they exist or where to find them.

  2. the launcher has been fitted with a warning message with detailed information on how to add @record to their main:

```
CHILD PROCESS FAILED WITH NO ERROR_FILE

Child process 3421311 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
```

Please don't tell me it'll print this warning on each process, since at times all child processes will fail - multiply that by 1024... ouch!

@stas00
Contributor

stas00 commented Oct 6, 2021

Just to clarify, I don't intend to obstruct this PR with my questions. Please go ahead and merge it if it looks good to you as you have added what I asked for - thank you! - the related discussion can happen separately.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31416492

@kiukchung
Collaborator Author

I've made a few changes to this PR based on your suggestions @stas00

  1. Documented the error summary behavior and @record annotation in the distributed.run docstring with a link to https://pytorch.org/docs/stable/elastic/errors.html.
  2. Condensed the CHILD PROCESS FAILED WITH NO ERROR_FILE ... warning block to a single line and made it a log.info (so with the default LOGLEVEL=WARN it won't really be printed out)
  3. When there is no error file, the error summary will link, in the "traceback" row, to the documentation on how to enable this:

WITH @record annotation

```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_23:02:55
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 823868)
  error_file: /home/kiuk/tmp/elastic/none__x9t_83m/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/tmp/jetter.n0nojxzn/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "main.py", line 28, in main
      raise RuntimeError(args.throws)
  RuntimeError: foobar
```

WITHOUT @record annotation

```
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2021-10-05_22:54:15
  host      : devvm4955.prn0.facebook.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 754251)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_22:54:15
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 754250)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

  4. I've fixed the mis-formatted OMP warning message:
```
# BEFORE:
WARNING:__main__:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************

# AFTER
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
```

Should there be a way to activate it dynamically if the program knows it's being run under dist?

There isn't a great way, since dist doesn't own the "entrypoint" to your training script. This error reporting mechanism is, by the way, a bit tangential to distributed (it just happens to sit under the torch.distributed.elastic.multiprocessing.errors module); it is really a poor man's way of propagating tracebacks between processes using files, since python doesn't natively support exception propagation across processes. So it could be used independently of dist.
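
As a rough illustration of that file-based propagation (a simplified sketch, not the actual torch.distributed.elastic implementation; the file path and JSON shape are made up for this example):

```python
import json
import traceback

ERROR_FILE = "/tmp/error.json"  # hypothetical path; torchelastic uses a per-rank error file

def child_main() -> None:
    """Runs in the child process: record the traceback to a file before exiting nonzero."""
    try:
        raise RuntimeError("foobar")
    except Exception:
        with open(ERROR_FILE, "w") as f:
            json.dump({"traceback": traceback.format_exc()}, f)
        raise  # still fail so the parent sees a nonzero exit code

def parent_summarize() -> None:
    """Runs in the parent: read the recorded traceback and include it in the error summary."""
    with open(ERROR_FILE) as f:
        print(json.load(f)["traceback"])
```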

And additionally there's a backward compatibility issue: how do you add a decorator that may or may not exist depending on the version of pytorch used by the user?

The decorator is treated like any other new module/feature in pytorch. It is available as of torch-1.9.0; to use it we would expect you to be running torch-1.9.0+.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31416492

@stas00
Contributor

stas00 commented Oct 6, 2021

That's fantastic, thank you, @kiukchung

And additionally there's a backward compatibility issue: how do you add a decorator that may or may not exist depending on the version of pytorch used by the user?

The decorator is treated like any other new module/feature in pytorch. It is available as of torch-1.9.0; to use it we would expect you to be running torch-1.9.0+.

Right, so generic software that doesn't know which pytorch version the user will run it with will have to wrap @record in a conditional decorator to ensure it works with either pytorch version. https://stackoverflow.com/a/10724898/9201239
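
A sketch of that conditional-decorator idea (it only assumes the import fails on pytorch versions that don't ship the elastic errors module):

```python
try:
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # older pytorch (<1.9): fall back to a no-op decorator so the same script still runs
    def record(fn):
        return fn

@record
def main():
    ...  # training logic

if __name__ == "__main__":
    main()
```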

@kiukchung
Collaborator Author

yeah - having thought about this last night, this is an interesting point. We typically operate under the assumption that the end pytorch user owns their training script (at least the main module) and hence knows what pytorch version they will be running the script with. If the script author/owner is different from the script invoker, then it does raise the question of whether the script's dev (knowing that they don't control the run environment) needs to author the script with BC in mind, or perhaps just check torch.__version__ and fail fast if the versions don't match.
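
A minimal sketch of the fail-fast alternative (the version bound is taken from the discussion above; the check assumes the third-party packaging library is available):

```python
import torch
from packaging import version  # commonly available alongside pytorch installs

MIN_TORCH = "1.9.0"  # first release shipping torch.distributed.elastic.multiprocessing.errors.record

if version.parse(torch.__version__) < version.parse(MIN_TORCH):
    raise RuntimeError(
        f"this script requires torch>={MIN_TORCH} for @record error reporting, "
        f"but found torch=={torch.__version__}"
    )
```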

@stas00
Contributor

stas00 commented Oct 6, 2021

A concrete example is the HF Transformers example scripts, which many users use as is - while we call those examples, they are used as production code by many users. Surely users can modify these scripts if they want to.

But the script is otherwise the same, and it can be invoked in many different ways - with dist or without - and it dynamically figures out the right thing to do at run time.

We also use those scripts as is in the test suite; e.g. I have a whole set of these being called to test the Deepspeed integration, and the test suite should work with more than just pt-1.9.0. That's why, if I were to use that decorator, I'd have to use a conditional wrapper so it doesn't fail for pt<1.9. So the workaround would do.

I'm thinking that if files are used to communicate tracebacks, then perhaps the same approach can be used to distribute configuration across nodes? Then each process would know of any nuances... Potentially something for future discussions and not for 1.10.

@kiukchung
Collaborator Author

thanks for the insight + example. cc'ing my team here since we are thinking about these production pain points in a new PyTorch companion project, TorchX (https://pytorch.org/torchx/latest/). I'll put some more thought into this and we can brainstorm more on slack - I think this is a good "checklist" to keep in mind when developing new (or maintaining existing) pytorch features.

cc @d4l3k @aivanou @dbish

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31416492

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31416492

@kiukchung kiukchung added this to the 1.10.0 milestone Oct 14, 2021
kiukchung pushed a commit that referenced this pull request Oct 14, 2021
malfet pushed a commit that referenced this pull request Oct 15, 2021