(torch/elastic) add fqdn hostname to error printout #66182
Conversation
This pull request was exported from Phabricator. Differential Revision: D31416492
LGTM
Thank you for working on that, @kiukchung - this is great! Two requests:
Thank you!
It's not something we can enable from the command line, unfortunately. The launcher now launches the training script copies via
You could run the script with
If you have additional suggestions, please feel free to send us a PR.
Thank you for clarifying that host/rank printing has nothing to do with the `@record` decorator. Won't this decorator be a problem if the program isn't run under distributed?
It'd help users a lot if https://pytorch.org/docs/stable/distributed.html included a section linking to the various elastic docs like the one above. Otherwise the switch has been made, yet users are somehow expected to know to go searching for the detailed elastic docs without knowing that those docs exist or where to find them.
Please don't tell me it'll print this warning in each process - at times all child processes will fail; multiply that by 1024... ouch!
Just to clarify, I don't intend to obstruct this PR with my questions. Please go ahead and merge it if it looks good to you, as you have added what I asked for - thank you! The related discussion can happen separately.
Force-pushed from 22a1091 to 9008ff2.
I've made a few changes to this PR based on your suggestions, @stas00:

WITH `@record` annotation:

WITHOUT `@record` annotation:
There isn't a great way, since dist doesn't own the "entrypoint" to your training script. This error reporting mechanism, btw, is a bit tangential to distributed (it just so happens to sit under the torch.distributed.elastic.multiprocessing.errors module); it is really a poor man's way of propagating tracebacks between processes using files, since python doesn't natively support exception handling across processes. So it can be used independently of dist.
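To make the idea concrete, here's a toy sketch of that file-based propagation (this is NOT torch's actual implementation - the file name and JSON shape are made up for illustration):

```python
# Toy sketch: the child catches its own exception, writes the formatted
# traceback to a file, and re-raises; the parent reads the file back
# once the child exits non-zero.
import json
import multiprocessing as mp
import os
import tempfile
import traceback


def child(error_file):
    try:
        raise RuntimeError("boom")  # stand-in for a real training failure
    except Exception:
        with open(error_file, "w") as f:
            json.dump({"traceback": traceback.format_exc()}, f)
        raise  # still fail the child process with a non-zero exitcode


if __name__ == "__main__":
    fd, error_file = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    p = mp.Process(target=child, args=(error_file,))
    p.start()
    p.join()
    if p.exitcode != 0:
        with open(error_file) as f:
            # the parent can now attribute the failure and print the traceback
            print(json.load(f)["traceback"])
```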
The decorator is treated like any other new module/feature in pytorch: it is valid as of torch-1.9.0, and to use it we would expect you to be running with torch-1.9.0+.
Force-pushed from 9008ff2 to cc912aa.
That's fantastic, thank you, @kiukchung
Right, so generic software that doesn't know which pytorch version the user may run it with will have to wrap its entrypoint with a conditional decorator.
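For example, a sketch of such a conditional wrapper (assuming the import path from this PR; the no-op fallback is just for illustration):

```python
# Use the real @record on torch>=1.9; degrade to a no-op decorator on
# older versions where the elastic errors module doesn't exist.
try:
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:  # pt < 1.9
    def record(fn):
        return fn


@record
def main():
    ...  # training entrypoint


if __name__ == "__main__":
    main()
```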
Yeah - having thought about this last night - this is an interesting point. We typically operate under the assumption that the end pytorch user owns their training script (at least the main module) and hence knows what pytorch version they will be running the script with. If the script author/owner is different from the script invoker, then it does raise the question of whether the script's dev (knowing that they don't control the run environment) needs to author the script with BC in mind, or perhaps just check the pytorch version at runtime.
A concrete example is the HF Transformers example scripts that many users use as is - while we call those examples, they are used as production code by many users. Surely they can modify these scripts if they want to. But otherwise the script is the same, and it can be invoked in many different ways - w/ dist or w/o - and it dynamically figures out the right thing to do at run time. We also use those scripts as is in the test suite; e.g. I have a whole set of them being called to test the Deepspeed integration, and the test suite should be able to work with more than just pt-1.9.0 - that's why, if I were to use that decorator, I'd have to use a conditional wrapper to make it not fail for pt<1.9. So the workaround will do.

I'm thinking that if files are used to communicate tracebacks, then perhaps the same approach can be used to distribute configuration across nodes? Then each process would know of any nuances... Potentially something for future discussions, and not for 1.10.
Thanks for the insight + example. cc'ing my team here since we are thinking about these production pain points in a new PyTorch companion project, TorchX (https://pytorch.org/torchx/latest/). I'll put some more thought into this and we can brainstorm more on slack - I think this is a good "checklist" to keep in mind when developing new (or maintaining existing) pytorch features.
Force-pushed from cc912aa to 2ac3bcf.
Force-pushed from 2ac3bcf to 7a148f6.
Summary:

Pull Request resolved: #66182

Closes #63174

Does a few things:
1. adds the hostname to the error report
2. moves the "root cause" section to the end (presumably, since the logs are being "tailed", we want the root cause to appear at the end)
3. moves redundant error-info logging to debug
4. makes the border max 60 chars in length and left-justifies the header

NOTE: YOU HAVE TO annotate your main function with `torch.distributed.elastic.multiprocessing.errors.record`, otherwise no traceback is printed (this is because python exception propagation does NOT work out of the box for IPC - hence the extra `record` annotation).

Test Plan:

Sample:

```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_17:37:22
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3296201)
  error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
      return f(*args, **kwargs)
    File "main.py", line 28, in main
      raise RuntimeError(args.throws)
  RuntimeError: foobar
============================================================
```

Reviewed By: cbalioglu, aivanou

Differential Revision: D31416492

fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9
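(For reference, a hypothetical `main.py` consistent with the traceback in the sample above - reconstructed for illustration, not the actual test script:)

```python
# Hypothetical reproducer: the @record-annotated main raises, and the
# elastic error handler prints the host/rank/traceback summary above.
import argparse

from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--throws", default="foobar")
    args = parser.parse_args()
    raise RuntimeError(args.throws)


if __name__ == "__main__":
    main()
```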
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang