Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix torchelastic import error due to unsupported signal SIGKILL on Windows #88250

Closed

Conversation

awaelchli
Copy link
Contributor

@awaelchli awaelchli commented Nov 1, 2022

Fixes #85427

The signal signal.SIGKILL is not supported on Windows. Since this was included in a type hint, an error is raised directly at import time of distributed/elastic. The fix in this PR proposes the default None and choosing the SIGKILL default only at runtime.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @peterjc123 @mszhanyi @skyline75489 @nbcsm

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 1, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88250

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ No Failures

As of commit b47d28a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Nov 1, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: awaelchli / name: Adrian Wälchli (f484487)

@awaelchli awaelchli changed the title Fix unsupported signal SIGKILL on Windows Fix torchelastic import error due to unsupported signal SIGKILL on Windows Nov 1, 2022
super().__init__()
self._file_path = file_path
self.signal = signal
self.signal = _signal.SIGKILL if signal is None else signal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be used instead for windows support?

def _get_kill_signal() -> signal.Signals:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh awesome! I didn't know this exists. I think we should definitely use this :) Updated!

@H-Huang H-Huang added oncall: distributed Add this issue/PR to distributed oncall triage queue oncall: r2p Add this issue/PR to R2P (elastic) oncall triage queue labels Nov 2, 2022
@H-Huang H-Huang added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Nov 2, 2022
@kit1980 kit1980 added the module: windows Windows support for PyTorch label Nov 4, 2022
Copy link
Member

@H-Huang H-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the changes @awaelchli ! Just to confirm, have you validated that this works on Windows? It seems that there is no CI support for distributed/ on Windows so this needs to be manually validated

@awaelchli
Copy link
Contributor Author

awaelchli commented Nov 5, 2022

@H-Huang Yes. This is how I have verified it:

Windows 11. Compiled pytorch from source (made sure USE_DISTRIBUTED=1).

(pytorch) C:\Users\Adrian\Documents\pytorch>pip freeze

astunparse==1.6.3
brotlipy==0.7.0
certifi==2022.9.24
cffi @ file:///C:/Windows/Temp/abs_6808y9x40v/croots/recipe/cffi_1659598653989/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
cryptography @ file:///C:/b/abs_36x9ifdcl4/croot/cryptography_1665612655344/work
future @ file:///C:/ci_310/future_1642081234309/work
idna @ file:///C:/b/abs_bdhbebrioa/croot/idna_1666125572046/work
mkl-fft==1.3.1
mkl-random @ file:///C:/ci_310/mkl_random_1643050563308/work
mkl-service==2.4.0
mpmath==1.2.1
networkx==3.0b1
numpy @ file:///C:/b/abs_53f_dbvhzc/croot/numpy_and_numpy_base_1665773185489/work
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work
PySocks @ file:///C:/ci_310/pysocks_1642089375450/work
PyYAML==6.0
requests @ file:///C:/ci/requests_1657735340829/work
six @ file:///tmp/build/80754af9/six_1644875935023/work
sympy==1.11.1
-e git+https://github.com/awaelchli/pytorch.git@c40033be162db0f94d37e7ccbd2a89d67f8b8e47#egg=torch
typing_extensions @ file:///C:/Windows/TEMP/abs_dd2d0moa85/croots/recipe/typing_extensions_1659638831135/work
urllib3 @ file:///C:/b/abs_a8_3vfznn_/croot/urllib3_1666298943664/work
win-inet-pton @ file:///C:/ci_310/win_inet_pton_1642658466512/work
wincertstore==0.2
(pytorch) C:\Users\Adrian\Documents\pytorch>python -c "import torch; print(torch.__version__)"

1.14.0a0+gitb47d28a

Making sure it reproduces:

(pytorch) C:\Users\Adrian\Documents\pytorch>git checkout master
(pytorch) C:\Users\Adrian\Documents\pytorch>python -c "import torch.distributed.run"

Raises:
AttributeError: module 'signal' has no attribute 'SIGKILL'. Did you mean: 'SIGILL'?
as reported in the linked issue.

Verify fix:

(pytorch) C:\Users\Adrian\Documents\pytorch>git checkout bugfix/signal-sigkill-windows
(pytorch) C:\Users\Adrian\Documents\pytorch>python -c "import torch.distributed.run"

Import succeeds.

@awaelchli
Copy link
Contributor Author

@H-Huang It looks like there is another PR by @malfet #88522

@awaelchli
Copy link
Contributor Author

@H-Huang #88522 landed and so there is no longer a need for this PR. I wish there was more coordination here to avoid duplicated work.

@awaelchli awaelchli closed this Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: windows Windows support for PyTorch oncall: distributed Add this issue/PR to distributed oncall triage queue oncall: r2p Add this issue/PR to R2P (elastic) oncall triage queue open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

torchrun AttributeError caused by file_based_local_timer on Windows
4 participants