Fix torchelastic import error due to unsupported signal SIGKILL on Windows #88250
  super().__init__()
  self._file_path = file_path
- self.signal = signal
+ self.signal = _signal.SIGKILL if signal is None else signal
Should this be used instead for Windows support?
def _get_kill_signal() -> signal.Signals:
Oh awesome! I didn't know this exists. I think we should definitely use this :) Updated!
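For readers unfamiliar with the helper referenced above, here is a minimal sketch of what a platform-aware kill-signal lookup might look like. The function name `get_kill_signal_sketch` is hypothetical, and the choice of `CTRL_C_EVENT` on Windows is an assumption made for illustration. It is not copied from the PyTorch source.

```python
import signal
import sys


def get_kill_signal_sketch() -> signal.Signals:
    """Return a signal that can be used to stop a process on this platform.

    Hypothetical helper for illustration only.
    """
    if sys.platform == "win32":
        # signal.SIGKILL is not defined on Windows, so it is only
        # referenced in the non-Windows branch below.
        # CTRL_C_EVENT is an assumed substitute here.
        return signal.CTRL_C_EVENT
    return signal.SIGKILL
```

Because the attribute lookup happens inside the function body, importing a module that defines this helper does not fail on Windows; the platform check runs only when the helper is called.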
LGTM, thanks for the changes @awaelchli! Just to confirm, have you validated that this works on Windows? It seems that there is no CI support for distributed/ on Windows, so this needs to be validated manually.
@H-Huang Yes. This is how I verified it on Windows 11: I compiled PyTorch from source (making sure USE_DISTRIBUTED=1).
Making sure it reproduces:
Raises:
Verify fix:
Import succeeds.
Fixes #85427
The signal `signal.SIGKILL` is not supported on Windows. Since this was included in a type hint, an error is raised directly at import time of `distributed/elastic`. The fix in this PR proposes the default None and choosing the SIGKILL default only at runtime.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @peterjc123 @mszhanyi @skyline75489 @nbcsm