[dnl] add NCCL/PT debug log for S413673 #125085

Open: minsii wants to merge 1 commit into base: main

minsii (Contributor) commented Apr 27, 2024

Test Plan:
The smoke test with the NCCL unit test cannot reproduce the segfault with dynamic register and len 213942272.

IFNAME=eth2 HOSTS="rtptest908.pci1,rtptest693.pci1" ENVS="NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL,ALLOC" buck2 run fbcode//mode/opt fbsource//third-party/nccl-exp/v2.18.3-1/src/ctran/tests:ctran_dist_allgather
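
For readers without the internal launcher, here is a minimal sketch of how the same NCCL debug settings could be applied from Python. It assumes the `ENVS` value is a semicolon-separated list of `KEY=VALUE` pairs, as it appears in the command above. `parse_envs` is a hypothetical helper, not part of NCCL, PyTorch, or the test harness, and how the launcher actually consumes `ENVS` is not shown in this PR.

```python
import os


def parse_envs(spec: str) -> dict[str, str]:
    """Split 'K1=V1;K2=V2' into a dict (hypothetical helper).

    Values may contain commas, e.g. NCCL_DEBUG_SUBSYS=INIT,COLL,ALLOC.
    """
    env: dict[str, str] = {}
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';'
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        env[key] = value
    return env


# Same settings as the ENVS string in the test plan.
nccl_env = parse_envs("NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL,ALLOC")

# Export them before this process initializes NCCL, so the debug
# logging configuration is in place when communicators are created.
os.environ.update(nccl_env)
```

`NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables. The comma inside the `NCCL_DEBUG_SUBSYS` value is why the sketch splits pairs on `;` only.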

P1223852264

Differential Revision: D56659330

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

pytorch-bot bot commented Apr 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125085

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cf676b5 with merge base 368f521:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the labels "oncall: distributed" (Add this issue/PR to distributed oncall triage queue) and "release notes: distributed (c10d)" (release notes category) on Apr 27, 2024