Use BFloat16 in distributed quantization when supported by NCCL #125113
Conversation
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/125113.
CI status: ✅ no failures as of commit 4694f99 with merge base 91a4740.
Compare: 33ee903 → 4694f99
FYI, I'm no longer part of the project, so I can't approve changes.
```diff
@@ -69,15 +69,16 @@ at::Tensor _float_to_bfloat16_cuda(const at::Tensor& input) {
   auto output = at::empty(
       {nrows, output_columns},
-      input.options().dtype(at::kHalf)); // at::kHalf
+#if HAS_NCCL_BF16_DATATYPE
+      input.options().dtype(at::kBFloat16));
```
FYI, I don't think you need to do this one in the preprocessor; you should be able to do it like `input.options().dtype(HAS_NCCL_BF16_DATATYPE ? at::kBFloat16 : at::kHalf));`.
`HAS_NCCL_BF16_DATATYPE` is a macro, and I think it's better to format the code like this so that the old branch is easy to identify and remove in the future.
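For illustration, a minimal sketch of the full pattern being described, assuming the old `at::kHalf` line moves into an `#else` branch (only the `#if` side is visible in the hunk above):

```cpp
auto output = at::empty(
    {nrows, output_columns},
#if HAS_NCCL_BF16_DATATYPE
    // New branch: allocate a bf16 output tensor when NCCL has a native
    // bfloat16 datatype.
    input.options().dtype(at::kBFloat16));
#else
    // Old branch, kept visually separate so it is easy to find and delete
    // once bf16-capable NCCL can be assumed everywhere.
    input.options().dtype(at::kHalf));
#endif
```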
```cpp
      reinterpret_cast<uint16_t*>(output.mutable_data_ptr<at::Half>())
#endif
  );
  C10_CUDA_KERNEL_LAUNCH_CHECK();
```
What does `C10_CUDA_KERNEL_LAUNCH_CHECK` do? What's the purpose of uncommenting it?
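For context: `C10_CUDA_KERNEL_LAUNCH_CHECK` is PyTorch's standard post-launch error check from `c10/cuda/CUDAException.h`; it essentially runs `C10_CUDA_CHECK(cudaGetLastError())`, so a bad launch raises a C++ exception right away instead of surfacing later at some sync point. A minimal usage sketch, with a hypothetical `scale_kernel` (not part of this PR):

```cpp
#include <cuda_runtime.h>
#include <c10/cuda/CUDAException.h>

// Hypothetical kernel used only for illustration.
__global__ void scale_kernel(float* data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

void scale(float* data, float factor, int n, cudaStream_t stream) {
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  scale_kernel<<<blocks, threads, 0, stream>>>(data, factor, n);
  // Checks cudaGetLastError() and throws on launch failures (invalid
  // configuration, missing kernel image, etc.) -- which is why the check
  // follows every kernel launch in the codebase.
  C10_CUDA_KERNEL_LAUNCH_CHECK();
}
```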
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0–4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Use BFloat16 in distributed quantization when supported by NCCL (pytorch#125113)

This PR enables BFloat16 in torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.

Pull Request resolved: pytorch#125113
Approved by: https://github.com/kwen2501
This PR enables BFloat16 in torch/csrc/distributed/c10d/quantization/quantization_gpu.cu.
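Background on the guard used throughout the change: NCCL added a native `ncclBfloat16` datatype in release 2.10, and `HAS_NCCL_BF16_DATATYPE` gates on that. A sketch of how such a guard can be derived from NCCL's version macros (the exact definition in PyTorch may differ, e.g. it also depends on the CUDA bf16 types being available):

```cpp
#include <nccl.h>  // provides NCCL_MAJOR / NCCL_MINOR

// Sketch: enable the bf16 path only when the linked NCCL is >= 2.10,
// the release that introduced ncclBfloat16.
#if defined(NCCL_MAJOR) && \
    (NCCL_MAJOR > 2 || (NCCL_MAJOR == 2 && NCCL_MINOR >= 10))
#define HAS_NCCL_BF16_DATATYPE 1
#else
#define HAS_NCCL_BF16_DATATYPE 0
#endif
```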
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k