
Integrate Triton row- and block-wise FP8 GEMM into LLM inference #2547

Open · wants to merge 2 commits into base: main
Conversation

choutim (Contributor) commented Apr 29, 2024

Summary:
Integrate the Triton row-/block-wise FP8 GEMM kernels into LLM inference.

Subsequent commits will add activation_scale_ub and fused row-wise quantization.

Differential Revision: D56503139
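For context, a minimal PyTorch reference of what a row-wise FP8 GEMM computes: quantize each row of the operands to float8 with a per-row scale, multiply, then rescale the output. The Triton kernels in this PR fuse and accelerate this pattern; the helper names below (`quantize_fp8_row`, `matmul_fp8_row_reference`) are illustrative only, not the API added here.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_fp8_row(x: torch.Tensor):
    # Per-row scale so the largest |value| in each row maps to FP8_MAX.
    row_max = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = row_max / FP8_MAX
    xq = (x / scale).to(torch.float8_e4m3fn)
    return xq, scale.squeeze(-1)

def matmul_fp8_row_reference(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a: [M, K], b: [N, K]; returns [M, N] in the input dtype.
    aq, a_scale = quantize_fp8_row(a)
    bq, b_scale = quantize_fp8_row(b)
    # Reference path: upcast and rescale; a real kernel does the matmul in FP8.
    out = aq.to(torch.float32) @ bq.to(torch.float32).T
    return (out * a_scale[:, None].float() * b_scale[None, :].float()).to(a.dtype)

a = torch.randn(16, 64, dtype=torch.bfloat16)
b = torch.randn(32, 64, dtype=torch.bfloat16)
print(matmul_fp8_row_reference(a, b).shape)  # torch.Size([16, 32])
```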

Summary:
Add FP8 row-/block-wise GEMM kernels with tests and benchmarks. The benchmark will be registered with TritonBench in a separate PR.
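Block-wise quantization differs from row-wise only in scale granularity: one scale per tile of the matrix instead of one per row. A rough sketch of the quantization step, again with illustrative names rather than the kernels added in this PR (assumes dimensions divisible by the block size):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_fp8_block(x: torch.Tensor, block: int = 128):
    # One scale per [block, block] tile; x's dimensions must divide evenly.
    m, k = x.shape
    xq = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    scale = torch.empty(m // block, k // block, dtype=torch.float32)
    for i in range(0, m, block):
        for j in range(0, k, block):
            tile = x[i:i + block, j:j + block]
            s = tile.abs().max().clamp(min=1e-12) / FP8_MAX
            scale[i // block, j // block] = s
            xq[i:i + block, j:j + block] = (tile / s).to(torch.float8_e4m3fn)
    return xq, scale

xq, scale = quantize_fp8_block(torch.randn(256, 256, dtype=torch.bfloat16))
print(xq.dtype, scale.shape)  # torch.float8_e4m3fn torch.Size([2, 2])
```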

H100 500W:
```
bf16:                   shape (8192, 8192, 8192)   tflops 585.23   ms 1.879
fp8 scale + row gemm:   shape (8192, 8192, 8192)   tflops 931.80   ms 1.180
fp8 scale + block gemm: shape (8192, 8192, 8192)   tflops 594.84   ms 1.848
fp8 row gemm only:      shape (8192, 8192, 8192)   tflops 1125.51  ms 0.977
fp8 block gemm only:    shape (8192, 8192, 8192)   tflops 870.40   ms 1.263

bf16:                   shape (65536, 8192, 7168)  tflops 575.12   ms 13.383
fp8 scale + row gemm:   shape (65536, 8192, 7168)  tflops 1024.09  ms 7.516
fp8 scale + block gemm: shape (65536, 8192, 7168)  tflops 762.04   ms 10.100
fp8 row gemm only:      shape (65536, 8192, 7168)  tflops 1082.75  ms 7.108
fp8 block gemm only:    shape (65536, 8192, 7168)  tflops 828.34   ms 9.292

bf16:                   shape (65536, 3584, 8192)  tflops 546.31   ms 7.044
fp8 scale + row gemm:   shape (65536, 3584, 8192)  tflops 876.66   ms 4.390
fp8 scale + block gemm: shape (65536, 3584, 8192)  tflops 547.62   ms 7.027
fp8 row gemm only:      shape (65536, 3584, 8192)  tflops 1141.38  ms 3.372
fp8 block gemm only:    shape (65536, 3584, 8192)  tflops 828.31   ms 4.646
```
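As a sanity check, the TFLOPS column is consistent with the standard 2·M·N·K FLOP count divided by the measured time (an assumption about how the benchmark accounts FLOPs), e.g. for the first bf16 row:

```python
# TFLOPS = 2 * M * N * K / time, checked against the reported bf16 numbers.
M = N = K = 8192
ms = 1.879                        # reported bf16 latency
flops = 2 * M * N * K             # multiply-accumulate count for one GEMM
tflops = flops / (ms * 1e-3) / 1e12
print(round(tflops, 2))           # ~585.16, in line with the reported 585.23
```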

Differential Revision: https://www.internalfb.com/diff/D56337896?entry_point=27
netlify bot commented Apr 29, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: c6f166b
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6633cec069f67a000809b574
😎 Deploy Preview: https://deploy-preview-2547--pytorch-fbgemm-docs.netlify.app

facebook-github-bot (Contributor) commented

This pull request was exported from Phabricator. Differential Revision: D56503139
