
Integrate Triton row- and block-wise FP8 GEMM into LLM inference #2547

Open · wants to merge 2 commits into base: main
Conversation

choutim (Contributor) commented Apr 29, 2024

Summary:
Integrate the Triton row-/block-wise FP8 GEMM kernels into LLM inference.

Subsequent commits will add activation_scale_ub and fused row-wise quantization.

Differential Revision: D56503139
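For context, a minimal PyTorch reference of what a row-wise FP8 GEMM computes: quantize each row of the operands to float8 with a per-row scale, multiply, then rescale the output. The Triton kernels in this PR fuse and accelerate this pattern; the helper names below (`quantize_fp8_row`, `matmul_fp8_row_reference`) are illustrative only, not the API added here.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_fp8_row(x: torch.Tensor):
    # Per-row scale so the largest |value| in each row maps to FP8_MAX.
    row_max = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = row_max / FP8_MAX
    xq = (x / scale).to(torch.float8_e4m3fn)
    return xq, scale.squeeze(-1)

def matmul_fp8_row_reference(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a: [M, K], b: [N, K]; returns [M, N] in the input dtype.
    aq, a_scale = quantize_fp8_row(a)
    bq, b_scale = quantize_fp8_row(b)
    # Reference path: upcast and rescale; a real kernel does the matmul in FP8.
    out = aq.to(torch.float32) @ bq.to(torch.float32).T
    return (out * a_scale[:, None].float() * b_scale[None, :].float()).to(a.dtype)

a = torch.randn(16, 64, dtype=torch.bfloat16)
b = torch.randn(32, 64, dtype=torch.bfloat16)
print(matmul_fp8_row_reference(a, b).shape)  # torch.Size([16, 32])
```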

Summary:
Add FP8 row-/block-wise GEMM kernels with tests and benchmarks. The benchmark will be registered with TritonBench in a separate PR.
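Block-wise quantization differs from row-wise only in scale granularity: one scale per tile of the matrix instead of one per row. A rough sketch of the quantization step, again with illustrative names rather than the kernels added in this PR (assumes dimensions divisible by the block size):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_fp8_block(x: torch.Tensor, block: int = 128):
    # One scale per [block, block] tile; x's dimensions must divide evenly.
    m, k = x.shape
    xq = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    scale = torch.empty(m // block, k // block, dtype=torch.float32)
    for i in range(0, m, block):
        for j in range(0, k, block):
            tile = x[i:i + block, j:j + block]
            s = tile.abs().max().clamp(min=1e-12) / FP8_MAX
            scale[i // block, j // block] = s
            xq[i:i + block, j:j + block] = (tile / s).to(torch.float8_e4m3fn)
    return xq, scale

xq, scale = quantize_fp8_block(torch.randn(256, 256, dtype=torch.bfloat16))
print(xq.dtype, scale.shape)  # torch.float8_e4m3fn torch.Size([2, 2])
```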

H100 500W:
```
bf16:                   shape (8192, 8192, 8192)   tflops 585.23   ms 1.879
fp8 scale + row gemm:   shape (8192, 8192, 8192)   tflops 931.80   ms 1.180
fp8 scale + block gemm: shape (8192, 8192, 8192)   tflops 594.84   ms 1.848
fp8 row gemm only:      shape (8192, 8192, 8192)   tflops 1125.51  ms 0.977
fp8 block gemm only:    shape (8192, 8192, 8192)   tflops 870.40   ms 1.263

bf16:                   shape (65536, 8192, 7168)  tflops 575.12   ms 13.383
fp8 scale + row gemm:   shape (65536, 8192, 7168)  tflops 1024.09  ms 7.516
fp8 scale + block gemm: shape (65536, 8192, 7168)  tflops 762.04   ms 10.100
fp8 row gemm only:      shape (65536, 8192, 7168)  tflops 1082.75  ms 7.108
fp8 block gemm only:    shape (65536, 8192, 7168)  tflops 828.34   ms 9.292

bf16:                   shape (65536, 3584, 8192)  tflops 546.31   ms 7.044
fp8 scale + row gemm:   shape (65536, 3584, 8192)  tflops 876.66   ms 4.390
fp8 scale + block gemm: shape (65536, 3584, 8192)  tflops 547.62   ms 7.027
fp8 row gemm only:      shape (65536, 3584, 8192)  tflops 1141.38  ms 3.372
fp8 block gemm only:    shape (65536, 3584, 8192)  tflops 828.31   ms 4.646
```
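As a sanity check, the TFLOPS column is consistent with the standard 2·M·N·K FLOP count divided by the measured time (an assumption about how the benchmark accounts FLOPs), e.g. for the first bf16 row:

```python
# TFLOPS = 2 * M * N * K / time, checked against the reported bf16 numbers.
M = N = K = 8192
ms = 1.879                        # reported bf16 latency
flops = 2 * M * N * K             # multiply-accumulate count for one GEMM
tflops = flops / (ms * 1e-3) / 1e12
print(round(tflops, 2))           # ~585.16, in line with the reported 585.23
```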

Differential Revision: https://www.internalfb.com/diff/D56337896?entry_point=27
netlify bot commented Apr 29, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: c6f166b
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6633cec069f67a000809b574
😎 Deploy Preview: https://deploy-preview-2547--pytorch-fbgemm-docs.netlify.app

facebook-github-bot (Contributor) commented

This pull request was exported from Phabricator. Differential Revision: D56503139
