AutoGPTQ integration #924

Open · wants to merge 41 commits into main
Conversation

@Andrei-Aksionov Andrei-Aksionov (Collaborator) commented Feb 12, 2024

Hi there 👋

It's a bit of a late response to #583.

The task itself turned out to be quite large, so to speed up the process (and simplify life for whoever reviews the PR) I decided to include only the basics: the code can quantize a model and run inference, and it supports all the AutoGPTQ kernels. The remaining parts of AutoGPTQ functionality will be added in subsequent pull requests.

This PR doesn't include:

  • loading/uploading quantized weights to/from the HF hub
  • AWQ support (yes, AutoGPTQ supports even this)
  • fused attention and MLP layers (looking forward to implementing it... no, I don't)
  • possibly a few other things, but I'm not sure we should integrate everything

Benchmarks

Benchmarking was done on 1xA10G with the TinyLlama model (1.1B parameters).

Quantization config (an equivalent AutoGPTQ snippet follows the list):

  • 4-bit precision
  • group_size of 128
  • desc_act (act_order) disabled
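
For reference, the same configuration expressed directly with the upstream AutoGPTQ API looks roughly like this (a minimal sketch, not this PR's code path; the checkpoint id, calibration text, and output directory are placeholders):

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The settings from the benchmark above.
quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit precision
    group_size=128,  # 128 weights share one set of quantization parameters
    desc_act=False,  # act_order disabled
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# A tiny calibration set, enough only for a smoke test; real runs use many samples.
examples = [tokenizer("GPTQ needs a few calibration samples to quantize the weights.")]
model.quantize(examples)
model.save_quantized("tinyllama-1.1b-4bit-gptq")
```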

There are two tables: one for the prefill stage and one for the new-token generation stage.
Prefill was simulated by feeding the first 1024 samples from the Alpaca dataset into the model, one sample at a time, and averaging the results across them.
New-token generation was measured by generating 100 new tokens 100 times, using the default prompt from generate/base.py. A simplified timing sketch follows.
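
Roughly, the generation-stage token/sec numbers come from averaging over repeated runs, as in the sketch below (the `generate_fn` argument stands in for the repository's actual generation routine; this is not the benchmarking code from the PR):

```python
import time

import torch


def avg_tokens_per_sec(generate_fn, prompt_ids, new_tokens=100, runs=100):
    """Average tokens/sec over repeated generation runs (simplified sketch)."""
    rates = []
    for _ in range(runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        # `generate_fn` is a stand-in for the repo's generation routine.
        generate_fn(prompt_ids, max_new_tokens=new_tokens)
        torch.cuda.synchronize()
        rates.append(new_tokens / (time.perf_counter() - start))
    return sum(rates) / len(rates)
```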

Prefill

| Quantization | Kernel    | Precision | Token/sec | VRAM, GB | Perplexity | Utilization, % |
|--------------|-----------|-----------|-----------|----------|------------|----------------|
| None         | -         | 16        | 8422      | 2.67     | 14.01      | 76             |
| bnb.nf4      | -         | 4         | 4813      | 1.42     | 13.95      | 95             |
| gptq         | triton    | 4         | 3819      | 1.69     | 14.36      | 70             |
| gptq         | cuda_old  | 4         | 5151      | 1.69     | 14.36      | 99             |
| gptq         | cuda      | 4         | 4554      | 1.69     | 14.36      | 99             |
| gptq         | exllama   | 4         | 7965      | 1.69     | 14.29      | 95             |
| gptq         | exllamav2 | 4         | 7872      | 1.69     | 14.29      | 94             |
| gptq         | marlin    | 4         | 8560      | 1.68     | 14.17      | 75             |

New tokens generation

| Quantization | Kernel    | Precision | Token/sec | VRAM, GB | Utilization, % |
|--------------|-----------|-----------|-----------|----------|----------------|
| None         | -         | 16        | 55.47     | 2.23     | 46             |
| bnb.nf4      | -         | 4         | 44.92     | 1.03     | 32             |
| gptq         | triton    | 4         | 27.52     | 1.31     | 46             |
| gptq         | cuda_old  | 4         | 46.17     | 1.31     | 30             |
| gptq         | cuda      | 4         | 39.93     | 1.31     | 97             |
| gptq         | exllama   | 4         | 57.38     | 1.31     | 35             |
| gptq         | exllamav2 | 4         | 57.04     | 1.31     | 28             |
| gptq         | marlin    | 4         | 55.91     | 1.32     | 30             |

*Most likely these kernels are optimized for the A100, which might explain the unimpressive results and low utilization.

Here one can find benchmarks made by the HF team. They also show that the Marlin kernel turns out to be the fastest, though not as fast as expected.

Note

The Marlin kernel only supports graphics cards with compute capability >= 8.0. Here one can find a table of graphics cards and their compute capabilities.
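
A quick way to check whether the current GPU clears that bar, using plain PyTorch (not part of this PR):

```python
import torch

# Marlin requires compute capability >= 8.0 (e.g. A100 is 8.0, A10 is 8.6, T4 is 7.5).
major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (8, 0):
    print(f"Compute capability {major}.{minor}: the Marlin kernel is usable.")
else:
    print(f"Compute capability {major}.{minor}: pick another kernel, e.g. exllama/exllamav2.")
```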

Caveats:

  • It's not possible to run inference with GPTQ quantization and model compilation at the same time.
