AutoGPTQ integration #924

Open · wants to merge 41 commits into main
Conversation

@Andrei-Aksionov Andrei-Aksionov (Collaborator) commented Feb 12, 2024

Hi there 👋

It's a bit of a late response to #583.

The task itself turned out to be quite large, so to speed up the process (and simplify life for whoever reviews the PR) I decided to include only the basics: the code can quantize a model and run inference, and it supports all the AutoGPTQ kernels. The remaining parts of AutoGPTQ functionality will be added in subsequent pull requests.

This PR doesn't include:

  • loading/uploading quantized weights to/from the HF hub
  • AWQ support (yes, AutoGPTQ supports even this)
  • fused attention and MLP layers (looking forward to implementing it... no, I don't)
  • possibly a few other things, but I'm not sure we should integrate everything

Benchmarks

Benchmarking was done on 1xA10G with the TinyLlama model (1.1B parameters).

Quantization config (an equivalent AutoGPTQ snippet follows the list):

  • 4-bit precision
  • group_size of 128
  • desc_act (act_order) disabled
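
For reference, the same configuration expressed directly with the upstream AutoGPTQ API looks roughly like this (a minimal sketch, not this PR's code path; the checkpoint id, calibration text, and output directory are placeholders):

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The settings from the benchmark above.
quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit precision
    group_size=128,  # 128 weights share one set of quantization parameters
    desc_act=False,  # act_order disabled
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# A tiny calibration set, enough only for a smoke test; real runs use many samples.
examples = [tokenizer("GPTQ needs a few calibration samples to quantize the weights.")]
model.quantize(examples)
model.save_quantized("tinyllama-1.1b-4bit-gptq")
```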

There are two tables: one for the prefill stage and one for the new-token generation stage.
Prefill was simulated by feeding the first 1024 samples from the Alpaca dataset into the model, one sample at a time, and averaging the results across them.
New-token generation was measured by generating 100 new tokens 100 times, using the default prompt from generate/base.py. A simplified timing sketch follows.
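
Roughly, the generation-stage token/sec numbers come from averaging over repeated runs, as in the sketch below (the `generate_fn` argument stands in for the repository's actual generation routine; this is not the benchmarking code from the PR):

```python
import time

import torch


def avg_tokens_per_sec(generate_fn, prompt_ids, new_tokens=100, runs=100):
    """Average tokens/sec over repeated generation runs (simplified sketch)."""
    rates = []
    for _ in range(runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        # `generate_fn` is a stand-in for the repo's generation routine.
        generate_fn(prompt_ids, max_new_tokens=new_tokens)
        torch.cuda.synchronize()
        rates.append(new_tokens / (time.perf_counter() - start))
    return sum(rates) / len(rates)
```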

Prefill

| Quantization | Kernel    | Precision | Token/sec | VRAM, GB | Perplexity | Utilization, % |
|--------------|-----------|-----------|-----------|----------|------------|----------------|
| None         | -         | 16        | 8422      | 2.67     | 14.01      | 76             |
| bnb.nf4      | -         | 4         | 4813      | 1.42     | 13.95      | 95             |
| gptq         | triton    | 4         | 3819      | 1.69     | 14.36      | 70             |
| gptq         | cuda_old  | 4         | 5151      | 1.69     | 14.36      | 99             |
| gptq         | cuda      | 4         | 4554      | 1.69     | 14.36      | 99             |
| gptq         | exllama   | 4         | 7965      | 1.69     | 14.29      | 95             |
| gptq         | exllamav2 | 4         | 7872      | 1.69     | 14.29      | 94             |
| gptq         | marlin    | 4         | 8560      | 1.68     | 14.17      | 75             |

New tokens generation

| Quantization | Kernel    | Precision | Token/sec | VRAM, GB | Utilization, % |
|--------------|-----------|-----------|-----------|----------|----------------|
| None         | -         | 16        | 55.47     | 2.23     | 46             |
| bnb.nf4      | -         | 4         | 44.92     | 1.03     | 32             |
| gptq         | triton    | 4         | 27.52     | 1.31     | 46             |
| gptq         | cuda_old  | 4         | 46.17     | 1.31     | 30             |
| gptq         | cuda      | 4         | 39.93     | 1.31     | 97             |
| gptq         | exllama   | 4         | 57.38     | 1.31     | 35             |
| gptq         | exllamav2 | 4         | 57.04     | 1.31     | 28             |
| gptq         | marlin    | 4         | 55.91     | 1.32     | 30             |

*Most likely these kernels are optimized for the A100, which might explain the unimpressive results and low utilization.

Here one can find benchmarks made by the HF team. They also show that the Marlin kernel turns out to be the fastest, though not as fast as expected.

Note

The Marlin kernel only supports graphics cards with compute capability >= 8.0. Here one can find a table of graphics cards and their compute capabilities.
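
A quick way to check whether the current GPU clears that bar, using plain PyTorch (not part of this PR):

```python
import torch

# Marlin requires compute capability >= 8.0 (e.g. A100 is 8.0, A10 is 8.6, T4 is 7.5).
major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (8, 0):
    print(f"Compute capability {major}.{minor}: the Marlin kernel is usable.")
else:
    print(f"Compute capability {major}.{minor}: pick another kernel, e.g. exllama/exllamav2.")
```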

Caveats:

  • It's not possible to run inference with GPTQ quantization and model compilation at the same time.
