This repository has been archived by the owner on Dec 24, 2023. It is now read-only.

chatglm-q

A ChatGLM2 reference implementation without the Huggingface transformers dependency. This implementation is optimized for ONNX export, int8 and int4 GPTQ quantization, and more. OpenAI Triton currently supports Linux only; on Windows, run it under WSL 2 with CUDA support (Windows 11, or Windows 10 21H2 and later).

Updates

This repository has been fully upgraded to ChatGLM2 and no longer supports the first-generation ChatGLM-6b. For the legacy code, see the chatglm-legacy branch.

Note: due to an issue with the ONNXRuntime MatMulInteger operator, the quantized ONNX models of the v2 model cannot run on GPU, and on x86-64 CPUs they produce numerical deviations that prevent correct output.

Installation

Install PyTorch first. PyTorch 2 Linux releases bundle OpenAI Triton; if you are using PyTorch 1.x, install the triton package manually. Triton also requires a build toolchain such as build-essential. Then install this package with the following command:

pip install --upgrade git+https://github.com/K024/chatglm-q

You can also clone this repo first and install it with pip install ..
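
After installation, a minimal sanity check confirms that PyTorch, CUDA and Triton can all be imported. This is a sketch only; the triton import succeeds once the package is available, either bundled with PyTorch 2 or installed manually:

import torch
import triton

print(torch.__version__)          # PyTorch version, e.g. 2.x
print(torch.cuda.is_available())  # should be True on a working CUDA setup
print(triton.__version__)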


For WSL 2 users, check your WSL version first. DO NOT manually install nvidia-driver inside WSL 2; the Windows host driver is shared with the guest. For details, see the nvidia docs.

> wsl --list -v
  NAME      STATE           VERSION
* Ubuntu    Running         2
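
Inside the WSL 2 shell, the host GPU should already be visible, so running nvidia-smi there is a quick way to confirm the driver passthrough works (assuming the NVIDIA driver is installed on the Windows side):

$ nvidia-smi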

Usage

import torch
from chatglm_q.decoder import ChatGLMDecoder, chat_template

device = torch.device("cuda")
# optionally pass a `torch_dtype=torch.float16` to set the activation dtype
decoder = ChatGLMDecoder.from_pretrained("K024/chatglm2-6b-int4g32", device=device)

prompt = chat_template([], "我是谁?")
for text in decoder.generate(prompt):
    print(text)
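
chat_template takes the conversation history as its first argument. A hypothetical multi-turn sketch follows; treating the history as a list of (question, answer) pairs is an assumption about this API, not something documented here:

# assumption: history is a list of (question, answer) pairs
history = [("你好", "你好!有什么可以帮你?")]
prompt = chat_template(history, "请介绍一下你自己")
for text in decoder.generate(prompt):
    print(text)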

For weight conversion, manual quantization, ONNX model export and more, check out the scripts in the examples directory and make your own modifications.

Web UI

pip install streamlit
cd examples
streamlit run web-ui.py
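
By default, streamlit serves the app at http://localhost:8501.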

Available models

Type      Huggingface Hub             Recommended For
int8      K024/chatglm2-6b-int8       Linux/WSL2 CUDA, 9G+ VRAM
int4g32   K024/chatglm2-6b-int4g32    Linux/WSL2 CUDA, 6G+ VRAM
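
Either model id can be passed directly to ChatGLMDecoder.from_pretrained, as in the Usage section above:

decoder = ChatGLMDecoder.from_pretrained("K024/chatglm2-6b-int8", device=device)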

The model weights are released under the same license as the original ChatGLM2-6b model; see MODEL LICENSE.

TODO

  • Integration with Huggingface Transformers
  • Support CUDA operators on Windows
  • P-Tuning
