This repository has been archived by the owner on Dec 24, 2023. It is now read-only.

chatglm-q

A ChatGLM2 reference implementation without the Huggingface transformers dependency. This implementation is optimized for ONNX export, int8 and int4 GPTQ quantization, and more. OpenAI Triton currently supports Linux only; on Windows, run it under WSL 2 with CUDA support (Windows 11, or Windows 10 21H2 and later).

Updates

This repository has been fully upgraded to ChatGLM2 and no longer supports the first-generation ChatGLM-6b. For the legacy code, see the chatglm-legacy branch.

Note: due to an issue with the ONNXRuntime MatMulInteger operator, the quantized ONNX models of the v2 model cannot run on GPU, and on x86-64 CPUs they produce numerical deviations that prevent correct output.

Installation

Install PyTorch first. PyTorch 2 Linux releases bundle OpenAI Triton; if you are using PyTorch 1.x, install the triton package manually. Triton also requires a build toolchain such as build-essential. Then install this package with the following command:

pip install --upgrade git+https://github.com/K024/chatglm-q

You can also clone this repo first and install it with pip install ..
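
After installation, a minimal sanity check confirms that PyTorch, CUDA and Triton can all be imported. This is a sketch only; the triton import succeeds once the package is available, either bundled with PyTorch 2 or installed manually:

import torch
import triton

print(torch.__version__)          # PyTorch version, e.g. 2.x
print(torch.cuda.is_available())  # should be True on a working CUDA setup
print(triton.__version__)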


For WSL 2 users, check your WSL version first. DO NOT manually install nvidia-driver inside WSL 2; the Windows host driver is shared with the guest. For details, see the nvidia docs.

> wsl --list -v
  NAME      STATE           VERSION
* Ubuntu    Running         2
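
Inside the WSL 2 shell, the host GPU should already be visible, so running nvidia-smi there is a quick way to confirm the driver passthrough works (assuming the NVIDIA driver is installed on the Windows side):

$ nvidia-smi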

Usage

import torch
from chatglm_q.decoder import ChatGLMDecoder, chat_template

device = torch.device("cuda")
# optionally pass a `torch_dtype=torch.float16` to set the activation dtype
decoder = ChatGLMDecoder.from_pretrained("K024/chatglm2-6b-int4g32", device=device)

prompt = chat_template([], "我是谁?")
for text in decoder.generate(prompt):
    print(text)
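
chat_template takes the conversation history as its first argument. A hypothetical multi-turn sketch follows; treating the history as a list of (question, answer) pairs is an assumption about this API, not something documented here:

# assumption: history is a list of (question, answer) pairs
history = [("你好", "你好!有什么可以帮你?")]
prompt = chat_template(history, "请介绍一下你自己")
for text in decoder.generate(prompt):
    print(text)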

For weight conversion, manual quantization, ONNX model export and more, check out the scripts in the examples directory and make your own modifications.

Web UI

pip install streamlit
cd examples
streamlit run web-ui.py
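
By default, streamlit serves the app at http://localhost:8501.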

Available models

Type      Huggingface Hub             Recommended For
int8      K024/chatglm2-6b-int8       Linux/WSL2 CUDA, 9G+ VRAM
int4g32   K024/chatglm2-6b-int4g32    Linux/WSL2 CUDA, 6G+ VRAM
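
Either model id can be passed directly to ChatGLMDecoder.from_pretrained, as in the Usage section above:

decoder = ChatGLMDecoder.from_pretrained("K024/chatglm2-6b-int8", device=device)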

The model weights are released under the same license as the original ChatGLM2-6b model; see MODEL LICENSE.

TODO

  • Integration with Huggingface Transformers
  • Support CUDA operators on Windows
  • P-Tuning
