Neural Compressor

An open-source Python library supporting popular model compression techniques for ONNX



Neural Compressor provides popular model compression techniques inherited from Intel Neural Compressor, with a focus on ONNX model quantization through ONNX Runtime, such as SmoothQuant and weight-only quantization. The sections below cover installation, key features, typical examples, and open collaborations.

Installation

Install from source

git clone https://github.com/onnx/neural-compressor.git
cd neural-compressor
pip install -r requirements.txt
pip install .

Note: further installation methods can be found in the Installation Guide.

Getting Started

Setting up the environment:

pip install onnx-neural-compressor "onnxruntime>=1.17.0" onnx

After successfully installing these packages, try your first quantization program.

Note: please install from source until the formal PyPI release.
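
As a quick sanity check, you can verify that the packages import cleanly and that the installed ONNX Runtime meets the version requirement. This is a minimal sketch, not part of the library's documented workflow:

import onnx
import onnxruntime
import onnx_neural_compressor

# Confirm the installed ONNX Runtime satisfies the >=1.17.0 requirement
print("onnx:", onnx.__version__)
print("onnxruntime:", onnxruntime.__version__)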

Weight-Only Quantization (LLMs)

The following example demonstrates weight-only quantization on LLMs. It supports Intel CPUs and Nvidia GPUs; when multiple devices are available, the device is selected automatically for efficiency.

Run the example:

import onnx

from onnx_neural_compressor.quantization import matmul_nbits_quantizer

# Load the FP32 ONNX model to be quantized (path is illustrative)
model = onnx.load("model.onnx")

algo_config = matmul_nbits_quantizer.RTNWeightOnlyQuantConfig()
quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
    model,
    n_bits=4,  # quantize weights to 4 bits
    block_size=32,  # group size sharing one scale per weight block
    is_symmetric=True,  # symmetric quantization around zero
    algo_config=algo_config,
)
quant.process()
best_model = quant.model
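
To persist the result, the quantized graph can be saved with the standard onnx API. This is a minimal sketch: it assumes best_model is an onnx.ModelProto, and the output path is illustrative.

import onnx

# Save the 4-bit quantized model (use save_as_external_data=True
# if the model exceeds the 2 GB protobuf limit)
onnx.save(best_model, "model_int4.onnx")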

Static Quantization

from onnx_neural_compressor.quantization import quantize, StaticQuantConfig
from onnx_neural_compressor.quantization.calibrate import CalibrationDataReader


class DataReader(CalibrationDataReader):
    def __init__(self):
        self.encoded_list = []
        # append calibration samples (dicts mapping input names to arrays) to self.encoded_list

        self.iter_next = iter(self.encoded_list)

    def get_next(self):
        # return the next calibration sample, or None when the data is exhausted
        return next(self.iter_next, None)

    def rewind(self):
        # restart iteration so calibration can make another pass over the data
        self.iter_next = iter(self.encoded_list)


# paths are illustrative: "model" is the FP32 input model, "output_model_path" the quantized result
model = "model.onnx"
output_model_path = "model_quant.onnx"

data_reader = DataReader()
config = StaticQuantConfig(calibration_data_reader=data_reader)
quantize(model, output_model_path, config)
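
After quantization, a quick smoke test with ONNX Runtime confirms the model loads and runs. This is a minimal sketch: the path is illustrative, the first input is assumed to be float32, and dynamic dimensions are replaced with 1.

import numpy as np
import onnxruntime as ort

# Load the quantized model and run it once on random data as a smoke test
session = ort.InferenceSession("model_quant.onnx", providers=["CPUExecutionProvider"])
inp = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # set dynamic dims to 1
dummy = np.random.rand(*shape).astype(np.float32)  # assumes a float32 input
outputs = session.run(None, {inp.name: dummy})
print("output shapes:", [o.shape for o in outputs])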

Documentation

Overview

  • Architecture
  • Workflow
  • Examples

Feature

  • Quantization
  • SmoothQuant
  • Weight-Only Quantization (INT8/INT4)
  • Layer-Wise Quantization

Additional Content

Communication

  • GitHub Issues: mainly for bug reports, new feature requests, and questions.
  • Email: we welcome research ideas on model compression techniques; reach out by email to collaborate.
