Quantization Support In ONNX

ONNX quantization Requirements:

Interoperability MUST be ensured. ONLY widely accepted quantization schema can be standardized in ONNX. In this design, 8 bits linear (scale/zero_point) quantization will be standardized.
Customized quantization schema should be allowed. ONNX should be able to represent customized quantization schemas (the schema hasn’t been standardized in ONNX yet) with a subgraph consisting of primitive operators.
All ONNX operators must define a mathematical function of the following form: outputs = OP(inputs, attrs) It means the data needed for mathematical calculation defined by an op must be either an input or an attribute.
Enable both static and dynamic quantization. Quantization parameters used in defining an op will be defined as inputs/outputs. Static quantization will be a special case of dynamic one, where the quantization parameter inputs are from either initializers or constant nodes. NOTE: as a best practice, weights in an inference model should be statically quantized.
Support model verification for static quantization models. The verification includes, a. Same tensor should have same real-value representation. If they use same static quantization parameters, then this can be ensured. b. Any other kind of quantization parameters’ value check before sending a model to a hardware vendor.

Goals of this design:

Add a small set of operators to standardize 8 bits linear (scale/zero_point) quantization.
Add a small set of operators to further enable ONNX to represent other quantization schemas.
Add quantization information as model level annotation for easy model verification.

Status:

To support 8 bit linear (scale/zero_point) quantization, QuantizeLinear/DequantizeLinear/QLinearConv/QLinearMatmul were added.
To enable other quantization schemas, ConvInteger/MatmulInteger were added.
More operators/quantized data types will be added as needed.

Provide feedback