PTMQ: Post-training Multi-Bit Quantization of Neural Networks
2024-08-11
Keywords: #Activation #Batch Normalization
1. Introduction
- QAT multi-bit quantization problems (★★★)
- Time-consuming
- When switching bit-width, BN statistics need to be recalculated (AdaptiveBN) → hurts the real-time responsiveness of multi-bit quantization.
- PTQ approaches: minimizing the discrepancy between FP32 and quantized feature maps using MSE or cosine distance.
- Searching for quantization scale factors. → Insufficient for achieving robust multi-bit quantization.
- Full-parameter fine-tuning can lead to overfitting with limited data.
- Optimizing rounding values. → Promising, but challenging.
- Folding norm layers with previous layers helps mitigate the influence of statistical parameters (a Conv-BN folding sketch follows the contribution list below).
- PTMQ pipeline
- Contributions
- Propose PTMQ, a PTQ multi-bit quantization framework. PTMQ supports uniform and mixed-precision quantization, and can perform real-time bit-width conversion.
- Propose Multi-bit Feature Mixer (MFM), which enhances the robustness of rounding values at various bit-widths by fusing features of different bit-widths.
- Introduce Group-wise Distillation Loss (GD-Loss) to enhance the correlation between different bit-width groups, improving the overall quantization performance of PTMQ.
- Experiments conducted on CNN and ViT backbones.
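Regarding the norm-folding remark above, here is a minimal sketch of standard Conv-BN folding (the helper name and the assumption of an affine BatchNorm are mine; PTMQ only relies on the general technique):

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a Conv2d whose weights/bias absorb the BN statistics, so the
    folded conv alone reproduces conv -> bn at inference time (affine BN assumed)."""
    w = conv.weight.clone()                                   # (out_c, in_c/groups, kh, kw)
    b = conv.bias.clone() if conv.bias is not None else torch.zeros(w.size(0))

    gamma, beta = bn.weight, bn.bias                          # BN affine parameters
    mean, var, eps = bn.running_mean, bn.running_var, bn.eps  # running statistics
    std = torch.sqrt(var + eps)

    # Fold: w' = w * gamma / std,  b' = (b - mean) * gamma / std + beta
    w_folded = w * (gamma / std).reshape(-1, 1, 1, 1)
    b_folded = (b - mean) * gamma / std + beta

    folded = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       dilation=conv.dilation, groups=conv.groups, bias=True)
    folded.weight.data.copy_(w_folded)
    folded.bias.data.copy_(b_folded)
    return folded
```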
2. Related Works
- Rounding-based PTQ: optimizing 1) rounding value, 2) scaling factor
- AdaRound (2020): Learning the rounding mechanism by analyzing the second-order error term. → Proposes layer-by-layer reconstruction of the output.
- BRECQ (2021): Extends AdaRound to block-wise reconstruction.
- Qdrop (2022): Higher accuracy can be achieved by randomly dropping quantized act. values and incorporating act. quant. into the weight tuning.
- FlexRound (2023): Weights can be mapped to a wider range of quantized values, rather than being limited to the two nearest grid points (rounding by 0 or 1).
- Problem: They can only quantize one specific bit-width at a time and cannot be directly applied to multi-bit quantization.
- Multi-Bit Quantization:
- RobustQuant (2020): A uniformly distributed weight tensor is more tolerant to quantization (higher SNR) and less sensitive to specific quantization settings than normally distributed weights → Proposes kurtosis regularization.
- Any-Precision DNN: Trains using DoReFa, and the quantized model is stored in FP32. At runtime, the model can be set to different bit-widths by truncation.
- MultiQuant (2022): Addresses vicious competition between high-bit-width and low-bit-width quantization → Proposes adaptive soft-labeling strategy
- Problem: Previous methods focus on QAT.
3. Approach
3.1 PTMQ Modeling
- Multi-bit quantization model formulation (a generic sketch is given at the end of this subsection).
- For PTQ, quantization parameters are usually optimized as block-independent sequences.
- PTQ aims to learn robust rounding values and a stand-alone quantization step size.
- $\mathbb{R} \in \{0, 1\}$: the rounding value (round down or up) applied at inference.
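As a sketch of the rounding-based quantizer these notes refer to (standard AdaRound/BRECQ-style notation; the clipping bounds $n$, $p$, the step size $s$, and the reconstruction objective below are generic assumptions, not the paper's exact formulas):

$$
\hat{w} = s \cdot \mathrm{clip}\!\left(\left\lfloor \frac{w}{s} \right\rfloor + \mathbb{R},\; n,\; p\right), \qquad \mathbb{R} \in \{0, 1\},
$$

and block-wise reconstruction optimizes $\mathbb{R}$ together with the activation step size $s_a$ by minimizing, e.g., $\min_{\mathbb{R},\, s_a} \lVert f(w, x) - f(\hat{w}, \hat{x}) \rVert_F^2$ for each block $f$ over calibration data.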
3.2 The Pipeline of PTMQ
- Scaling factors are initialized. Block-wise optimization is adopted for reconstruction.
- Bit-width configuration $\mathbb{B}$ is divided into three groups (low / medium / high); each group requires a different level of scale tuning during reconstruction.
- During reconstruction of a block, a random bit-width is selected from each bit group. The block's input is the integration of output features from the different bit groups, produced by the Multi-bit Feature Mixer.
- The block under reconstruction produces outputs at the random bit-widths sampled from each bit group.
- Reconstruction loss = sum of the losses over all sampled bit-widths.
- Rounding values and activation step sizes within the block are optimized by back-propagation (a minimal sketch of this loop follows below).
- PTMQ aims to enhance the robustness of the quantized block by improving its lower and upper performance bounds.
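A minimal PyTorch-style sketch of this block-reconstruction loop, under stated assumptions: the bit groups, the `set_bit_width` helper on the quantized block, and the plain MSE loss are placeholders, and MFM input mixing and GD-Loss are omitted for brevity:

```python
import random
import torch
import torch.nn.functional as F

# Assumed low / medium / high bit groups; the paper's actual grouping may differ.
BIT_GROUPS = {"low": [2, 3], "mid": [4, 5], "high": [6, 8]}

def reconstruct_block(fp_block, q_block, calib_loader, optimizer, num_iters=1000):
    """Block-wise reconstruction: at each step, sample one bit-width per group,
    run the quantized block at each sampled bit-width, and sum the losses.
    `optimizer` is assumed to hold the block's rounding values and activation step sizes."""
    for _, x in zip(range(num_iters), calib_loader):
        with torch.no_grad():
            target = fp_block(x)               # FP32 reference output of the block

        loss = 0.0
        for bits in BIT_GROUPS.values():
            b = random.choice(bits)            # one random bit-width per group
            q_block.set_bit_width(b)           # hypothetical helper on the quantized block
            loss = loss + F.mse_loss(q_block(x), target)

        optimizer.zero_grad()
        loss.backward()                        # updates rounding values / act. step sizes
        optimizer.step()
```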
3.3 Optimization-based Multi-Bit Quantization
Absorb Multi-Bit Errors into Rounding Values
- Activation quantization: $\hat a = a(1+u)$, where $u$ is affected by the bit-width and the rounding error.
- Transferring noise from activations to weights: $w[a \odot (1+u(x))] = [w \odot (1+v(x))]a$
- Similarly, the quantization noise $(1+\tilde{u}(x))$ (due to activation quantization and multiple bit-widths) can be transferred to a weight perturbation $(1+\tilde{v}(x))$, which is absorbed into the rounding value $\mathbb{R}$.
- Intuition: activation quantization and multiple bit-widths introduce multi-bit quantization errors into the network weights, which are absorbed by the rounding values.
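A worked scalar instance of the identity above (for a single weight $w$ and activation $a$; the vector/matrix case follows element-wise):

$$
w\,\big[a\,(1+u(x))\big] \;=\; \big[w\,(1+u(x))\big]\,a \;=\; \big[w\,(1+v(x))\big]\,a, \qquad v(x) = u(x),
$$

i.e. a multiplicative noise on the activation can equivalently be read as a multiplicative perturbation of the weight, which is what the rounding value $\mathbb{R}$ is asked to absorb.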
Multi-Bit Feature Mixer
- Mixing multi-bit quantized features as inputs enhances the robustness of rounding values across various bit-widths.
- Case 1) Randomly select one type of feature among L/M/H bit-width groups, and fuse it with FP features through random dropout (Qdrop, 2022).
- Case 2) Random-bit features are sampled from each group → H features are mixed with M features through addition → randomly replaced with L features to produce the fused feature.
- Case 3) Random-bit features are sampled from each group and mixed through addition → fused with FP features by random dropout. (★ Most effective! ★)
- Key point: the input of the reconstruction block combines multiple quantization features instead of being biased towards a single quantization feature.
- Formulation of MFM:
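A minimal PyTorch-style sketch consistent with Case 3 above; the function name, the averaging of the three group features, and the QDrop-style element-wise dropout probability are assumptions, not the paper's exact formulation:

```python
import torch

def multi_bit_feature_mixer(x_fp, x_low, x_mid, x_high, drop_prob=0.5):
    """Case 3: mix random-bit features from the low/mid/high groups by addition,
    then fuse with the FP32 features via element-wise random dropout (QDrop-style)."""
    # Additive mixing of the three bit-group features (averaged here to keep
    # the activation scale comparable -- an assumption, not from the paper).
    x_mix = (x_low + x_mid + x_high) / 3.0

    # Element-wise mask: with probability drop_prob keep the FP32 feature,
    # otherwise keep the mixed quantized feature.
    keep_fp = (torch.rand_like(x_fp) < drop_prob).float()
    return keep_fp * x_fp + (1.0 - keep_fp) * x_mix
```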