PTMQ: Post-training Multi-Bit Quantization of Neural Networks

2024-08-11


Keywords: #Activation #Batch Normalization


1. Introduction

  • QAT multi-bit quantization problems (★★★)
    • Time-consuming
    • When switching bit-widths, BN statistics need to be recalculated (AdaptiveBN) → hurts the real-time responsiveness of multi-bit quantization.
  • PTQ approaches: Minimizing discrepancy between FP32 and quantized feature maps using MSE, cosine distance.
    1. Searching for quantization scale factors.
      • Insufficient for achieving robust multi-bit quantization
      • Full-parameter fine-tuning can lead to overfitting with limited data.
    2. Optimizing rounding values. → Promising, but challenging.
  • Folding normalization layers into the preceding layers helps mitigate the influence of their statistical parameters (see the BN-folding sketch below).
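
For reference, a minimal BN-folding sketch in PyTorch (assumptions: a plain `Conv2d` followed by `BatchNorm2d`; the helper name `fold_bn_into_conv` is mine, not from the paper):

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics into the preceding convolution.

    Standard folding identities:
        w_fold = w * gamma / sqrt(var + eps)
        b_fold = (b - mean) * gamma / sqrt(var + eps) + beta
    """
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused
```

The fused convolution is then quantized directly, so no BN statistics need to be recomputed when the bit-width is switched.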

  • PTMQ pipeline

  • Contributions
    • Propose PTMQ, a PTQ multi-bit quantization framework. PTMQ supports uniform and mixed-precision quantization, and can perform real-time bit-width conversion.
    • Propose Multi-bit Feature Mixer (MFM), which enhances the robustness of rounding values at various bit-widths by fusing features of different bit-widths.
    • Introduce Group-wise Distillation Loss (GD-Loss) to enhance the correlation between different bit-width groups, improving the overall quantization performance of PTMQ.
    • Experiments conducted on CNN and ViT backbones.

2. Related Work

  • Rounding-based PTQ: optimizing 1) rounding values and 2) scaling factors (see the sketch after this list)
    • AdaRound (2020): Learning the rounding mechanism by analyzing the second-order error term. → Proposes layer-by-layer reconstruction of the output.
    • BRECQ (2021): Extends AdaRound to block-wise reconstruction.
    • QDrop (2022): Higher accuracy can be achieved by randomly dropping quantized activation values and incorporating activation quantization into the weight tuning.
    • FlexRound (2023): Weights can be mapped to a wider range of quantized values, rather than being limited to the two nearest grid points (rounding down or up).
    • Problem: These methods can only quantize one specific bit-width at a time and cannot directly perform multi-bit quantization.
  • Multi-Bit Quantization:
    • RobustQuant (2020): A uniformly distributed weight tensor is more tolerant to quantization (higher SNR, less sensitivity to specific quantization settings) than a normally distributed one → proposes kurtosis regularization. RobustQuant
    • Any-Precision: Trains with DoReFa, and the quantized model is stored in FP32; at runtime it can be set to different bit-widths by truncation. Any-Precision DNN
    • MultiQuant (2022): Addresses vicious competition between high-bit-width and low-bit-width quantization → Proposes adaptive soft-labeling strategy
    • Problem: Previous methods focus on QAT.
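
For context, a minimal sketch of the learnable-rounding idea that these rounding-based methods share (an AdaRound-style rectified sigmoid; illustrative code, not the authors' implementations):

```python
import torch

GAMMA, ZETA = -0.1, 1.1  # stretch constants of AdaRound's rectified sigmoid

def soft_round(v: torch.Tensor) -> torch.Tensor:
    """Continuous relaxation of the 0/1 rounding decision, learnable by back-prop."""
    return torch.clamp(torch.sigmoid(v) * (ZETA - GAMMA) + GAMMA, 0, 1)

def fake_quant_weight(w: torch.Tensor, step: float, v: torch.Tensor,
                      n: int = -128, p: int = 127) -> torch.Tensor:
    """Quantize w with a learnable rounding variable v instead of round-to-nearest."""
    w_int = torch.floor(w / step) + soft_round(v)   # floor, then learn whether to round up
    return torch.clamp(w_int, n, p) * step          # clip to the integer grid and dequantize
```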

3. Approach

3.1 PTMQ Modeling

  • Multi-bit quantization model formulation: see the sketch after this list.

  • For PTQ, quantization parameters are usually optimized sequentially and independently for each block.

  • PTQ aims to learn robust rounding values and stand-alone quantization step size.

  • $\mathbb{R} \in \{0,1\}$: The rounding value for rounding up or down at inference.
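
A hedged sketch of the implied formulation (notation mine, following standard rounding-based PTQ; the paper's exact symbols may differ). The weight quantizer with a learnable rounding value and the block-wise multi-bit objective are:

$$
\hat{w} \;=\; s_w \cdot \operatorname{clip}\!\big(\lfloor w / s_w \rfloor + \mathbb{R},\; n,\; p\big), \qquad \mathbb{R} \in \{0,1\},
$$

$$
\min_{\mathbb{R},\, \{s_a^{(b)}\}} \;\; \sum_{b \in \mathbb{B}} \Big\| f\big(\hat{W},\, \hat{x}^{(b)}\big) - f\big(W,\, x\big) \Big\|_F^2 ,
$$

where $f(\cdot)$ is the block being reconstructed, $\mathbb{B}$ is the supported bit-width configuration, and $s_a^{(b)}$ is the activation step size at bit-width $b$.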

3.2 The Pipeline of PTMQ

  1. Scaling factors are initialized, and block-wise optimization is adopted for reconstruction.
  2. The bit-width configuration $\mathbb{B}$ is divided into three groups (low, medium, high); each requires a different level of tuning during reconstruction.
  3. During reconstruction of a block, a random bit-width is selected from each bit group. The block's input is the integration of output features from the different bit groups, produced by the Multi-bit Feature Mixer.
  4. The output of the block under reconstruction is computed for the bit-widths randomly sampled from each bit group.
    • Reconstruction loss = sum of the losses over all sampled bit-widths.
    • Rounding values and activation step sizes within the block are optimized by back-propagation. PTMQ aims to enhance the robustness of the quantized block by improving its lower and upper performance bounds (see the sketch after this list).
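
A rough sketch of one block-reconstruction step as described above (all names here, e.g. `mfm`, `bit_groups`, `block.quant_parameters()`, `block(x, bit=b)`, are my own placeholders, not the paper's API):

```python
import random
from itertools import cycle

import torch
import torch.nn.functional as F

def reconstruct_block(block, fp_block, calib_loader, bit_groups, mfm, num_iters=20000):
    """Block-wise reconstruction with one random bit-width sampled per group."""
    # Assumed helper: returns the block's rounding values and activation step sizes.
    opt = torch.optim.Adam(block.quant_parameters(), lr=1e-3)
    for it, x_fp in enumerate(cycle(calib_loader)):
        if it >= num_iters:
            break
        with torch.no_grad():
            y_fp = fp_block(x_fp)                      # FP32 reference output of the block
        x_mix = mfm(block, x_fp, bit_groups)           # Multi-bit Feature Mixer builds the block input
        # Sum the reconstruction losses over one random bit-width per group (low / medium / high).
        loss = sum(F.mse_loss(block(x_mix, bit=random.choice(g)), y_fp) for g in bit_groups)
        opt.zero_grad()
        loss.backward()
        opt.step()
```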

3.3 Optimization-based Multi-Bit Quantization

Absorb Multi-Bit Errors into Rounding Values

  • Activation quantization: $\hat a = a (1+u)$ → $u$ is affected by the bit-width and the rounding error.
  • Transferring noise from activations to weights (expanded below): $w[a\odot (1+u(x))] = [w\odot(1+v(x))]a$
  • Similarly, the quantization noise $(1+\tilde{u}(x))$ introduced by activation quantization at multiple bit-widths can be transferred to a weight perturbation $(1+\tilde{v}(x))$, which is absorbed by the rounding value $\mathbb{R}$.
  • Intuition: Activation quantization across multiple bit-widths introduces multi-bit quantization errors into the network weights, and these errors are absorbed by the rounding values.
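
A hedged expansion of the transfer step (my notation): for a single dot product between a weight vector $w$ and an activation vector $a$, an element-wise multiplicative noise can be moved from one factor to the other,

$$
w^{\top}\big(a \odot (1+u(x))\big) \;=\; \big(w \odot (1+u(x))\big)^{\top} a \;=\; \big(w \odot (1+v(x))\big)^{\top} a,
$$

so the activation perturbation $u(x)$ can be relabelled as a weight perturbation $v(x)$, which the learnable rounding term then absorbs.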

Multi-Bit Feature Mixer

  • Mixing features of multiple bit-widths as inputs enhances the robustness of the rounding values across bit-widths.
  • Case 1) Randomly select one type of feature among the L/M/H bit-width groups and fuse it with FP features through random dropout (QDrop, 2022).
  • Case 2) Random-bit features are sampled from each group → H features are mixed with M features through addition → randomly replaced with L features to produce the fused feature.
  • Case 3) Random-bit features are sampled from each group and mixed through addition → fused with FP features by random dropout. (★ Most effective! ★)

  • Key point: The input of the reconstruction block combines multiple quantization features instead of being biased towards a single quantization feature.
  • Formulation of MFM (a sketch follows below):
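
A hedged sketch of the MFM consistent with Case 3 above (my notation; the paper's exact mixing and normalization may differ):

$$
x^{\text{mix}} \;=\; x^{(l)} + x^{(m)} + x^{(h)}, \qquad
\tilde{x} \;=\; M_p \odot x^{\text{fp}} + \big(1 - M_p\big) \odot x^{\text{mix}},
$$

where $x^{(l)}, x^{(m)}, x^{(h)}$ are features at bit-widths randomly sampled from the low/medium/high groups, $x^{\text{fp}}$ is the full-precision feature, and $M_p$ is an element-wise random dropout mask with keep probability $p$ (as in QDrop).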