
MultiQuant: Training Once for Multi-Bit Quantization of Neural Networks

2024-08-06



1. Introduction

  • Main problem: QAT simulates the quantization process during training, making the trained model highly dependent on the target bit-width.
  • Contributions:
    1. Propose the MultiQuant framework with LRH (Lowest-Random-Highest) Co-Training to support subnets of different bit-widths, under both uniform and mixed-precision quantization, without retraining.
    2. Identify the problem of vicious competition between high and low bit-widths in supernet training, and design an online adaptive label to alleviate it.
    3. Propose Monte Carlo sampling instead of uniform sampling to improve the efficiency of the mixed-precision search.
  • OFA (2020): Progressive shrinking to reduce interference between different sub-nets in weight-sharing supernet training.
  • BigNAS (2020): Sandwich rule to achieve single-stage supernet training. -> Inspiration for the LRH Co-Training strategy (see the sketch after this list).
  • AttentiveNAS (2020): Attentive sampling of networks on Pareto-best and Pareto-worst to improve performance.
  • Key point: Train a single over-parameterized supernet that can be directly sampled/sliced to different candidate sub-nets for instant inference/deployment.

  • APQ (2020, mixed-precision search): Uses a genetic/evolutionary algorithm over collected quantized accuracy data points to realize a joint search of network architecture, pruning, and quantization policy. (???)
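As a concrete reference for the sandwich rule mentioned above, a minimal PyTorch-style sketch of one such training step: the largest, the smallest, and a few random subnets are forwarded on the same batch, and their gradients are accumulated in the shared weights before a single optimizer step. The `sample_subnet` helper is hypothetical; BigNAS's actual training recipe includes further components.

```python
def sandwich_rule_step(supernet, optimizer, criterion, images, labels,
                       sample_subnet, num_random=2):
    """One sandwich-rule update (BigNAS-style): forward the largest subnet,
    the smallest subnet, and a few random subnets on the same batch,
    accumulate their gradients in the shared weights, then take a single
    optimizer step. `sample_subnet(supernet, kind)` is a hypothetical helper
    that configures and returns the requested subnet.
    """
    optimizer.zero_grad()
    for kind in ["max", "min"] + ["random"] * num_random:
        subnet = sample_subnet(supernet, kind)    # slice/configure shared weights
        loss = criterion(subnet(images), labels)  # hard-label loss
        loss.backward()                           # gradients accumulate across subnets
    optimizer.step()
```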

2. All-at-Once Quantization of Neural Networks

  • Alizadeh et al. (2020):
  • RobustQuant (2020): Uniformly distributed weight tensors are more tolerant to quantization (higher SNR, less sensitivity to the specific quantization setting) than normally distributed weights → proposes kurtosis regularization. RobustQuant
  • CoQuant (2021):
  • Any-Precision (2021): Trains with DoReFa, and the quantized model is stored in FP32. At runtime, the model can be set to different bit-widths by truncation (see the truncation sketch after this list). Any-Precision DNN
  • OQAT (2021):
  • Problem: The above approaches implicitly constrain the weight distribution via designed loss functions or knowledge transfer, and do not discuss how different bit-widths interact under weight-sharing quantization.
  • This work identifies the problem of vicious competition between high and low bit-widths in supernet training, caused by the hard label.
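To make the Any-Precision truncation idea above concrete, a minimal NumPy sketch (an assumption-level illustration, not the Any-Precision DNN implementation; the codes are shown as int8 here for simplicity, whereas the notes say the model is stored in FP32): the weights are quantized once at 8 bits, and a lower runtime bit-width is obtained by dropping least-significant bits and rescaling.

```python
import numpy as np

def truncate_bits(w_int8, scale8, target_bits):
    """Reduce 8-bit quantized weights to `target_bits` by dropping LSBs.

    w_int8:      signed integer codes in [-128, 127] (stored once, at 8 bits)
    scale8:      dequantization scale of the 8-bit representation
    target_bits: runtime bit-width in {2, ..., 8}
    """
    shift = 8 - target_bits
    w_low = w_int8 >> shift                        # arithmetic right shift drops LSBs
    scale_low = scale8 * (2 ** shift)              # step size grows as bits are removed
    return w_low.astype(np.float32) * scale_low    # dequantized low-bit weights

w8 = np.random.randint(-128, 128, size=(4, 4), dtype=np.int8)
w4 = truncate_bits(w8, scale8=0.01, target_bits=4)
```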

3. Approach

3.1 Multi-Bit Quantization Modeling

  • Formulation of multi-bit quantization
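The notes do not reproduce the formulation itself. For reference, a generic uniform fake-quantizer of the kind multi-bit schemes typically build on (the paper's exact quantizer, step-size handling, and candidate bit-width set may differ):

$$
\hat{w}_b = s_b \cdot \operatorname{clip}\!\left(\left\lfloor \frac{w}{s_b} \right\rceil,\; -2^{\,b-1},\; 2^{\,b-1}-1\right),
\qquad b \in \mathcal{B},
$$

where $w$ are the shared full-precision weights, $b$ is the active bit-width, $s_b$ is the step size (calibrated or learned per bit-width), and $\lfloor\cdot\rceil$ denotes rounding; the supernet reuses the same $w$ for every $b \in \mathcal{B}$.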

3.2 Training the Multi-Bit Quantization Supernet

1. Lowest-Random-Highest (LRH) Bit-Width Co-Training
2. Online Adaptive Label

  • OFA-like NAS vs. the MultiQuant supernet:
    • OFA-like NAS: the largest and smallest child models have different numbers of weights. -> Conflicts between subnets can be isolated by training the differentiated (non-shared) parameters. (?)
    • MultiQuant: weights are completely shared across all bit-width subnets. -> The supernet can only coordinate a single weight distribution to adapt to every bit-width configuration.
  • After gradient accumulation across bit-widths, the shared parameters adapt to the 2-bit quantization distribution, which in turn impairs 8-bit quantization accuracy.
    • Variance of the confidence scores: 8-bit (high, bad) > 2-bit (low, good)
    • Cross-entropy loss against the hard label: 8-bit (smaller) < 2-bit (larger), so the 2-bit gradients dominate the shared update
  • Proposal: Online adaptive label
    • Summary: Soft labels built from the statistics of the LRH quantization models' predictions are used to supervise the supernet (a sketch follows below).
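A schematic of how such an online adaptive label could be formed inside an LRH co-training step, assuming a hypothetical `supernet.set_bitwidth(b)` API: the lowest-, a random-, and the highest-bit subnets are forwarded, their detached predictions are averaged into a soft label, and each subnet is then supervised with a KL term against that label mixed with the hard label. The exact statistics and weighting MultiQuant uses are not recorded in these notes.

```python
import torch
import torch.nn.functional as F

def lrh_step_with_adaptive_label(supernet, optimizer, images, labels,
                                 bit_choices=(2, 4, 6, 8), alpha=0.5):
    """One LRH co-training step with an online adaptive (soft) label.

    `supernet.set_bitwidth(b)` is a hypothetical API that activates the b-bit
    subnet over the shared weights. The soft label is built from the detached
    L/R/H predictions; `alpha` mixes it with the hard label.
    """
    lowest, highest = min(bit_choices), max(bit_choices)
    random_bit = bit_choices[int(torch.randint(len(bit_choices), (1,)))]
    bits = [lowest, random_bit, highest]

    # Pass 1: collect detached predictions to form the online adaptive label.
    with torch.no_grad():
        probs = []
        for b in bits:
            supernet.set_bitwidth(b)
            probs.append(F.softmax(supernet(images), dim=1))
        soft_label = torch.stack(probs).mean(dim=0)

    # Pass 2: supervise each sampled bit-width with the mixed target,
    # accumulating gradients in the shared weights before one optimizer step.
    optimizer.zero_grad()
    for b in bits:
        supernet.set_bitwidth(b)
        logits = supernet(images)
        loss_hard = F.cross_entropy(logits, labels)
        loss_soft = F.kl_div(F.log_softmax(logits, dim=1), soft_label,
                             reduction="batchmean")
        ((1 - alpha) * loss_hard + alpha * loss_soft).backward()
    optimizer.step()
```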

3.3 Search Pareto Frontier for Mixed-Precision
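The notes leave 3.3 as a heading only. As a rough illustration of the idea from contribution 3, a sample-and-filter search that keeps the (cost, error) Pareto frontier over sampled mixed-precision configurations; `proposal` and `evaluate` are placeholders, since the paper's Monte Carlo proposal distribution and evaluation pipeline are not recorded here.

```python
import random

def sample_and_keep_pareto(num_layers, proposal, evaluate, num_samples=1000):
    """Sample mixed-precision configurations and keep the Pareto frontier.

    proposal(num_layers) -> list of per-layer bit-widths; swapping a uniform
    proposal for a better Monte Carlo one is where MultiQuant's efficiency
    gain would come from (details not in these notes).
    evaluate(config)     -> (cost, error), e.g. by slicing the trained supernet
    and measuring validation error.
    """
    points = []
    for _ in range(num_samples):
        config = proposal(num_layers)
        cost, error = evaluate(config)
        points.append((cost, error, config))

    # A point stays if no other sampled point is at least as good in both
    # cost and error while differing in (cost, error).
    frontier = [p for p in points
                if not any(q[0] <= p[0] and q[1] <= p[1]
                           and (q[0], q[1]) != (p[0], p[1]) for q in points)]
    return sorted(frontier, key=lambda p: p[0])

# Baseline: uniform random proposal over {2, 4, 6, 8} bits per layer.
uniform_proposal = lambda n: [random.choice((2, 4, 6, 8)) for _ in range(n)]
```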