MultiQuant: Training Once for Multi-Bit Quantization of Neural Networks
2024-08-06
1. Introduction
- Main problem: QAT simulates the quantization process during training, so the trained weights become highly dependent on the target bit-width (see the fake-quantization sketch after this list).
- Contributions:
- Propose the MultiQuant framework with Lowest-Random-Highest (LRH) Co-Training to support subnets of different bit-widths, under both uniform and mixed-precision quantization, without retraining.
- Identify the problem of vicious competition between high and low bit-widths in supernet training; design an online adaptive label to alleviate it.
- Propose Monte Carlo sampling instead of uniform sampling to improve the efficiency of the mixed-precision search.
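To make the bit-width dependence concrete, here is a minimal fake-quantization (quantize-dequantize) sketch of the kind used in QAT; the function name and the symmetric per-tensor scheme are my own simplifications, not the paper's exact quantizer, and real QAT pairs the rounding with a straight-through estimator for gradients:

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Simulate b-bit symmetric uniform quantization in the forward pass
    (quantize-dequantize), as done in standard QAT."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 127 for 8-bit, 1 for 2-bit
    scale = x.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale                              # dequantize back to float

# The simulated grid depends on `bits`: weights trained through an 8-bit
# fake quantizer settle into a distribution suited to the fine 8-bit grid
# and degrade when the same weights are later mapped onto a coarse 2-bit grid.
w = torch.randn(256, 256)
print(fake_quantize(w, 8).unique().numel(), fake_quantize(w, 2).unique().numel())
```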
2. Related Works
All-in-One Network Architecture Search
- OFA (2020): Progressive shrinking to reduce interference between different sub-nets in weight-sharing supernet training.
- BigNAS (2020): Sandwich rule to achieve single-stage model training. -> Inspiration for LRH Co-Training strategy.
- AttentiveNAS (2020): Attentive sampling of Pareto-best and Pareto-worst subnets during training to improve performance.
- Key point: Train a single over-parameterized supernet that can be directly sampled/sliced into different candidate sub-nets for instant inference/deployment.
- APQ (2020, mixed-precision search): Collects quantized (architecture, accuracy) data points to train a quantization-aware accuracy predictor, then uses an evolutionary (genetic) algorithm for the joint search over network architecture, pruning, and quantization.
All-in-One Quantization of Neural Networks
- Alizadeh et al. (2020):
- RobustQuant (2020): Uniformly distributed weight tensors are more tolerant to quantization (higher SNR, less sensitivity to the specific quantization settings) than normally distributed weights → proposes kurtosis regularization to push weights toward a uniform distribution (see the sketch at the end of this section).
- CoQuant (2021):
- Any-Precision DNN (2021): Trains with DoReFa-style quantization; the model is stored in FP32 and can be switched to different bit-widths at runtime by truncation.
- OQAT (2021):
- Problem: The above approaches implicitly constrain the weight distribution via designed loss functions or knowledge transfer, and do not discuss how different bit-widths interact in weight-sharing quantization.
- This work identifies the problem of vicious competition between high and low bit-widths in supernet training, caused by hard-label supervision.
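A minimal sketch of the kurtosis-regularization idea from RobustQuant (referenced above), assuming the commonly used form that penalizes each weight tensor's kurtosis distance to a uniform-like target; the target value 1.8 and the weight `lam` are assumptions here, not values taken from these notes:

```python
import torch

def kurtosis(w: torch.Tensor) -> torch.Tensor:
    """Sample kurtosis E[((w - mu) / sigma)^4] of a flattened weight tensor."""
    w = w.flatten()
    mu, sigma = w.mean(), w.std()
    return ((w - mu) / (sigma + 1e-8)).pow(4).mean()

def kurtosis_regularizer(weights, target: float = 1.8, lam: float = 1.0):
    """Push every weight tensor toward a uniform-like distribution
    (a uniform distribution has kurtosis ~1.8, a Gaussian ~3.0)."""
    return lam * sum((kurtosis(w) - target) ** 2 for w in weights)

# usage: total_loss = task_loss + kurtosis_regularizer(model.parameters())
```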
3. Approach
3.1 Multi-Bit Quantization Modeling
- Formulation of multi-bit quantization
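The notes do not reproduce the formulation, so below is a generic sketch of a b-bit symmetric uniform quantizer and a multi-bit training objective; the symbols (scale s, bit-width set B, loss L, data D) are my own notation and may differ from the paper's:

```latex
q_b(w) = s \cdot \operatorname{clip}\!\left(\left\lfloor \tfrac{w}{s} \right\rceil,\; -2^{b-1},\; 2^{b-1}-1\right),
\qquad
\min_{W} \;\sum_{b \in B} \mathcal{L}\big(q_b(W);\, \mathcal{D}\big)
```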
3.2 Training the Multi-Bit Quantization Supernet
1. Lowest-Random-Highest (LRH) Bit-Width Co-Training
2. Online Adaptive Label
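A hedged sketch of one LRH (lowest-random-highest) co-training step, in the spirit of BigNAS's sandwich rule: per iteration, gradients from the lowest bit-width, one or more randomly sampled bit-widths, and the highest bit-width are accumulated before a single optimizer step. The helper `fake_quantize_model` and the list `bit_candidates` are assumptions for illustration, not the paper's API:

```python
import random
import torch.nn.functional as F

bit_candidates = [2, 3, 4, 6, 8]   # assumed candidate bit-widths (sorted)

def lrh_step(model, fake_quantize_model, batch, optimizer, num_random=1):
    """One supernet update: lowest + random + highest bit-width subnets
    share the same weights and accumulate gradients jointly."""
    x, y = batch
    optimizer.zero_grad()
    sampled = [min(bit_candidates), max(bit_candidates)]
    sampled += random.sample(bit_candidates[1:-1], k=num_random)
    for bits in sampled:
        logits = fake_quantize_model(model, bits)(x)   # run the b-bit subnet
        F.cross_entropy(logits, y).backward()          # hard-label loss here; the
                                                       # online adaptive label below
                                                       # replaces this supervision
    optimizer.step()
```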
- OFA-like NAS vs the MultiQuant supernet:
- OFA-like NAS: the largest and smallest child models have different numbers of weights -> conflicts between subnets can be isolated by training the differentiated parameters.
- MultiQuant supernet: weights are completely shared by all bit-width subnets -> it can only coordinate the weight distribution to adapt to different bit-width configurations.
- After gradient accumulation, the shared parameters adapt to the 2-bit quantization distribution, which in turn impairs 8-bit quantization accuracy.
- Variance of the confidence scores: 8-bit (high, bad) > 2-bit (low, good).
- Hard-label cross-entropy loss: 8-bit (smaller) < 2-bit (larger), so the low-bit subnets contribute stronger gradients.
- Proposal: Online adaptive label
- Summary: Soft labels based on the statistics of the LRH quantization model prediction are used to supervise the supernet.
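The notes only state that the soft label is built from the statistics of the LRH subnets' predictions, so the construction below is an assumption for illustration (averaging the LRH softmax outputs and blending them with the one-hot label via a weight `alpha`), not the paper's exact formula:

```python
import torch
import torch.nn.functional as F

def online_adaptive_label(lrh_logits, target, num_classes, alpha=0.5):
    """Build a soft supervision target from the predictions of the
    lowest/random/highest bit-width subnets (assumed: their averaged
    softmax), blended with the one-hot ground-truth label."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(l, dim=-1) for l in lrh_logits]).mean(0)
    one_hot = F.one_hot(target, num_classes).float()
    return alpha * one_hot + (1 - alpha) * probs

def soft_ce(logits, soft_target):
    """Cross-entropy against the soft label, used in place of the hard label."""
    return -(soft_target * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```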