MultiQuant: Training Once for Multi-Bit Quantization of Neural Networks
2024-08-06
1. Introduction
- Main problem: QAT simulates the quantization process during training, so the trained weights become highly dependent on the target bit-width (see the fake-quantization sketch after this list).
- Contributions:
- Propose the MultiQuant framework with Lowest-Random-Highest (LRH) Co-Training to support subnets of different bit-widths, under both uniform and mixed-precision quantization, without retraining.
- Identify the problem of vicious competition between high and low bit-widths in supernet training; design an online adaptive label to alleviate it.
- Propose Monte Carlo sampling instead of uniform sampling to improve the efficiency of the mixed-precision search.
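To make the bit-width dependence concrete, here is a minimal fake-quantization (quantize-dequantize) sketch of the kind used in QAT; the function name and the symmetric per-tensor scheme are my own simplifications, not the paper's exact quantizer, and real QAT pairs the rounding with a straight-through estimator for gradients:

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Simulate b-bit symmetric uniform quantization in the forward pass
    (quantize-dequantize), as done in standard QAT."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 127 for 8-bit, 1 for 2-bit
    scale = x.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale                              # dequantize back to float

# The simulated grid depends on `bits`: weights trained through an 8-bit
# fake quantizer settle into a distribution suited to the fine 8-bit grid
# and degrade when the same weights are later mapped onto a coarse 2-bit grid.
w = torch.randn(256, 256)
print(fake_quantize(w, 8).unique().numel(), fake_quantize(w, 2).unique().numel())
```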
2. Related Works
All-in-One Network Architecture Search
- OFA (2020): Progressive shrinking to reduce interference between different sub-nets in weight-sharing supernet training.
- BigNAS (2020): Sandwich rule to achieve single-stage model training. -> Inspiration for LRH Co-Training strategy.
- AttentiveNAS (2020): Attentive sampling of Pareto-best and Pareto-worst subnets during training to improve performance.
- Key point: Train a single over-parameterized supernet that can be directly sampled/sliced into different candidate sub-nets for instant inference/deployment.
- APQ (2020, mixed-precision search): Collects quantized (architecture, accuracy) data points to train a quantization-aware accuracy predictor, then uses an evolutionary (genetic) algorithm for the joint search over network architecture, pruning, and quantization.
All-in-One Quantization of Neural Networks
- Alizadeh et al. (2020):
- RobustQuant (2020): Uniformly distributed weight tensors are more tolerant to quantization (higher SNR, less sensitivity to the specific quantization settings) than normally distributed weights → proposes kurtosis regularization to push weights toward a uniform distribution (see the sketch at the end of this section).
- CoQuant (2021):
- Any-Precision DNN (2021): Trains with DoReFa-style quantization; the model is stored in FP32 and can be switched to different bit-widths at runtime by truncation.
- OQAT (2021):
- Problem: The above approaches implicitly constrain the weight distribution via designed loss functions or knowledge transfer, and do not discuss how different bit-widths interact in weight-sharing quantization.
- This work identifies the problem of vicious competition between high and low bit-widths in supernet training, caused by hard-label supervision.
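A minimal sketch of the kurtosis-regularization idea from RobustQuant (referenced above), assuming the commonly used form that penalizes each weight tensor's kurtosis distance to a uniform-like target; the target value 1.8 and the weight `lam` are assumptions here, not values taken from these notes:

```python
import torch

def kurtosis(w: torch.Tensor) -> torch.Tensor:
    """Sample kurtosis E[((w - mu) / sigma)^4] of a flattened weight tensor."""
    w = w.flatten()
    mu, sigma = w.mean(), w.std()
    return ((w - mu) / (sigma + 1e-8)).pow(4).mean()

def kurtosis_regularizer(weights, target: float = 1.8, lam: float = 1.0):
    """Push every weight tensor toward a uniform-like distribution
    (a uniform distribution has kurtosis ~1.8, a Gaussian ~3.0)."""
    return lam * sum((kurtosis(w) - target) ** 2 for w in weights)

# usage: total_loss = task_loss + kurtosis_regularizer(model.parameters())
```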
3. Approach
3.1 Multi-Bit Quantization Modeling
- Formulation of multi-bit quantization
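The notes do not reproduce the formulation, so below is a generic sketch of a b-bit symmetric uniform quantizer and a multi-bit training objective; the symbols (scale s, bit-width set B, loss L, data D) are my own notation and may differ from the paper's:

```latex
q_b(w) = s \cdot \operatorname{clip}\!\left(\left\lfloor \tfrac{w}{s} \right\rceil,\; -2^{b-1},\; 2^{b-1}-1\right),
\qquad
\min_{W} \;\sum_{b \in B} \mathcal{L}\big(q_b(W);\, \mathcal{D}\big)
```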
3.2 Training the Multi-Bit Quantization Supernet
1. Lowest-Random-Highest (LRH) Bit-Width Co-Training
2. Online Adaptive Label
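A hedged sketch of one LRH (lowest-random-highest) co-training step, in the spirit of BigNAS's sandwich rule: per iteration, gradients from the lowest bit-width, one or more randomly sampled bit-widths, and the highest bit-width are accumulated before a single optimizer step. The helper `fake_quantize_model` and the list `bit_candidates` are assumptions for illustration, not the paper's API:

```python
import random
import torch.nn.functional as F

bit_candidates = [2, 3, 4, 6, 8]   # assumed candidate bit-widths (sorted)

def lrh_step(model, fake_quantize_model, batch, optimizer, num_random=1):
    """One supernet update: lowest + random + highest bit-width subnets
    share the same weights and accumulate gradients jointly."""
    x, y = batch
    optimizer.zero_grad()
    sampled = [min(bit_candidates), max(bit_candidates)]
    sampled += random.sample(bit_candidates[1:-1], k=num_random)
    for bits in sampled:
        logits = fake_quantize_model(model, bits)(x)   # run the b-bit subnet
        F.cross_entropy(logits, y).backward()          # hard-label loss here; the
                                                       # online adaptive label below
                                                       # replaces this supervision
    optimizer.step()
```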
- OFA-like NAS vs the MultiQuant supernet:
- OFA-like NAS: the largest and smallest child models have different numbers of weights -> conflicts between subnets can be isolated by training the differentiated parameters.
- MultiQuant supernet: weights are completely shared by all bit-width subnets -> it can only coordinate the weight distribution to adapt to different bit-width configurations.
- After gradient accumulation, the shared parameters adapt to the 2-bit quantization distribution, which in turn impairs 8-bit quantization accuracy.
- Variance of the confidence scores: 8-bit (high, bad) > 2-bit (low, good).
- Hard-label cross-entropy loss: 8-bit (smaller) < 2-bit (larger), so the low-bit subnets contribute stronger gradients.
- Proposal: Online adaptive label
- Summary: Soft labels based on the statistics of the LRH quantization model prediction are used to supervise the supernet.
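The notes only state that the soft label is built from the statistics of the LRH subnets' predictions, so the construction below is an assumption for illustration (averaging the LRH softmax outputs and blending them with the one-hot label via a weight `alpha`), not the paper's exact formula:

```python
import torch
import torch.nn.functional as F

def online_adaptive_label(lrh_logits, target, num_classes, alpha=0.5):
    """Build a soft supervision target from the predictions of the
    lowest/random/highest bit-width subnets (assumed: their averaged
    softmax), blended with the one-hot ground-truth label."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(l, dim=-1) for l in lrh_logits]).mean(0)
    one_hot = F.one_hot(target, num_classes).float()
    return alpha * one_hot + (1 - alpha) * probs

def soft_ce(logits, soft_target):
    """Cross-entropy against the soft label, used in place of the hard label."""
    return -(soft_target * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```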