NIPQ: Noise proxy-based Integrated Pseudo-Quantization
2024-07-06
Keywords: #noise #activationdistribution
1. Introduction
- STE: The quantization operator is not differentiable, but the STE allows backpropagation through the quantized values by approximating the rounding operator as a linear (identity) function (see the first sketch after this list).
- STE-based QAT schemes can quantize networks down to 4 bits w/o accuracy loss.
- However, STE-based QAT backpropagates the approximated gradient, not the true gradient. → Incurs instability and bias during training.
- Pseudo-quantization training (PQT): based on pseudo-quantization noise (PQN)
- The behavior of the quantization operator is simulated by injecting PQN (see the second sketch after this list).
- Learnable parameters are updated through this differentiable proxy of quantization.
- Truncation contributes significantly to reducing quantization errors.
- Proposal
- NIPQ is the first PQT scheme that integrates truncation. → Not only reduces weight quantization error, but also enables PQT for activation quantization.
- NIPQ optimizes the network into mixed precision with awareness of the given resource constraint, w/o human intervention.
- Theoretical analysis showing that NIPQ updates parameters toward minimizing the quantization error.
- Experiments w/ NIPQ
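A minimal sketch of the STE mechanism described above (my own illustration, not code from the paper): the forward pass rounds onto a uniform grid within a range [0, alpha], while the backward pass treats the rounding as identity.

```python
import torch

def ste_fake_quant(x: torch.Tensor, alpha: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """STE-based fake quantization to 2**bits uniform levels in [0, alpha]."""
    n_levels = 2 ** bits - 1
    x_clip = torch.minimum(torch.relu(x), alpha)   # truncate to [0, alpha]
    x_int = x_clip / alpha * n_levels              # scale to the integer grid
    # round() has zero gradient almost everywhere; the detach trick makes the
    # backward pass see the identity function (the straight-through estimator).
    x_round = x_int + (torch.round(x_int) - x_int).detach()
    return x_round / n_levels * alpha
```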
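And a corresponding sketch of the PQN idea (again an assumed generic formulation, not the paper's code): the rounding error is modeled as additive uniform noise whose magnitude matches the quantization step, so the training graph stays fully differentiable and no gradient approximation is needed. At inference, the noise injection is replaced by actual rounding.

```python
import torch

def pqn_proxy(x: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Pseudo-quantization: emulate rounding with noise ~ U(-delta/2, delta/2)."""
    u = torch.rand_like(x) - 0.5   # noise sample, constant w.r.t. the graph
    return x + u * delta           # differentiable in both x and the step size delta
```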
4. Motivation
4.1 Limitation of STE-based Quantization
- Biggest problem of STE: Parameters never converge to the target value; instead they oscillate near the rounding boundary between two adjacent quantization levels (illustrated below).
- For more info: Overcoming Oscillations in Quantization-Aware Training
- Oscillation near the rounding boundary becomes the major source of large quantization errors.
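A toy illustration of the oscillation (my own example, not from the paper): a latent weight is trained toward a target of 0.30, which lies between the grid levels 0.25 and 0.50. Because the loss only sees the STE-quantized value, the weight hovers near the 0.375 rounding boundary and its quantized value keeps flipping between the two levels instead of converging.

```python
import torch

step = 0.25                     # quantization step of a toy 2-bit grid
target = 0.30                   # task-optimal value, between levels 0.25 and 0.50
w = torch.tensor(0.40, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.5)

for i in range(10):
    opt.zero_grad()
    q = w + (torch.round(w / step) * step - w).detach()   # STE quantizer
    loss = 0.5 * (q - target) ** 2                        # loss sees only q(w)
    loss.backward()
    opt.step()
    print(f"iter {i}: w={w.item():.3f}  q(w)={q.item():.2f}")
# q(w) flips between 0.25 and 0.50 as w drifts back and forth across the
# boundary, instead of settling at the level nearest the target.
```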
4.2 Pros and Cons of Previous PQN-based PQT
- Pro
- PQN-based PQT is expected to have lower quantization error after training. (How? Not very clear in the paper.)
- Con
- Existing studies do not provide a theoretical integration of truncation on top of a PQN-based PQT framework. (Again, how? Not very clear; a rough sketch of what such an integration could look like follows below.)
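For intuition only, a rough sketch of what integrating truncation into a PQN proxy could look like (an assumption about the general direction, not the paper's actual formulation): clip the input to a learnable range [0, alpha] before injecting noise whose magnitude is tied to the step size implied by alpha and the bit-width, so that both the truncation boundary and the step size receive gradients through a fully differentiable path.

```python
import torch

def truncated_pqn_proxy(x: torch.Tensor, alpha: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Hypothetical PQN proxy with truncation to a learnable range [0, alpha]."""
    delta = alpha / (2 ** bits - 1)                # step size implied by alpha and bit-width
    x_trunc = torch.minimum(torch.relu(x), alpha)  # truncation: clip to [0, alpha]
    u = torch.rand_like(x) - 0.5                   # PQN sample ~ U(-0.5, 0.5)
    return x_trunc + u * delta                     # noise emulates rounding within the range
```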