NIPQ: Noise proxy-based Integrated Pseudo-Quantization
2024-07-06
Keywords: #noise #activationdistribution
1. Introduction
- STE: The quantization operator is not differentiable, but the STE allows backpropagation through the quantized values by approximating the rounding operator as a linear (identity) function (see the first sketch after this list).
- STE-based QAT schemes can quantize networks down to 4 bits w/o accuracy loss.
- However, STE-based QAT backpropagates the approximated gradient, not the true gradient. → Incurs instability and bias during training.
- Pseudo-quantization training (PQT): based on pseudo-quantization noise (PQN)
- The behavior of the quantization operator is simulated by injecting PQN (see the second sketch after this list).
- Learnable parameters are updated through this differentiable proxy of quantization.
- Truncation contributes significantly to reducing quantization errors.
- Proposal
- NIPQ is the first PQT scheme that integrates truncation. → Not only reduces weight quantization error, but also enables PQT for activation quantization.
- NIPQ optimizes the network into mixed precision with awareness of the given resource constraint, w/o human intervention.
- Theoretical analysis showing that NIPQ updates parameters toward minimizing the quantization error.
- Experiments w/ NIPQ
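A minimal sketch of the STE mechanism described above (my own illustration, not code from the paper): the forward pass rounds onto a uniform grid within a range [0, alpha], while the backward pass treats the rounding as identity.

```python
import torch

def ste_fake_quant(x: torch.Tensor, alpha: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """STE-based fake quantization to 2**bits uniform levels in [0, alpha]."""
    n_levels = 2 ** bits - 1
    x_clip = torch.minimum(torch.relu(x), alpha)   # truncate to [0, alpha]
    x_int = x_clip / alpha * n_levels              # scale to the integer grid
    # round() has zero gradient almost everywhere; the detach trick makes the
    # backward pass see the identity function (the straight-through estimator).
    x_round = x_int + (torch.round(x_int) - x_int).detach()
    return x_round / n_levels * alpha
```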
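And a corresponding sketch of the PQN idea (again an assumed generic formulation, not the paper's code): the rounding error is modeled as additive uniform noise whose magnitude matches the quantization step, so the training graph stays fully differentiable and no gradient approximation is needed. At inference, the noise injection is replaced by actual rounding.

```python
import torch

def pqn_proxy(x: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Pseudo-quantization: emulate rounding with noise ~ U(-delta/2, delta/2)."""
    u = torch.rand_like(x) - 0.5   # noise sample, constant w.r.t. the graph
    return x + u * delta           # differentiable in both x and the step size delta
```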
4. Motivation
4.1 Limitation of STE-based Quantization
- Biggest problem of STE: Parameters never converge to the target value; instead they oscillate near the rounding boundary between two adjacent quantization levels (illustrated below).
- For more info: Overcoming Oscillations in Quantization-Aware Training
- Oscillation near the rounding boundary becomes the major source of large quantization errors.
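A toy illustration of the oscillation (my own example, not from the paper): a latent weight is trained toward a target of 0.30, which lies between the grid levels 0.25 and 0.50. Because the loss only sees the STE-quantized value, the weight hovers near the 0.375 rounding boundary and its quantized value keeps flipping between the two levels instead of converging.

```python
import torch

step = 0.25                     # quantization step of a toy 2-bit grid
target = 0.30                   # task-optimal value, between levels 0.25 and 0.50
w = torch.tensor(0.40, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.5)

for i in range(10):
    opt.zero_grad()
    q = w + (torch.round(w / step) * step - w).detach()   # STE quantizer
    loss = 0.5 * (q - target) ** 2                        # loss sees only q(w)
    loss.backward()
    opt.step()
    print(f"iter {i}: w={w.item():.3f}  q(w)={q.item():.2f}")
# q(w) flips between 0.25 and 0.50 as w drifts back and forth across the
# boundary, instead of settling at the level nearest the target.
```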
4.2 Pros and Cons of Previous PQN-based PQT
- Pro
- PQN-based PQT is expected to have lower quantization error after training. (How? Not very clear in the paper.)
- Con
- Existing studies do not provide a theoretical integration of truncation on top of a PQN-based PQT framework. (Again, how? Not very clear; a rough sketch of what such an integration could look like follows below.)
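For intuition only, a rough sketch of what integrating truncation into a PQN proxy could look like (an assumption about the general direction, not the paper's actual formulation): clip the input to a learnable range [0, alpha] before injecting noise whose magnitude is tied to the step size implied by alpha and the bit-width, so that both the truncation boundary and the step size receive gradients through a fully differentiable path.

```python
import torch

def truncated_pqn_proxy(x: torch.Tensor, alpha: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Hypothetical PQN proxy with truncation to a learnable range [0, alpha]."""
    delta = alpha / (2 ** bits - 1)                # step size implied by alpha and bit-width
    x_trunc = torch.minimum(torch.relu(x), alpha)  # truncation: clip to [0, alpha]
    u = torch.rand_like(x) - 0.5                   # PQN sample ~ U(-0.5, 0.5)
    return x_trunc + u * delta                     # noise emulates rounding within the range
```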