Post training 4-bit quantization of convolutional networks for rapid-deployment
2024-03-24
Keywords: #Activation #Batch Normalization
0. Abstract
- NN quantization often requires the full dataset and time-consuming fine-tuning to recover the accuracy lost after quantization.
- We introduce a 4-bit post-training quantization approach.
- It does not involve training the quantized model (fine-tuning).
- It does not require the availability of the full dataset.
- We target both activation and weight quantization.
- We suggest three complementary methods for minimizing quantization error at the tensor level, two of which admit a closed-form analytical solution.
1. Introduction
- NN quantization usually involves some form of training, which is not always applicable in real-world scenarios since it requires the full-size dataset.
- Post-Training Quantization (PTQ): quantizing weights and activations after training, without the need to re-train/fine-tune the model, i.e., quantization with limited data.
- Problem: PTQ below 8 bits incurs significant accuracy degradation.
- Focus: CNN PTQ to 4 bits. In the absence of a training set, we aim to minimize the local error introduced during quantization (round-off errors).
- We exploit knowledge of the statistical characteristics of NN tensors, whose values tend to follow a bell-shaped distribution around the mean.
- Contributions: ACIQ for activation quantization, per-channel bit allocation, and bias correction for weight quantization; minimal code sketches of each appear after this list.
- Analytical Clipping for Integer Quantization (ACIQ): limit the range of activation values within the tensor. This reduces the rounding error in the part of the distribution containing most of the information. The method approximates the optimal clipping value analytically from the distribution of the tensor by minimizing the mean-squared-error measure. This analytical threshold is simple to use at run-time and can easily be integrated with other quantization techniques (a numerical sketch of the clipping objective is shown below).
- Per-channel bit allocation: given a constraint on the average per-channel bit-width, the goal is to allocate to each channel the bit-width representation that minimizes the overall mean squared error (a greedy sketch is shown below).
- Bias-correction: we observe an inherent bias in the mean and the variance of the weights following quantization, and suggest a simple method to compensate for it (a per-channel sketch is shown below).
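
The following is a minimal sketch of the clipping idea behind ACIQ. The paper derives the optimal clip value analytically from a Laplace/Gaussian fit of the tensor; this sketch instead searches over candidate clip values and keeps the one with the lowest empirical quantization MSE, which illustrates the objective being minimized. All function names here are illustrative, not from the paper's code.

```python
import numpy as np

def quantize_clipped(x, clip, n_bits=4):
    """Symmetric uniform quantization of x into [-clip, clip]."""
    levels = 2 ** n_bits - 1
    step = 2 * clip / levels
    x_c = np.clip(x, -clip, clip)
    return np.round(x_c / step) * step

def find_clip_by_mse(x, n_bits=4, n_candidates=100):
    """Pick the clip value that minimizes the empirical quantization MSE on tensor x."""
    max_abs = np.abs(x).max()
    candidates = np.linspace(max_abs / n_candidates, max_abs, n_candidates)
    mses = [np.mean((x - quantize_clipped(x, c, n_bits)) ** 2) for c in candidates]
    return candidates[int(np.argmin(mses))]

# Example: a bell-shaped (Laplace-like) activation tensor.
rng = np.random.default_rng(0)
acts = rng.laplace(loc=0.0, scale=1.0, size=100_000)
clip = find_clip_by_mse(acts, n_bits=4)
print(f"chosen clip: {clip:.2f}, tensor max: {np.abs(acts).max():.2f}")
```

The chosen clip is typically much smaller than the tensor's maximum value: clipping the long tail sacrifices a few outliers but shrinks the quantization step for the bulk of the distribution, lowering the overall MSE.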
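Next, a minimal sketch of per-channel bit allocation under an average-bit budget. The paper obtains a closed-form solution via Lagrange multipliers; this sketch instead allocates integer bits greedily under the standard model that the quantization MSE of channel i scales as range_i**2 / 4**B_i. The function name, the greedy strategy, and the bit caps are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

def allocate_bits(channel_ranges, avg_bits=4, min_bits=2, max_bits=8):
    """Greedily assign integer bit-widths so the total budget is avg_bits per channel."""
    n = len(channel_ranges)
    budget = avg_bits * n
    bits = np.full(n, min_bits, dtype=int)
    for _ in range(budget - bits.sum()):
        # Give the next bit to the channel whose MSE drops the most.
        mse = channel_ranges ** 2 / 4.0 ** bits
        gain = mse - channel_ranges ** 2 / 4.0 ** (bits + 1)
        gain[bits >= max_bits] = -np.inf  # respect the per-channel cap
        bits[int(np.argmax(gain))] += 1
    return bits

# Example: channels with very different dynamic ranges.
ranges = np.array([0.1, 0.5, 1.0, 4.0])
print(allocate_bits(ranges, avg_bits=4))  # channels with wider ranges receive more bits
```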
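Finally, a minimal sketch of the weight bias-correction idea: after quantization, the per-channel mean and variance of the weights shift, so each output channel of the quantized weights is re-centered and rescaled to match the original statistics. The quantizer and helper names are illustrative, not the paper's code.

```python
import numpy as np

def quantize_per_channel(w, n_bits=4):
    """Symmetric per-output-channel uniform quantization of a (C_out, ...) weight tensor."""
    flat = w.reshape(w.shape[0], -1)
    scale = np.abs(flat).max(axis=1, keepdims=True) / (2 ** (n_bits - 1) - 1)
    q = np.round(flat / scale) * scale
    return q.reshape(w.shape)

def bias_correct(w, w_q, eps=1e-8):
    """Match the per-channel mean and std of w_q to those of the original w."""
    f, f_q = w.reshape(w.shape[0], -1), w_q.reshape(w_q.shape[0], -1)
    mu, mu_q = f.mean(axis=1, keepdims=True), f_q.mean(axis=1, keepdims=True)
    std, std_q = f.std(axis=1, keepdims=True), f_q.std(axis=1, keepdims=True)
    corrected = (f_q - mu_q) * (std / (std_q + eps)) + mu
    return corrected.reshape(w.shape)

# Example: a conv weight tensor of shape (C_out, C_in, k, k).
rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(64, 32, 3, 3))
w_q = quantize_per_channel(w, n_bits=4)
w_qc = bias_correct(w, w_q)
print("mean shift before correction:", abs(w.mean() - w_q.mean()))
print("mean shift after correction: ", abs(w.mean() - w_qc.mean()))
```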