EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge
2024-07-06
Keywords: #noise #activationdistribution
0. Abstract
- The performance drop caused by quantization stems from information distortion in the quantized attention maps, demonstrated by the differing distributions of the quantized queries and keys in the self-attention mechanism.
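  A minimal sketch (not the paper's code) of how such information distortion could be quantified: compare the row-wise entropy of an FP16 attention map against one computed from fake-quantized queries and keys. The tensor shapes, the 4-bit setting, and the min-max fake quantizer are illustrative assumptions.

```python
import torch

def fake_quant(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor min-max fake quantization (quantize, then dequantize)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

def attention_entropy(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the softmax attention rows."""
    attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return -(attn * attn.clamp_min(1e-12).log()).sum(dim=-1).mean()

torch.manual_seed(0)
q = torch.randn(128, 64)   # toy queries (seq_len x head_dim)
k = torch.randn(128, 64)   # toy keys

print("full-precision attention entropy:", attention_entropy(q, k).item())
print("4-bit Q/K attention entropy     :", attention_entropy(fake_quant(q), fake_quant(k)).item())
```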
1. Introduction
- Nearly all QAT works focus on weight-only quantization, leaving the floating-point activations unquantized → cannot benefit from integer (INT) multipliers.
- The difficulty of activation quantization lies in the pronounced outliers in the activations.
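  A minimal sketch of why outliers make activation quantization hard, assuming a simple symmetric per-tensor min-max quantizer (not the paper's scheme): a single large outlier inflates the quantization scale, so the bulk of the values are crushed into a few levels and the overall quantization error grows sharply.

```python
import torch

def quant_mse(x: torch.Tensor, n_bits: int = 8) -> float:
    """Mean squared error of symmetric per-tensor min-max fake quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    x_q = torch.round(x / scale).clamp(-qmax - 1, qmax) * scale
    return torch.mean((x - x_q) ** 2).item()

torch.manual_seed(0)
act = torch.randn(4096)          # typical activation values
act_outlier = act.clone()
act_outlier[0] = 100.0           # one pronounced outlier

print("MSE without outlier:", quant_mse(act))
print("MSE with outlier   :", quant_mse(act_outlier))
```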
3. Analysis
3.1 Quantized Self-Attention Module
- Quantize both the weights and activations for different parts of the model: the MLP, the self-attention module, and part of the self-attention module (Q & K only).
- Quantizing the self-attention module leads to a significant accuracy loss (89.8% → 55.1%), with the quantization of Q & K alone (56.6%) being the main cause.
- The variances of the quantized Q & K distributions differ substantially from those of their FP16 counterparts.
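  A minimal sketch, assuming the Q/K activations have already been captured (e.g. via forward hooks) from an FP16 model and its quantized counterpart, of how that distribution gap could be summarized; the toy tensors below only stand in for real query/key activations.

```python
import torch

def distribution_gap(x_fp: torch.Tensor, x_q: torch.Tensor) -> dict:
    """Summarize how far the quantized model's activations drift from FP16."""
    return {
        "fp_var": x_fp.var().item(),
        "quant_var": x_q.var().item(),
        "var_ratio": (x_q.var() / x_fp.var()).item(),
        "mean_shift": (x_q.mean() - x_fp.mean()).abs().item(),
    }

torch.manual_seed(0)
q_fp16 = torch.randn(128, 64)                 # stand-in for FP16 queries
q_quant = 0.6 * torch.randn(128, 64) + 0.1    # stand-in for quantized-model queries

print(distribution_gap(q_fp16, q_quant))
```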
3.2 Token Importance
- Comparing the attention maps of the last layer in the FP16 and quantized models: the FP16 model shows a column pattern at the initial token, which disappears after quantization.
- A significant amount of attention is allocated to the initial token, which is vital for producing coherent and contextually meaningful text.
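  A minimal sketch of how the attention mass on the initial token could be measured; `attn` would normally be the last layer's attention map captured from the model (e.g. with `output_attentions=True` in Hugging Face Transformers), but here it is a toy causal attention map.

```python
import torch

def initial_token_attention(attn: torch.Tensor) -> float:
    """Average attention probability assigned to the first (initial) token."""
    return attn[..., 0].mean().item()

torch.manual_seed(0)
seq_len = 64
scores = torch.randn(seq_len, seq_len)                                    # toy attention logits
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

print("attention mass on the initial token:", initial_token_attention(attn))
```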