
EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge

2024-07-06


Keywords: #noise #activationdistribution


0. Abstract

  • The performance drop from quantization stems from information distortion in the quantized attention maps, demonstrated by the different distributions of the quantized query and key in the self-attention mechanism.

1. Introduction

  • Nearly all QAT works focus on weight-only quantization without quantizing the floating-point activations → cannot benefit from efficient INT multipliers on edge hardware.
  • The difficulty of activation quantization lies in the pronounced outliers in activations (illustrated by the sketch below).
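
A minimal sketch (not the paper's code) of why activation outliers make low-bit quantization hard: with a symmetric per-tensor quantizer, a single large outlier inflates the step size, so the many small-magnitude activations collapse toward zero after rounding. The bit-width and tensor sizes below are illustrative assumptions.

```python
import torch

def fake_quant(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax          # step size is set by the largest magnitude
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

torch.manual_seed(0)
act = torch.randn(4096) * 0.1             # typical small-magnitude activations
act_outlier = act.clone()
act_outlier[0] = 50.0                      # one pronounced outlier value

for name, x in [("no outlier", act), ("with outlier", act_outlier)]:
    err = (x - fake_quant(x)).pow(2).mean().item()
    print(f"{name:12s} 4-bit reconstruction MSE: {err:.6f}")

# With the outlier, the quantization scale grows by orders of magnitude,
# so most activations round to zero and the reconstruction error explodes.
```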

3. Analysis

3.1 Quantized Self-Attention Module

  • Quantize both weights and activations for different parts of the model: the MLP, the self-attention module, and part of the self-attention (Q & K only).
  • Quantizing the self-attention module leads to a significant accuracy loss (89.8% → 55.1%), with quantization of Q & K alone (56.6%) being the main cause.
  • The variance of the quantized Q & K distributions differs markedly from that of their FP16 counterparts (see the sketch below).
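
A minimal sketch (assumed layer shapes, not the paper's setup) of the kind of comparison in Sec. 3.1: apply a uniform fake quantizer to the query/key activations, then compare their statistics against FP16 and check how the distortion propagates into the attention map. Real-model activations with outliers show a much larger gap than this random-data toy.

```python
import torch

def fake_quant(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax() / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

torch.manual_seed(0)
hidden, seq = 768, 128                                   # assumed dimensions
x = torch.randn(seq, hidden)
w_q = torch.randn(hidden, hidden) / hidden**0.5
w_k = torch.randn(hidden, hidden) / hidden**0.5

q_fp, k_fp = x @ w_q, x @ w_k                            # FP queries/keys
q_q, k_q = fake_quant(q_fp), fake_quant(k_fp)            # quantized Q & K activations

print("var(Q) fp vs quant:", q_fp.var().item(), q_q.var().item())
print("var(K) fp vs quant:", k_fp.var().item(), k_q.var().item())

# Distortion in Q & K propagates into the attention map itself:
attn_fp = torch.softmax(q_fp @ k_fp.T / hidden**0.5, dim=-1)
attn_q  = torch.softmax(q_q @ k_q.T / hidden**0.5, dim=-1)
print("mean |delta attention|:", (attn_fp - attn_q).abs().mean().item())
```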

3.2 Token Importance

  • Comparing the attention maps at the last layer of the FP16 and quantized models: the FP16 model shows a column pattern at the initial token, which disappears after quantization (see the sketch after this list).
  • A significant amount of attention is allocated to the initial token, which is vital for producing text that is both coherent and contextually meaningful.
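
A minimal sketch of how one can inspect that "column pattern" at the initial token: average the last layer's attention weights over heads and look at the column for position 0. The model name (`gpt2`) and the prompt are assumptions for illustration only; the same inspection applies to the lightweight LLMs studied in the paper and to their quantized counterparts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                   # assumed small model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Quantization-aware training for edge devices", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

last = out.attentions[-1][0]                    # (heads, seq, seq): last layer, batch 0
col0 = last.mean(dim=0)[:, 0]                   # attention each token pays to the initial token
print("mean attention to the initial token:", col0.mean().item())

# In the FP16 model this column carries a disproportionate share of attention;
# the paper reports that naive W/A quantization makes the pattern vanish.
```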

4. Methodology

4.1 Preliminary