EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge
2024-07-06
Keywords: #noise #activationdistribution
0. Abstract
- The performance drop caused by quantization stems from information distortion in the quantized attention maps, demonstrated by the differing distributions of the quantized queries and keys in the self-attention mechanism.
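  A minimal sketch (not the paper's code) of how such information distortion could be quantified: compare the row-wise entropy of an FP16 attention map against one computed from fake-quantized queries and keys. The tensor shapes, the 4-bit setting, and the min-max fake quantizer are illustrative assumptions.

```python
import torch

def fake_quant(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor min-max fake quantization (quantize, then dequantize)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

def attention_entropy(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the softmax attention rows."""
    attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return -(attn * attn.clamp_min(1e-12).log()).sum(dim=-1).mean()

torch.manual_seed(0)
q = torch.randn(128, 64)   # toy queries (seq_len x head_dim)
k = torch.randn(128, 64)   # toy keys

print("full-precision attention entropy:", attention_entropy(q, k).item())
print("4-bit Q/K attention entropy     :", attention_entropy(fake_quant(q), fake_quant(k)).item())
```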
1. Introduction
- Nearly all QAT works focus on weight-only quantization, leaving the floating-point activations unquantized → cannot benefit from integer (INT) multipliers.
- The difficulty of activation quantization lies in the pronounced outliers in the activations.
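  A minimal sketch of why outliers make activation quantization hard, assuming a simple symmetric per-tensor min-max quantizer (not the paper's scheme): a single large outlier inflates the quantization scale, so the bulk of the values are crushed into a few levels and the overall quantization error grows sharply.

```python
import torch

def quant_mse(x: torch.Tensor, n_bits: int = 8) -> float:
    """Mean squared error of symmetric per-tensor min-max fake quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    x_q = torch.round(x / scale).clamp(-qmax - 1, qmax) * scale
    return torch.mean((x - x_q) ** 2).item()

torch.manual_seed(0)
act = torch.randn(4096)          # typical activation values
act_outlier = act.clone()
act_outlier[0] = 100.0           # one pronounced outlier

print("MSE without outlier:", quant_mse(act))
print("MSE with outlier   :", quant_mse(act_outlier))
```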
3. Analysis
3.1 Quantized Self-Attention Module
- Quantize both the weights and activations for different parts of the model: the MLP, the self-attention module, and part of the self-attention module (Q & K only).
- Quantizing the self-attention module leads to a significant accuracy loss (89.8% → 55.1%), with the quantization of Q & K alone (56.6%) being the main cause.
- The variances of the quantized Q & K distributions differ substantially from those of their FP16 counterparts.
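  A minimal sketch, assuming the Q/K activations have already been captured (e.g. via forward hooks) from an FP16 model and its quantized counterpart, of how that distribution gap could be summarized; the toy tensors below only stand in for real query/key activations.

```python
import torch

def distribution_gap(x_fp: torch.Tensor, x_q: torch.Tensor) -> dict:
    """Summarize how far the quantized model's activations drift from FP16."""
    return {
        "fp_var": x_fp.var().item(),
        "quant_var": x_q.var().item(),
        "var_ratio": (x_q.var() / x_fp.var()).item(),
        "mean_shift": (x_q.mean() - x_fp.mean()).abs().item(),
    }

torch.manual_seed(0)
q_fp16 = torch.randn(128, 64)                 # stand-in for FP16 queries
q_quant = 0.6 * torch.randn(128, 64) + 0.1    # stand-in for quantized-model queries

print(distribution_gap(q_fp16, q_quant))
```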
3.2 Token Importance
- Comparing the attention maps of the last layer in the FP16 and quantized models: the FP16 model shows a column pattern at the initial token, which disappears after quantization.
- A significant amount of attention is allocated to the initial token, which is vital for producing coherent and contextually meaningful text.
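  A minimal sketch of how the attention mass on the initial token could be measured; `attn` would normally be the last layer's attention map captured from the model (e.g. with `output_attentions=True` in Hugging Face Transformers), but here it is a toy causal attention map.

```python
import torch

def initial_token_attention(attn: torch.Tensor) -> float:
    """Average attention probability assigned to the first (initial) token."""
    return attn[..., 0].mean().item()

torch.manual_seed(0)
seq_len = 64
scores = torch.randn(seq_len, seq_len)                                    # toy attention logits
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

print("attention mass on the initial token:", initial_token_attention(attn))
```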