
On-Device Training Under 256KB Memory

2024-01-20


Keywords: #Quantization


0. Abstract

  • On-device training faces two unique challenges:
    1. The quantized graphs of neural networks are hard to optimize due to low bit-precision and the lack of normalization
    2. The limited hardware resources do not allow full back-propagation

1. Introduction

  • On-device training allows us to adapt the pre-trained model to newly collected sensory data after deployment.
  • Issue:
    1. Memory constraint (limited SRAM size)
    2. MCUs are bare-metal and do not have an OS
  • Proposal:
    1. Quantization-Aware Scaling
    2. Sparse Update
    3. Tiny Training Engine
  • Contribution:
    1. Our solution enables weight updates not only for the classifier, but also for the backbone.
    2. Our system-algorithm co-design scheme reduces the memory footprint.
    3. Our framework greatly accelerates training.

2. Approach

Preliminaries

  • To preserve memory efficiency, we update the real quantized graph and keep the updated weights as $\text{int8}$
  • The gradient computation is also performed in $\text{int8}$ for better computation efficiency.
  • Real quantized graph vs. fake quantized graph: the fake quantized graph uses $\text{fp32}$, leading to no memory or computation savings. → Real quantized graphs are for efficiency, while fake quantized graphs are for simulation.
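
For intuition, below is a minimal NumPy sketch (not the paper's implementation; all names are hypothetical) contrasting the two graphs: fake quantization only simulates $\text{int8}$ rounding on $\text{fp32}$ tensors, while the real quantized graph stores $\text{int8}$ weights and activations, accumulates the matmul in $\text{int32}$, and keeps only the per-tensor scales in $\text{fp32}$.

```python
import numpy as np

def fake_quant(t_fp32, scale):
    # Fake quantization: simulate int8 rounding, but the tensor stays fp32,
    # so there are no memory or compute savings (simulation only).
    return np.clip(np.round(t_fp32 / scale), -128, 127) * scale

def real_quant_linear(x_int8, w_int8, s_x, s_w):
    # Real quantized graph: int8 storage, int32 accumulation,
    # only the per-tensor scales stay in fp32.
    acc_int32 = x_int8.astype(np.int32) @ w_int8.astype(np.int32).T
    return acc_int32 * (s_x * s_w)  # dequantize the int32 accumulator

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8)).astype(np.float32)
x = rng.normal(size=(4, 8)).astype(np.float32)
s_w = np.abs(w).max() / 127.0
s_x = np.abs(x).max() / 127.0
w_int8 = np.clip(np.round(w / s_w), -128, 127).astype(np.int8)
x_int8 = np.clip(np.round(x / s_x), -128, 127).astype(np.int8)

y_fake = fake_quant(x, s_x) @ fake_quant(w, s_w).T    # fp32 path (simulation)
y_real = real_quant_linear(x_int8, w_int8, s_x, s_w)  # int8/int32 path
print(np.max(np.abs(y_fake - y_real)))  # identical up to fp32 rounding
```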

2.1 Optimizing Real Quantized Graphs

Training with a real quantized graph is difficult: the quantized graph has tensors of different bit-precisions ($\text{int8, int32, fp32}$) and lacks Batch Normalization layers → unstable gradient updates
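
A toy NumPy sketch (hypothetical numbers, not the paper's experiment) makes the instability concrete: each layer stores its weight as $W / s_W$ with its own per-tensor scale $s_W$, and by the chain rule the gradient of that stored tensor gains a factor of $s_W$, so the weight-to-gradient norm ratio is distorted by roughly $s_W^{-2}$ relative to the $\text{fp32}$ graph, and the distortion differs from layer to layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantized_ratio(w_fp32, grad_fp32, scale):
    # The real quantized graph stores w_fp32 / scale; by the chain rule the
    # gradient w.r.t. that stored tensor gains a factor of `scale`, so the
    # weight-to-gradient norm ratio is off by ~scale**-2 versus fp32.
    w_q = w_fp32 / scale
    g_q = grad_fp32 * scale
    return np.linalg.norm(w_q) / np.linalg.norm(g_q)

w = rng.normal(size=(64, 64)).astype(np.float32)
g = 1e-2 * rng.normal(size=(64, 64)).astype(np.float32)  # stand-in gradient
fp32_ratio = np.linalg.norm(w) / np.linalg.norm(g)

for s in (0.005, 0.02, 0.08):  # per-tensor scales differ across layers
    print(f"scale={s:<6} fp32 ratio={fp32_ratio:8.1f}  "
          f"quantized ratio={quantized_ratio(w, g, s):11.1f}")
```

A single learning rate tuned for the fp32 ratios therefore fits none of the layers well, which is the mismatch discussed below.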

Gradient scale mismatch.