On-Device Training Under 256KB Memory
2024-01-20
Keywords: #Quantization
0. Abstract
- On-device training faces two unique challenges:
- The quantized graphs of neural networks are hard to optimize due to the low bit-precision and the lack of normalization layers
- The limited hardware resources do not allow full back-propagation
1. Introduction
- On-device training allows us to adapt the pre-trained model to newly collected sensory data after deployment.
- Issue:
- Memory constraint (limited SRAM size)
- MCUs are bare-metal devices without an operating system
- Proposal:
- Quantization-Aware Scaling
- Sparse Update (a minimal sketch closes this section)
- Tiny Training Engine
- Contribution:
- Our solution enables weight update not only for the classifier, but also for the backbone.
- Our system-algorithm co-design scheme reduces the memory footprint.
- Our framework greatly accelerates training.
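To make Sparse Update concrete, here is a minimal PyTorch sketch of selectively unfreezing tensors. The `mobilenet_v2` backbone, the 10-class head, and the layer indices are illustrative assumptions; the actual system executes a pre-selected scheme on a compiled int8 graph inside the Tiny Training Engine rather than through `requires_grad`.

```python
from torchvision.models import mobilenet_v2  # stand-in backbone, not the paper's exact model

model = mobilenet_v2(num_classes=10)  # hypothetical 10-class downstream task

# 1) Freeze every tensor by default (no gradients, no optimizer state).
for p in model.parameters():
    p.requires_grad = False

# 2) Always update the classifier head.
for p in model.classifier.parameters():
    p.requires_grad = True

# 3) Additionally update a small, pre-selected subset of backbone layers;
#    the indices here are placeholders, not the paper's selection.
LAYERS_TO_UPDATE = [15, 16, 17]
for idx in LAYERS_TO_UPDATE:
    for p in model.features[idx].parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable}/{total}")
```

In the paper's setting, the layers and tensors to update are chosen offline under the memory budget, so the device only runs a fixed sparse-update scheme.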
2. Approach
Preliminaries
- To preserve memory efficiency, we update the real quantized graph and keep the updated weights in $\text{int8}$
- The gradient computation is also performed in $\text{int8}$ for better computation efficiency.
- Real quantized graph vs. fake quantized graph: the fake quantized graph keeps all tensors in $\text{fp32}$ (quantize-dequantize), so it brings no memory or computation savings → real quantized graphs are for efficient deployment, while fake quantized graphs only simulate quantization.
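The contrast can be illustrated with a short NumPy sketch, assuming symmetric per-tensor int8 quantization and made-up scale names `s_w`, `s_x`: the fake quantized graph keeps quantize-dequantized $\text{fp32}$ tensors, while the real quantized graph stores $\text{int8}$ tensors and accumulates in $\text{int32}$.

```python
import numpy as np

rng = np.random.default_rng(0)
W_fp32 = rng.standard_normal((4, 8)).astype(np.float32)
x_fp32 = rng.standard_normal(8).astype(np.float32)

s_w = np.abs(W_fp32).max() / 127.0   # per-tensor weight scale
s_x = np.abs(x_fp32).max() / 127.0   # activation scale

# Fake quantized graph: quantize-dequantize, but storage and compute stay fp32.
W_fake = np.round(W_fp32 / s_w).clip(-127, 127) * s_w
x_fake = np.round(x_fp32 / s_x).clip(-127, 127) * s_x
y_fake = W_fake @ x_fake                                        # fp32 matmul

# Real quantized graph: int8 storage, integer matmul, one rescale at the end.
W_int8 = np.round(W_fp32 / s_w).clip(-127, 127).astype(np.int8)
x_int8 = np.round(x_fp32 / s_x).clip(-127, 127).astype(np.int8)
acc_int32 = W_int8.astype(np.int32) @ x_int8.astype(np.int32)   # int32 accumulator
y_real = acc_int32 * (s_w * s_x)                                # scale back to real values

print(np.abs(y_fake - y_real).max())  # ~0: same math, but int8/int32 storage and compute
```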
2.1 Optimizing Real Quantized Graphs
Training with a real quantized graph is difficult: the quantized graph mixes tensors of different bit-precisions ($\text{int8}$, $\text{int32}$, $\text{fp32}$) and lacks Batch Normalization layers → unstable gradient updates.
Gradient scale mismatch: quantization rescales each weight tensor by its scale, so the weight-to-gradient norm ratio $\|W\| / \|G\|$ of the quantized graph deviates from that of the $\text{fp32}$ graph, and a single learning rate no longer fits all tensors → Quantization-Aware Scaling compensates for the mismatch (see the sketch below).
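A NumPy sketch of the mismatch and of the scaling idea behind Quantization-Aware Scaling, assuming a linear layer with symmetric per-tensor quantization $\bar{W} \approx W / s_W$; the shapes and variable names are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.05 * rng.standard_normal((16, 16)).astype(np.float32)   # fp32 weight
x = rng.standard_normal(16).astype(np.float32)
g_y = rng.standard_normal(16).astype(np.float32)              # upstream gradient dL/dy

s_w = np.abs(W).max() / 127.0
W_bar = np.round(W / s_w)               # quantized weight (integer-valued)

# Gradients of a linear layer y = W @ x.  Because W = s_w * W_bar (up to rounding),
# the chain rule gives dL/dW_bar = s_w * dL/dW.
G_W = np.outer(g_y, x)                  # dL/dW
G_W_bar = s_w * G_W                     # dL/dW_bar

# Mismatch: W_bar is ~1/s_w larger than W while its gradient is s_w larger,
# so the norm ratio ||W|| / ||G|| is off by a factor of s_w**2.
ratio_fp32 = np.linalg.norm(W) / np.linalg.norm(G_W)
ratio_int8 = np.linalg.norm(W_bar) / np.linalg.norm(G_W_bar)
print(ratio_int8 / ratio_fp32)          # ≈ 1 / s_w**2, i.e. far from 1

# QAS idea: multiply the quantized-graph gradient by s_w**-2 so the same
# learning rate produces the update the fp32 graph would have made.
G_W_bar_qas = G_W_bar / s_w**2
print(np.linalg.norm(W_bar) / np.linalg.norm(G_W_bar_qas) / ratio_fp32)  # ≈ 1
```

Under the common convention that the bias is quantized with scale $s_W \cdot s_x$, the same derivation scales the bias gradient by $(s_W \cdot s_x)^{-2}$.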