DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
2024-01-12
Keywords: #Quantization
0. Abstract
- Proposal: A method to train NNs that have low bitwidth weights/activations using low bitwidth parameter gradients.
1. Introduction
- BNN, XNOR-Net: Both weights and activations of conv layers are binarized. → Computationally expensive convolutions can be done by bitwise operation kernels during forward pass.
- Dot product of bit vectors $x$ and $y$ using bitwise ops: $x \cdot y = \mathrm{bitcount}(\mathrm{and}(x, y))$, where $x_i, y_i \in \{0, 1\}$.
- When $x$ and $y$ are vectors of $\{-1, 1\}$, the dot product can likewise be computed with xnor and bitcount (sketch below).
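A minimal NumPy sketch of both bitwise dot products; the helper names `dot_binary_01` and `dot_pm1` are illustrative (not from the paper), and packing the bits into Python ints is just one possible encoding.

```python
import numpy as np

def dot_binary_01(x_bits: int, y_bits: int) -> int:
    """Dot product of two {0, 1} bit vectors packed into integers:
    x . y = bitcount(and(x, y))."""
    return bin(x_bits & y_bits).count("1")

def dot_pm1(x: np.ndarray, y: np.ndarray) -> int:
    """Dot product of two {-1, +1} vectors via xnor + bitcount.
    Encode +1 as bit 1 and -1 as bit 0: matching positions contribute +1,
    mismatches contribute -1, so x . y = 2 * matches - N."""
    n = len(x)
    xb = sum(1 << i for i, v in enumerate(x) if v == 1)
    yb = sum(1 << i for i, v in enumerate(y) if v == 1)
    mask = (1 << n) - 1                          # keep only the n valid bit positions
    matches = bin(~(xb ^ yb) & mask).count("1")  # xnor restricted to n bits
    return 2 * matches - n

x = np.array([1, -1, 1, 1])
y = np.array([1, 1, -1, 1])
assert dot_pm1(x, y) == int(np.dot(x, y))  # both give 0
```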
- No previous work has succeeded in quantizing gradients to numbers with bitwidth less than 8 during the backward pass.
- Contributions
- DoReFa-Net: Generalizes the method of binarized NNs so that bit convolution kernels can accelerate both the forward and backward passes of training.
- Bit convolutions can be implemented efficiently on CPU, FPGA, ASIC, and GPU, considerably reducing the energy consumption of low bitwidth NN training.
- Quantization sensitivity: gradients > activations > weights
2. DoReFa-Net
- A method to train NNs with low bitwidth (bw) weights/activations using low bw parameter gradients.
- While weights/activations can be deterministically quantized, gradients need to be stochastically quantized.
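A minimal sketch of the distinction, assuming the usual mapping of a value in $[0, 1]$ onto $2^k - 1$ uniform levels; the stochastic variant below uses generic unbiased stochastic rounding rather than the paper's exact gradient quantizer, and the function names are illustrative.

```python
import numpy as np

def quantize_deterministic(r, k):
    """k-bit quantization of r in [0, 1] by rounding to the nearest of 2^k - 1
    levels (the kind of deterministic quantization used for weights/activations)."""
    n = 2 ** k - 1
    return np.round(r * n) / n

def quantize_stochastic(r, k, rng=None):
    """k-bit quantization of r in [0, 1] with stochastic rounding: round up with
    probability equal to the fractional part, so E[quantize(r)] = r. This
    unbiasedness is why gradients are quantized stochastically."""
    rng = rng or np.random.default_rng()
    n = 2 ** k - 1
    scaled = r * n
    low = np.floor(scaled)
    q = low + (rng.random(np.shape(r)) < (scaled - low))
    return q / n

r = np.array([0.10, 0.37, 0.90])
print(quantize_deterministic(r, k=2))  # -> [0.0, 0.333..., 1.0]
print(quantize_stochastic(r, k=2))     # random draw from the two neighboring levels
```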
2.1 Using Bit Convolution Kernels in Low Bitwidth NN
- Computation complexity of the bit-kernel dot product between an $M$-bit and a $K$-bit fixed-point sequence: $O(MK)$, i.e., proportional to the product of the two bitwidths (see the sketch below).
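A sketch of the fixed-point decomposition behind the $O(MK)$ count: split each operand into bit planes and accumulate $2^{m+k}\,\mathrm{bitcount}(\mathrm{and}(c_m(x), c_k(y)))$ over the $M \times K$ plane pairs. The function names are illustrative, and NumPy stands in for the actual bitwise kernels.

```python
import numpy as np

def bit_planes(x, bits):
    """Decompose unsigned fixed-point integers into `bits` bit planes
    (plane m is the {0, 1} vector of the m-th bits of every element)."""
    return [(x >> m) & 1 for m in range(bits)]

def bit_kernel_dot(x, y, m_bits, k_bits):
    """Dot product of an M-bit sequence x and a K-bit sequence y using only
    and + bitcount on bit planes:
        x . y = sum_{m,k} 2^(m+k) * bitcount(and(c_m(x), c_k(y)))
    which runs M*K bitwise kernels, hence O(MK) complexity."""
    acc = 0
    for m, cx in enumerate(bit_planes(x, m_bits)):
        for k, cy in enumerate(bit_planes(y, k_bits)):
            acc += (1 << (m + k)) * int(np.sum(cx & cy))  # bitcount of the AND
    return acc

x = np.array([3, 1, 2, 0], dtype=np.uint8)  # 2-bit values
y = np.array([5, 7, 1, 6], dtype=np.uint8)  # 3-bit values
assert bit_kernel_dot(x, y, m_bits=2, k_bits=3) == int(np.dot(x.astype(int), y.astype(int)))
```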