
What are dynamic neural networks? - Flexible bit precision models

2024-07-11


The term dynamic neural network appears many times in my portfolio, which is what led me to write this post. My research interests cannot be fully explained without a definition or, at the very least, an example of what a dynamic neural network entails. Dynamic neural networks aren’t well-known to the general public, since they are a relatively new concept, so there isn’t (to my knowledge) a concrete, agreed-upon definition of what they are.

This post will introduce dynamic neural networks through the lens of my own personal interests (which I shamelessly admit), but I hope it helps convey the gist of what I’m currently working on!


Why Dynamic Neural Networks?

Dynamic neural networks are NNs (neural networks) that adapt to their environment during runtime. But this can mean many things, so we first need to get our whys straight: why do we need dynamic NNs?

Most introductions or problem statements in NeurIPS, ICML, ICLR, and CVPR papers on dynamic NNs start with something like this:

… Deep neural networks come with a high demand of computational and memory resources…

Indeed, AI models nowadays are getting ridiculously large, with billions (and even trillions!) of parameters. This statement is generally followed by something like:

… In order to reduce these costs, past works study model compression techniques such as pruning and quantization, or search for a whole new model architecture…

This statement is usually tagged with seminal papers that pioneered the area of TinyML, such as the Deep Compression paper and Jacob et al.’s quantization paper. From here on, we focus on the specific case where quantization is used for model compression.

… Quantization incurs an error between the original and quantized values, leading to a degradation in performance. This has led to the study of post-training quantization (PTQ) and quantization-aware training (QAT), both of which aim to recover the accuracy drop in their own respective ways.

While PTQ and QAT each have their pros and cons, we will not delve deeply into them in this post (maybe in another?). However, it helps to know that for small networks, QAT (which simulates the effect of quantization during training on the full training set) is preferred because it retains better accuracy at low precisions. On the other hand, for large networks and transformers, PTQ (which calibrates the quantization parameters with a small subset of the training data) is preferred because of its relatively low cost compared to QAT, where the whole model has to be retrained. I’m currently studying transformer quantization methods, which makes PTQ very exciting as well, but for the sake of this post I will continue with QAT, which has comprised the bulk of my research so far. The true motivation behind dynamic NNs comes after:
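To make that distinction a bit more concrete, here is a minimal sketch of the fake quantization (quantize-and-dequantize) operation that QAT inserts into the forward pass. This is my own illustrative PyTorch snippet, not any particular paper’s implementation; the symmetric per-tensor scheme and the straight-through estimator are common choices, but by no means the only ones.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric, per-tensor uniform quantization (illustrative sketch)."""
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for INT8
    scale = x.abs().max() / qmax                # per-tensor scale
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    x_dq = x_int * scale                        # dequantize back to float
    # Straight-through estimator: quantized values in the forward pass,
    # but gradients flow as if quantization were the identity.
    return x + (x_dq - x).detach()
```

During QAT, weights (and usually activations) pass through an operation like this on every forward pass, so the network learns to tolerate the rounding error; PTQ instead only estimates the scales from a small calibration set, without retraining.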

… However, models trained via quantization-aware training (QAT) are generally dependent on their quantization settings and training methods, and are not robust during runtime …

That, in my opinion, is the punch line for dynamic NNs. Quantization settings in this context usually refer to the quantization granularity (bit precision: INT8, INT4, etc.) or the quantization symmetry (symmetric vs. asymmetric); a short code aside below shows how these translate into quantization parameters. To really get the point across, I then present the following example.
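As a quick aside (with helper names of my own choosing), here is how those two settings show up in the usual quantization parameters, a scale and a zero point:

```python
import torch

def quant_params(x: torch.Tensor, num_bits: int, symmetric: bool):
    """Compute (scale, zero_point) for uniform quantization; an illustrative sketch."""
    if symmetric:
        qmax = 2 ** (num_bits - 1) - 1          # e.g. [-128, 127] for INT8
        scale = x.abs().max() / qmax
        zero_point = 0                          # symmetric: real zero maps to integer zero
    else:
        qmin, qmax = 0, 2 ** num_bits - 1       # e.g. [0, 255] for INT8
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = int(torch.round(-x.min() / scale))   # shift so x.min() maps to qmin
    return scale, zero_point
```

Changing the bit precision changes the integer range (and hence the scale), while changing the symmetry changes whether a zero point is needed at all.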

Dynamic Neural Network example

Let’s say we want to deploy an AI model to our smartphone. Starting from the original FP32 model, quantizing to INT8 compresses it to about a quarter of its original size while simultaneously reducing computational cost (assuming computations are done in integer-only arithmetic). Quantizing to INT4 compresses it to about an eighth of its original size and reduces computation even further. The natural next thought is to deploy the model in both INT8 and INT4: when resources are plentiful we use the INT8 model for better accuracy, and when resources are scarce we use the INT4 model for better efficiency. We must remember that model performance and model cost (in terms of size and computation) are in a trade-off relationship.
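As a back-of-the-envelope check (the parameter count here is a made-up example), the compression ratios come straight from the bits stored per weight:

```python
# Rough weight-storage arithmetic with an illustrative 50M-parameter model.
params = 50_000_000

def size_mb(bits_per_weight: int) -> float:
    return params * bits_per_weight / 8 / 1e6   # bits -> bytes -> MB

print(f"FP32: {size_mb(32):.0f} MB")            # ~200 MB
print(f"INT8: {size_mb(8):.0f} MB")             # ~50 MB, about 4x smaller
print(f"INT4: {size_mb(4):.0f} MB")             # ~25 MB, about 8x smaller
```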

Unfortunately, QAT tunes the model to either INT8 or INT4 quantization; it does not tune for both (yet). This is why most models are not robust to changing quantization settings during runtime. Dynamic NNs, in this sense, are neural networks that are robust at runtime, allowing the model to maintain accuracy even when deployed in quantization settings other than the one it was trained for. For instance, with dynamic neural networks, we can train a single model, deploy it in INT8 and enjoy high accuracy, and at the same time switch to INT4 and enjoy efficiency, with little to no loss in performance.
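To show what “one model, many precisions” could look like in code, here is a minimal PyTorch sketch of a linear layer that keeps a single set of full-precision weights and fake-quantizes them to whatever bit-width is selected at runtime. The class and its set_precision method are illustrative names of my own; actual flexible bit precision methods differ in how they train and normalize across bit-widths.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableQuantLinear(nn.Module):
    """One set of FP32 weights, fake-quantized to the currently selected bit-width."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.num_bits = 8                           # default precision

    def set_precision(self, num_bits: int):
        self.num_bits = num_bits                    # e.g. 8 normally, 4 when resources are scarce

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.linear.weight
        qmax = 2 ** (self.num_bits - 1) - 1         # symmetric range, e.g. 127 for INT8
        scale = w.abs().max() / qmax
        w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        w_q = w + (w_q - w).detach()                # straight-through estimator keeps weights trainable
        return F.linear(x, w_q, self.linear.bias)

# Switch precision on the fly, with no retraining and no second copy of the model.
layer = SwitchableQuantLinear(128, 64)
layer.set_precision(4)
out = layer(torch.randn(1, 128))
```

Training such a layer so that every bit-width stays accurate is exactly the hard part (published methods often add tricks such as per-precision normalization statistics), but the shared-weight idea above is the gist.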

Dynamic Neural Networks outside quantization

But really, the definition and example shown above are only a small fraction of the large concept we call dynamic NNs. To be exact, the example above goes by the name of flexible bit precision models: NNs that change their quantization granularity at runtime. Outside the quantization domain, dynamic NNs can take other shapes, such as networks that adapt to different datasets at runtime with little to no loss, or networks that seamlessly switch between different tasks at runtime. Whatever the domain, dynamic NNs need to satisfy two requirements:

  1. They are robust to changes in their runtime environment (hardware availability, datasets, tasks).
  2. Their implementation is resource-efficient.