Personalized deep-learning models can power AI chatbots that adapt to understand a user’s accent, or smart keyboards that continuously update to better predict the next word based on someone’s typing history. This customization requires constant fine-tuning of a machine-learning model with new data.
Because smartphones and other edge devices lack the memory and computational power needed for this fine-tuning process, user data is typically uploaded to cloud servers where the model is updated. But data transmission uses a great deal of energy, and sending sensitive user data to a cloud server poses a security risk.
Researchers from MIT, the MIT-IBM Watson AI Lab and elsewhere have developed a technique that allows deep learning models to efficiently adapt to new sensor data directly on an edge device.
Their on-device training method, called PockEngine, determines which parts of a huge machine-learning model need to be updated to improve accuracy, and only stores and computes with those specific pieces. It performs the bulk of these computations while the model is being prepared, before runtime, which minimizes computational overhead and boosts the speed of the fine-tuning process.
Compared to other methods, PockEngine significantly sped up on-device training, running up to 15 times faster on some hardware platforms. Moreover, PockEngine didn’t cause models to have any drop in accuracy. The researchers also found that their fine-tuning method enabled a popular AI chatbot to answer complex questions more accurately.
“On-device fine-tuning can enable better privacy, lower costs, customization ability, and also lifelong learning, but it is not easy. Everything has to happen with a limited number of resources. We want to be able to run not only inference but also training on an edge device. With PockEngine, now we can,” says Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, a distinguished scientist at NVIDIA, and senior author of an open-access paper describing PockEngine.
Han is joined by lead author Ligeng Zhu, an EECS graduate student, as well as others at MIT, the MIT-IBM Watson AI Lab, and the University of California, San Diego. The work was recently presented at the IEEE/ACM International Symposium on Microarchitecture.
Layer upon layer
Deep-learning models are based on neural networks, which comprise many interconnected layers of nodes, or “neurons,” that process data to make a prediction. When the model is run, a process called inference, a data input (such as an image) is passed from layer to layer until the prediction (perhaps the image label) is output at the end. During inference, each layer no longer needs to be stored after it processes the input.
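To make this concrete, here is a minimal PyTorch sketch (an illustration with an invented toy model, not PockEngine’s code) of how inference only needs each layer’s output long enough to feed the next layer:

```python
import torch
import torch.nn as nn

# A small stack of layers standing in for a deep network.
layers = nn.ModuleList([nn.Linear(64, 64) for _ in range(8)])

@torch.no_grad()  # inference only: no gradients, so activations can be discarded
def infer(x):
    for layer in layers:
        x = torch.relu(layer(x))  # the previous layer's output is freed once replaced
    return x

prediction = infer(torch.randn(1, 64))
```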
But during training and fine-tuning, the model undergoes a process known as backpropagation. In backpropagation, the output is compared to the correct answer, and then the model is run in reverse. Each layer is updated as the model’s output gets closer to the correct answer.
Because each layer may need to be updated, the entire model and intermediate results must be stored, making fine-tuning more memory-intensive than inference.
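A minimal PyTorch training step (again an invented toy example, not the researchers’ code) shows the difference: the forward pass caches every layer’s activations for the backward pass, and every layer’s weights are updated:

```python
import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(8)])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, target = torch.randn(1, 64), torch.randn(1, 64)
output = model(x)                  # forward pass: autograd caches each layer's activations
loss = nn.functional.mse_loss(output, target)
loss.backward()                    # backward pass: runs through every layer in reverse
optimizer.step()                   # every layer's weights get updated
```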
However, not all layers in the neural network are important for improving accuracy. And even for layers that are important, the entire layer may not need to be updated. These layers, and pieces of layers, don’t need to be stored. Furthermore, one may not need to go all the way back to the first layer to improve accuracy; the process could stop somewhere in the middle.
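In standard frameworks, one generic way to exploit these observations, sketched below in PyTorch and not to be read as PockEngine’s actual mechanism, is to freeze early layers so that gradients are neither computed nor stored for them and backpropagation can stop partway through the network:

```python
import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(8)])

# Freeze the first six layers: no gradients or optimizer state are kept for them,
# so the backward pass can stop early instead of running back to the first layer.
for layer in model[:6]:
    for param in layer.parameters():
        param.requires_grad = False

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01
)
```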
PockEngine takes advantage of these factors to speed up the fine-tuning process and cut down on the amount of computation and memory required.
The system first fine-tunes each layer, one at a time, on a certain task and measures the accuracy improvement after each individual layer. In this way, PockEngine identifies the contribution of each layer, as well as trade-offs between accuracy and fine-tuning cost, and automatically determines the percentage of each layer that needs to be fine-tuned.
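That layer-by-layer scan might look something like the following Python sketch; `finetune_one_layer` and `evaluate` are hypothetical stand-ins for task-specific routines, not PockEngine’s API:

```python
import copy

def layer_sensitivity(model, layer_names, finetune_one_layer, evaluate):
    """Fine-tune one layer at a time and record each layer's accuracy gain.
    `finetune_one_layer` and `evaluate` are hypothetical task-specific callables."""
    baseline = evaluate(model)
    gains = {}
    for name in layer_names:
        trial = copy.deepcopy(model)                      # fresh copy per trial
        for pname, param in trial.named_parameters():
            param.requires_grad = pname.startswith(name)  # unfreeze just this layer
        finetune_one_layer(trial)                         # brief fine-tuning pass
        gains[name] = evaluate(trial) - baseline          # accuracy improvement
    return gains  # rank layers by gain versus fine-tuning cost
```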
“This method matches the accuracy of full backpropagation on different tasks and different neural networks,” Han adds.
A pared-down model
Conventionally, the backpropagation graph is generated during runtime, which involves a great deal of computation. Instead, PockEngine does this during compile time, while the model is being prepared for deployment.
PockEngine trims chunks of code to remove unnecessary layers or pieces of layers, creating a pared-down graph of the model to be used during runtime. It then performs other optimizations on this graph to further improve efficiency.
Since all this only needs to be done once, it saves on computational overhead at runtime.
“It is like before setting out on a hiking trip. At home, you would do careful planning: which trails are you going to take, which trails are you going to ignore. So at execution time, when you are actually hiking, you already have a very careful plan to follow,” Han explains.
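PyTorch’s ahead-of-time compiler offers a loose analogy (a generic `torch.compile` example on an invented toy model, not PockEngine’s own toolchain): the training step is traced and optimized once, and later calls reuse the prepared graph:

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 64)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

@torch.compile  # traced and optimized once, ahead of the training loop
def train_step(x, target):
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss

# At runtime, each call reuses the precompiled graph instead of rebuilding it.
loss = train_step(torch.randn(8, 64), torch.randn(8, 64))
```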
When they applied PockEngine to deep-learning models on different edge devices, including Apple M1 chips and the digital signal processors common in many smartphones and Raspberry Pi computers, it performed on-device training up to 15 times faster, without any drop in accuracy. PockEngine also significantly reduced the amount of memory required for fine-tuning.
The team also applied the technique to the large language model Llama-V2. With large language models, the fine-tuning process involves providing many examples, and it’s crucial for the model to learn how to interact with users, Han says. The process is also important for models tasked with solving complex problems or reasoning about solutions.
For instance, Llama-V2 models that were fine-tuned using PockEngine answered the question “What was Michael Jackson’s last album?” correctly, while models that weren’t fine-tuned failed. PockEngine cut the time it took for each iteration of the fine-tuning process from about seven seconds to less than one second on an NVIDIA Jetson Orin, an edge GPU platform.
In the future, the researchers want to use PockEngine to fine-tune even larger models designed to process text and images together.
“This work addresses the increasing performance challenges posed by the adoption of large AI models such as LLMs in various applications across many different industries. It holds promise not only for edge applications that incorporate larger models, but also for reducing the cost of maintaining and updating large AI models in the cloud,” says Ehry MacRostie, a senior director in Amazon’s AI division who was not involved in this study but collaborates with MIT on related AI research through the MIT-Amazon Science Hub.
This work was supported, in part, by the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT-Amazon Science Hub, the National Science Foundation (NSF), and the Qualcomm Innovation Fellowship.