An autonomous vehicle must quickly and accurately identify the objects it encounters, from an idling delivery truck parked at a corner to a bicyclist turning into the intersection ahead.
To do this, the vehicle can use a powerful computer vision model to classify every pixel in a high-resolution image of the scene, so that it doesn’t overlook objects that might be obscured in a lower-quality image. But this task, known as semantic segmentation, is complex and requires a large amount of computation when the image resolution is high.
Researchers at MIT, the MIT-IBM Watson AI Lab, and elsewhere have developed more efficient computer vision models that greatly reduce the computational complexity of this task. Their model can accurately perform semantic segmentation in real time on devices with limited hardware resources, such as the on-board computers that enable autonomous vehicles to make split-second decisions.
Recent state-of-the-art semantic segmentation models directly learn the interactions between each pair of pixels in an image, so their computation grows quadratically as the image resolution increases. Because of this, although these models are accurate, they are too slow to process high-resolution images in real time on edge devices such as sensors or mobile phones.
MIT researchers have designed new building blocks for semantic segmentation models that achieve similar capabilities to these state-of-the-art models, but with only linear computational complexity and hardware-efficient operations.
The result is a new model series for high-resolution computer vision that performs nine times faster than previous models when deployed on mobile devices. Importantly, this new model series has demonstrated equal or greater accuracy than these alternatives.
Not only can this technique help autonomous vehicles make decisions in real time, but it can also improve the performance of other high-resolution computer vision tasks, such as medical image segmentation.
“While researchers have been using traditional vision transformers for a long time, and they produce amazing results, we want people to also pay attention to the efficiency of these models. Our work shows that a drastic reduction in computation is possible, so this real-time image segmentation can be done locally on the device,” says Song Han, associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and senior author of the paper describing the new model.
He is joined on the paper by lead author Han Cai, an EECS graduate student; Junyan Li, a graduate of Zhejiang University; Muyan Hu, an undergraduate student at Tsinghua University; and Chuang Gan, principal research staff member at the MIT-IBM Watson AI Lab. The research will be presented at the International Conference on Computer Vision.
A simple solution
Classifying each pixel in a high-resolution image containing millions of pixels is a daunting task for a machine-learning model. A powerful new type of model known as the Vision Transformer has recently been used to great effect.
Transformers were originally developed for natural language processing. In that context, they encode each word in a sentence as a token and then create an attention map, which captures the relationship of each token to all other tokens. This attention map helps the model understand context when making predictions.
Using the same concept, a Vision Transformer cuts the image into patches of pixels and encodes each small patch into a token before creating an attention map. In creating this attention map, the model uses a similarity function that directly learns the interaction between each pair of pixels. In this way, the model develops what is known as a global receptive field, meaning it can access all relevant parts of the image.
Because a high-resolution image can contain millions of pixels, divided into thousands of patches, the attention map quickly becomes enormous. As a result, the amount of computation grows quadratically as the image resolution increases.
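To make that scaling concrete, here is a minimal, illustrative sketch (not the authors' code) of standard softmax attention over N patch tokens; the N-by-N score matrix it builds is what makes the cost grow quadratically with resolution.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard (softmax) self-attention over N tokens.

    Q, K, V: (N, d) arrays of query, key, and value vectors, one per
    image patch. The N x N score matrix is what makes the cost grow
    quadratically with the number of patches.
    """
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)                    # (N, N) pairwise similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over each row
    return weights @ V                               # (N, d) attended outputs

# A 1024 x 1024 image cut into 16 x 16 patches yields N = 4096 tokens,
# so the score matrix alone already has roughly 16.8 million entries.
N, d = 4096, 64
Q, K, V = (np.random.randn(N, d).astype(np.float32) for _ in range(3))
print(softmax_attention(Q, K, V).shape)  # (4096, 64)
```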
In their new series of models, called EfficientViT, the MIT researchers used a simpler mechanism to build the attention map: replacing the nonlinear similarity function with a linear similarity function. Because of this, they can rearrange the order of operations to reduce the total computation without changing the model's functionality and without losing the global receptive field. With their model, the amount of computation required for a prediction grows linearly as the image resolution increases.
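The reordering that makes this possible relies on the associativity of matrix multiplication: with a linear similarity function, the keys and values can be aggregated into a small d-by-d summary before the queries are applied, so no N-by-N attention map is ever formed. The sketch below illustrates the general idea, assuming a simple ReLU feature map as the linear similarity; it is an illustrative assumption, not the exact EfficientViT implementation.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention via reordered matrix products.

    Instead of forming the N x N matrix (Q K^T) first, we exploit
    associativity and compute K^T V, a small d x d matrix, so the cost
    grows linearly with the number of tokens N. A ReLU feature map
    stands in here for the linear similarity function (an assumption
    for illustration).
    """
    Qf = np.maximum(Q, 0.0)                  # feature map applied to queries
    Kf = np.maximum(K, 0.0)                  # feature map applied to keys
    kv = Kf.T @ V                            # (d, d): key-value summary
    z = Kf.sum(axis=0)                       # (d,): normalizer over all tokens
    return (Qf @ kv) / (Qf @ z[:, None] + eps)   # (N, d), no N x N matrix

N, d = 4096, 64
Q, K, V = (np.random.randn(N, d).astype(np.float32) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (4096, 64)
```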
“But there is no free lunch. Linear attention only captures global context about the image, losing local information, which degrades accuracy,” says Han.
To compensate for that loss of accuracy, the researchers included two extra components in their model, each of which adds only a small amount of computation.
One of those components helps the model capture local feature interactions, mitigating the linear function's weakness at extracting local spatial information. The second, a module that enables multiscale learning, helps the model recognize both large and small objects.
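A simplified, hypothetical sketch of how two such components could be attached to a feature map is shown below; the module structure and kernel sizes are illustrative assumptions, not the published EfficientViT design. It pairs a depthwise convolution, which lets each location mix with its spatial neighbors to restore local information, with parallel branches of different kernel sizes for multiscale aggregation.

```python
import torch
import torch.nn as nn

class LocalAndMultiscaleTokens(nn.Module):
    """Illustrative sketch (not the published module design) of the two
    extra components described above: a depthwise convolution that
    restores local spatial interactions, and parallel small-kernel
    branches that aggregate features at multiple scales."""

    def __init__(self, channels):
        super().__init__()
        # Depthwise conv: each channel mixes only with its spatial neighbors,
        # adding local context at small computational cost.
        self.local = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=1, groups=channels)
        # Multiscale branches: aggregate features over different neighborhood sizes.
        self.scales = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2,
                      groups=channels)
            for k in (3, 5)
        ])
        self.fuse = nn.Conv2d(channels * 3, channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, H, W) feature map
        branches = [self.local(x)] + [scale(x) for scale in self.scales]
        return self.fuse(torch.cat(branches, dim=1))

x = torch.randn(1, 64, 32, 32)
print(LocalAndMultiscaleTokens(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```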
“The most important part here is that we need to carefully balance performance and efficiency,” says Cai.
They designed EfficientViT with a hardware-friendly architecture, so it can run easily on a wide variety of devices, such as virtual reality headsets or the edge computers on autonomous vehicles. Their model can also be applied to other computer vision tasks, such as image classification.
Streamlining semantic segmentation
When they tested their model on a dataset used for semantic segmentation, they found that it performed nine times faster than other popular Vision Transformer models on an Nvidia graphics processing unit (GPU), with the same or better accuracy.
“Now, we can have the best of both worlds and make it fast enough to run on mobile and cloud devices while reducing the computation,” Han says.
Building on these results, the researchers want to apply the technique to speed up generative machine-learning models, such as those used to generate new images, and to continue extending EfficientViT to other vision tasks.
“Efficient transformer models, pioneered by Professor Song Han’s team, now form the backbone of state-of-the-art techniques in a variety of computer vision tasks, including detection and segmentation,” says Lu Tian, senior director of AI algorithms at AMD, Inc., who was not involved with this paper. “Their research not only demonstrates the efficiency and capability of transformers, but also their enormous potential for real-world applications, such as enhancing image quality in video games.”
“Model compression and lightweight model design are important research topics for efficient AI computing, especially in the context of large foundation models. Professor Song Han’s group has demonstrated remarkable progress in compressing and accelerating modern deep learning models, particularly vision transformers,” adds Jay Jackson, global vice president of artificial intelligence and machine learning at Oracle, who was not involved in the research. “Oracle Cloud Infrastructure is supporting his team to advance this line of impactful research toward efficient and green AI.”