In a major development, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have announced a framework that can handle both image recognition and image generation tasks with high accuracy. Officially dubbed Masked Generative Encoder, or MAGE, the unified computer vision system promises wide-ranging applications and can cut down on the overhead of training two separate systems for identifying images and generating fresh ones.
The news comes at a time when enterprises are going all-in on AI, particularly generative technologies, for improving workflows. However, as the researchers explain, the MIT system still has some flaws and will need to be perfected in the coming months if it is to see adoption.
The team told VentureBeat that they also plan to expand the model’s capabilities.
So, how does MAGE work?
Today, building image generation and recognition systems largely revolves around two processes: state-of-the-art generative modeling and self-supervised representation learning. In the former, the system learns to produce high-dimensional data from low-dimensional inputs such as class labels, text embeddings or random noise. In the latter, a high-dimensional image is used as an input to create a low-dimensional embedding for feature detection or classification.
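As a rough illustration of the two directions, the sketch below shows only the shapes involved; the matrices are random stand-ins, not trained models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generative modeling: a low-dimensional input (class label embedding,
# text embedding or noise) is mapped up to a high-dimensional image.
latent = rng.standard_normal(16)                # low-dim input
G = rng.standard_normal((16, 64 * 64))          # stand-in "generator"
image = latent @ G                              # high-dim output (4096,)

# Self-supervised representation learning: a high-dimensional image is
# mapped down to a compact embedding used for detection/classification.
E = rng.standard_normal((64 * 64, 16))          # stand-in "encoder"
embedding = image @ E                           # low-dim output (16,)
```

The point is simply that the two pipelines run in opposite directions between the same two kinds of spaces, which is what makes unifying them plausible.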
These two techniques, currently used independently of each other, both require a visual and semantic understanding of data. So the team at MIT decided to bring them together in a unified architecture. MAGE is the result.
To develop the system, the group used a pre-training approach called masked token modeling. They converted sections of image data into abstracted versions represented by semantic tokens. Each of these tokens represented a 16×16-pixel patch of the original image, acting like a mini jigsaw puzzle piece.
Once the tokens were ready, some of them were randomly masked and a neural network was trained to predict the hidden ones by gathering the context from the surrounding tokens. That way, the system learned to understand the patterns in an image (image recognition) as well as generate new ones (image generation).
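The masking step described above can be sketched with a toy example. The grid size and codebook size here are assumptions for illustration; MAGE's actual tokenizer is a learned neural network, not shown:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: an image tokenized into a 16x16 grid of semantic token ids,
# drawn from a hypothetical codebook of 1,024 entries.
num_tokens = 16 * 16
tokens = rng.integers(0, 1024, size=num_tokens)

mask_ratio = 0.75                       # fraction of tokens hidden
num_masked = int(num_tokens * mask_ratio)

# Randomly choose which token positions to mask.
masked_positions = rng.choice(num_tokens, size=num_masked, replace=False)

MASK_ID = -1                            # stand-in for a learned [MASK] token
corrupted = tokens.copy()
corrupted[masked_positions] = MASK_ID

# The training objective (not implemented here) is to predict the
# original ids at the masked positions from the visible context.
targets = tokens[masked_positions]
```

The network only ever sees `corrupted`; learning to recover `targets` from the surrounding visible tokens is what forces it to model image structure.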
“Our key insight in this work is that generation is viewed as ‘reconstructing’ images that are 100% masked, while representation learning is viewed as ‘encoding’ images that are 0% masked,” the researchers wrote in a paper detailing the system. “The model is trained to reconstruct over a wide range of masking ratios covering high masking ratios that enable generation capabilities, and lower masking ratios that enable representation learning. This simple but very effective approach allows a smooth combination of generative training and representation learning in the same framework: same architecture, training scheme, and loss function.”
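The variable masking ratio the researchers describe can be sketched as sampling a ratio per training example from a truncated distribution. The specific distribution and parameters below are illustrative assumptions, not necessarily those used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mask_ratio(mean=0.55, std=0.25, lo=0.5, hi=1.0):
    """Sample a per-example masking ratio from a clipped Gaussian.

    High draws (near 1.0) push the model toward generation; lower
    draws push it toward representation learning. Parameters are
    illustrative.
    """
    return float(np.clip(rng.normal(mean, std), lo, hi))

ratios = [sample_mask_ratio() for _ in range(1000)]
```

Because every example uses the same architecture and loss regardless of the sampled ratio, both capabilities emerge from a single training run.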
In addition to producing images from scratch, the system supports conditional image generation, where users can specify criteria for the images and the tool will cook up the appropriate image.
“The user can input a whole image and the system can understand and recognize the image, outputting the class of the image,” Tianhong Li, one of the researchers behind the system, told VentureBeat. “In other scenarios, the user can input an image with partial crops, and the system can recover the cropped image. They can also ask the system to generate a random image or generate an image given a certain class, such as a fish or dog.”
Potential for many applications
When pre-trained on data from the ImageNet image database, which consists of 1.3 million images, the model obtained a Fréchet inception distance (FID) score of 9.1, outperforming previous models (FID measures the quality of generated images; lower is better). For recognition, it achieved 80.9% accuracy in linear probing and 71.9% 10-shot accuracy when it had only 10 labeled examples from each class.
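Linear probing, the recognition benchmark mentioned above, means freezing the pretrained encoder and fitting only a linear classifier on its output features. A minimal sketch with synthetic stand-in features (a logistic head is more typical than the least-squares fit used here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen-encoder features and their labels.
features = rng.standard_normal((100, 8))        # encoder outputs (frozen)
labels = (features[:, 0] > 0).astype(int)       # two toy classes

# Linear probe: fit only a linear map on top of the frozen features,
# here via least squares onto one-hot targets.
onehot = np.eye(2)[labels]
W, *_ = np.linalg.lstsq(features, onehot, rcond=None)
preds = (features @ W).argmax(axis=1)
accuracy = (preds == labels).mean()
```

Because the encoder's weights never change, the probe's accuracy directly measures how linearly separable the learned representations are.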
“Our method can naturally scale up to any unlabeled image dataset,” Li said, noting that the model’s image understanding capabilities can be beneficial in scenarios where limited labeled data is available, such as in niche industries or emerging technologies.
Similarly, he said, the generation side of the model can help in industries like photo editing, visual effects and post-production with its ability to remove elements from an image while maintaining a realistic appearance, or, given a specific class, replace an element with another generated element.
“It has [long] been a dream to achieve image generation and image recognition in one single system. MAGE is a [result of] groundbreaking research which successfully harnesses the synergy of these two tasks and achieves the state of the art of them in one single system,” said Huisheng Wang, senior software engineer for research and machine intelligence at Google, who participated in the MAGE project.
“This innovative system has wide-ranging applications, and has the potential to inspire many future works in the field of computer vision,” he added.
More work needed
Moving ahead, the team plans to streamline the MAGE system, especially the token conversion part of the process. Currently, when the image data is converted into tokens, some of the information is lost. Li and the team plan to address this by exploring other compression methods.
Beyond this, Li said they also plan to scale up MAGE on real-world, large-scale unlabeled image datasets, and to apply it to multi-modality tasks, such as image-to-text and text-to-image generation.