For several years, Meta’s chief AI scientist Yann LeCun has been talking about deep learning systems that can learn world models with little or no help from humans. Now, that vision is slowly coming to fruition as Meta has just released the first version of I-JEPA, a machine learning (ML) model that learns abstract representations of the world through self-supervised learning on images.
Initial tests show that I-JEPA performs strongly on many computer vision tasks. It is also much more efficient than other state-of-the-art models, requiring a tenth of the computing resources for training. Meta has open-sourced the training code and model and will be presenting I-JEPA at the Conference on Computer Vision and Pattern Recognition (CVPR) next week.
The idea of self-supervised learning is inspired by the way humans and animals learn. We obtain much of our knowledge simply by observing the world. Likewise, AI systems should be able to learn through raw observations without the need for humans to label their training data.
Self-supervised learning has made great inroads in some fields of AI, including generative models and large language models (LLMs). In 2022, LeCun proposed the "joint embedding predictive architecture" (JEPA), a self-supervised model that can learn world models and important knowledge such as common sense. JEPA differs from other self-supervised models in important ways.
Generative models such as DALL-E and GPT are designed to make granular predictions. For example, during training, part of a text or image is obscured and the model tries to predict the exact missing words or pixels. The problem with trying to fill in every bit of information is that the world is unpredictable; when many completions are equally plausible, the model is pulled toward an average of them. This is why generative models often fail when rendering detailed objects such as hands.
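A toy numpy sketch (illustrative only, not Meta's code) shows why exact pixel prediction struggles with ambiguity: under a mean-squared-error objective, the optimal guess for an uncertain pixel is the average of all plausible values, which matches none of them.

```python
import numpy as np

# Two equally plausible completions for a masked pixel: black (0) or white (255).
outcomes = np.array([0.0, 255.0])

# A pixel-level model trained with mean-squared error is pushed toward the
# average of the plausible outcomes -- a gray smear that matches neither.
best_guess = outcomes.mean()
print(best_guess)  # 127.5
```

The same averaging effect, spread across thousands of ambiguous pixels, produces the blur and distortion seen in fine details.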
In contrast, instead of pixel-level details, JEPA tries to learn and predict high-level abstractions, such as what the scene must contain and how objects relate to each other. This approach makes the model less error-prone and much less costly as it learns the latent space of the environment.
“By predicting representations at a high level of abstraction rather than predicting pixel values directly, the hope is to learn directly useful representations that also avoid the limitations of generative approaches,” Meta’s researchers write.
I-JEPA is an image-based implementation of LeCun’s proposed architecture. It predicts missing information by using “abstract prediction targets for which unnecessary pixel-level details are potentially eliminated, thereby leading the model to learn more semantic features.”
I-JEPA encodes the existing information using a vision transformer (ViT), a variant of the transformer architecture used in LLMs but modified for image processing. It then passes on this information as context to a predictor ViT that generates semantic representations for the missing parts.
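The flow described above can be sketched in a few lines of numpy. This is a toy stand-in, not Meta's released code: the one-layer "encoders," patch sizes, and mean-pooled predictor are all illustrative assumptions. What it does capture is the key design point: the training targets are embeddings of the masked region, not its pixels.

```python
import numpy as np

rng = np.random.default_rng(42)
n_patches, d = 64, 32          # toy sizes: an 8x8 patch grid, 32-dim embeddings

def encode(patches, W):
    """Stand-in for a ViT encoder: a single linear map plus nonlinearity."""
    return np.tanh(patches @ W)

patches = rng.random((n_patches, 48))          # flattened image patches (toy data)
W_ctx = rng.normal(scale=0.1, size=(48, d))    # context-encoder weights
W_tgt = W_ctx.copy()                           # target encoder: in practice a slow-moving copy
W_pred = rng.normal(scale=0.1, size=(d, d))    # predictor weights

target_idx = np.arange(20, 28)                 # the masked block to predict
context_idx = np.setdiff1d(np.arange(n_patches), target_idx)

# 1. Encode only the visible context patches.
ctx_emb = encode(patches[context_idx], W_ctx)

# 2. Predict embeddings for the masked block from the context
#    (the real predictor is a transformer conditioned on target positions).
pred_emb = np.tile(ctx_emb.mean(axis=0) @ W_pred, (len(target_idx), 1))

# 3. Targets are *embeddings* of the masked patches, not their pixel values.
tgt_emb = encode(patches[target_idx], W_tgt)

# 4. Training would minimize distance in representation space, not pixel space.
loss = np.mean((pred_emb - tgt_emb) ** 2)
```

Because the loss lives in the embedding space, the model is never penalized for failing to reproduce unpredictable pixel-level detail.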
The researchers at Meta trained a generative model that creates sketches from the semantic data that I-JEPA predicts. In the following images, I-JEPA was given the pixels outside the blue box as context and it predicted the content inside the blue box. The generative model then created a sketch of I-JEPA’s predictions. The results show that I-JEPA’s abstractions match the reality of the scene.
While I-JEPA will not generate photorealistic images, it could have numerous applications in fields such as robotics and self-driving cars, where an AI agent must understand its environment and anticipate a handful of plausible outcomes.
A very efficient model
One obvious benefit of I-JEPA is its memory and compute efficiency. The pre-training stage does not require the compute-intensive data augmentation techniques used in other self-supervised learning methods. The researchers were able to train a 632-million-parameter model using 16 A100 GPUs in under 72 hours, about a tenth of what other techniques require.
“Empirically, we find that I-JEPA learns strong off-the-shelf semantic representations without the use of hand-crafted view augmentations,” the researchers write.
Their experiments show that I-JEPA also requires much less fine-tuning to outperform other state-of-the-art models on computer vision tasks such as classification, object counting and depth prediction. The researchers were able to fine-tune the model on the ImageNet-1K image classification dataset with 1% of the training data, using only 12 to 13 images per class.
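The low-label regime the researchers describe can be illustrated with a minimal numpy sketch. This is not Meta's evaluation code: the synthetic "frozen embeddings," class counts, and nearest-class-mean probe are all assumptions chosen for brevity. The point is that when an encoder's features already separate classes well, roughly a dozen labeled examples per class can suffice to fit a classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, per_class, d = 5, 12, 32   # ~12 labeled images per class, as in the 1% setting

# Pretend these are frozen encoder embeddings: tight clusters around class centers.
centers = rng.normal(size=(n_classes, d))
X_train = np.repeat(centers, per_class, axis=0) \
          + 0.1 * rng.normal(size=(n_classes * per_class, d))
y_train = np.repeat(np.arange(n_classes), per_class)

# A minimal probe: classify by the nearest class mean in feature space.
means = np.stack([X_train[y_train == c].mean(axis=0) for c in range(n_classes)])

X_test = centers + 0.1 * rng.normal(size=(n_classes, d))
pred = np.argmin(((X_test[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)
accuracy = (pred == np.arange(n_classes)).mean()
```

With well-separated features, even this crude probe classifies the held-out points correctly; the quality of the frozen representation, not the amount of labeled data, does the heavy lifting.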
“By using a simpler model with less rigid inductive bias, I-JEPA is applicable to a wider set of tasks,” the researchers write.
Given the high availability of unlabeled data on the internet, models such as I-JEPA can prove to be very valuable for applications that previously required large amounts of manually labeled data. The training code and pre-trained models are available on GitHub, though the model is released under a non-commercial license.