Over the last 10 years, neural networks have taken a giant leap from recognizing simple visual objects to creating coherent texts and photorealistic 3D renders. As computer graphics get more sophisticated, neural networks help automate a significant part of the workflow. The market demands new, efficient solutions for creating 3D images to fill the hyper-realistic space of the metaverse.
But what technologies will we use to construct this space, and will artificial intelligence help us?
Neural networks emerge
Neural networks came into the limelight of the computer vision industry in September 2012, when the convolutional neural network AlexNet won the ImageNet Large Scale Visual Recognition Challenge. AlexNet proved capable of recognizing, analyzing and classifying images. This breakthrough skill caused the wave of hype that AI art is still riding.
Then, in 2017, a scientific paper called Attention Is All You Need was published. The paper described the design and architecture of the “Transformer,” a neural network created for natural language processing (NLP). OpenAI proved the effectiveness of this architecture by creating GPT-3 in 2020. Many tech giants set out to match that quality and started training their own neural networks based on Transformers.
The ability to recognize images and objects and to create coherent text based on them led to the next logical step in the evolution of neural networks: Turning text input into images. This kick-started extensive research toward text-to-image models. As a result, the first version of DALL-E — a breakthrough achievement in deep learning for generating 2D images — was created in January 2021.
From 2D to 3D
Shortly before DALL-E, another breakthrough allowed neural networks to start creating 3D images with almost the same quality and speed as 2D images. This became possible with the help of the neural radiance fields (NeRF) method, which uses a neural network to recreate realistic 3D scenes from a collection of 2D images.
Classic CGI has long demanded a more cost-efficient and flexible solution for 3D scenes. For context, each scene in a computer game consists of millions of triangles, and it takes a lot of time, energy and processing power to render them. As a result, the game development and computer vision industries are always trying to strike a balance between the number of triangles (the lower the number, the faster they can be rendered) and the quality of the output.
In contrast to classic polygonal modeling, neural rendering reproduces a 3D scene based solely on the laws of optics and linear algebra. We see the world as three-dimensional because the sun’s rays reflect off objects and hit our retinas. NeRF models a space following the same principle, an approach known as inverse rendering. Simulated rays of light hit a specific point on a surface, approximating the light’s behavior in the physical world. Each approximated ray carries a certain radiance (color), and this is how NeRF decides which color to “paint” a pixel, given its coordinates on the screen. In this way, any 3D scene becomes a function of the x, y and z coordinates and the view direction.
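To make the idea of a scene as a function concrete, here is a minimal Python sketch of how one pixel's color can be computed from such a function. It is an illustration under stated assumptions, not the NeRF implementation: `toy_radiance_field` is a hypothetical, hand-written stand-in for the trained neural network (it describes a simple red sphere), and `render_ray` and its parameters are names chosen for this example. The loop follows the general volume rendering idea that NeRF uses, in which samples along a camera ray are blended according to their density.

```python
import math


def toy_radiance_field(x, y, z, view_dir):
    """Hypothetical stand-in for a trained NeRF network (assumption).

    Maps a 3D point and a unit view direction to (rgb, density).
    Here the "scene" is a solid sphere of radius 1 at the origin.
    """
    inside = x * x + y * y + z * z <= 1.0
    density = 5.0 if inside else 0.0
    # View-dependent shading: brightest when looking along the -z axis.
    shade = 0.5 + 0.5 * max(0.0, -view_dir[2])
    rgb = (shade, 0.2 * shade, 0.2 * shade)
    return rgb, density


def render_ray(origin, direction, near=0.0, far=6.0, n_samples=64):
    """Approximate one pixel's color by sampling the field along a ray."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    step = (far - near) / n_samples
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    for i in range(n_samples):
        t = near + (i + 0.5) * step  # midpoint of the i-th segment
        point = tuple(o + t * c for o, c in zip(origin, d))
        rgb, sigma = toy_radiance_field(*point, d)
        alpha = 1.0 - math.exp(-sigma * step)  # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            color[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
    # Return the blended color and the accumulated opacity of the ray.
    return tuple(color), 1.0 - transmittance
```

For example, `render_ray((0.0, 0.0, 3.0), (0.0, 0.0, -1.0))` traces a ray that passes through the sphere and returns a reddish color with opacity close to 1, while a ray that misses the sphere returns black with opacity 0. In a real NeRF, the stand-in function would be replaced by a network trained on the 2D photos.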
NeRF can model a three-dimensional space of any complexity. The rendering quality is also astonishingly high, a great advantage over classic polygonal rendering. The output is not a CGI image but a photorealistic 3D scene that uses no polygons or textures and is free from the other known downsides of the classic approaches to rendering.
Render speed: The main gatekeeper to neural 3D rendering
Even though NeRF’s render quality is impressive, the method is still hard to use in a real-world production setting because it doesn’t scale well and requires a lot of time. With classic NeRF, it takes one to three days of training to recreate a single scene. Each frame then takes 10 to 30 seconds to render on a high-end graphics card. This is still incredibly far from real-time or on-device rendering, so it’s too early to speak about market use of NeRF technology at scale.
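A quick back-of-the-envelope calculation shows how far this is from real time. The sketch below uses the 10 to 30 seconds per frame quoted above; the 60-second clip length and the 30 frames per second playback rate are assumptions chosen for illustration, and the function names are hypothetical.

```python
def render_time_hours(clip_seconds, playback_fps, seconds_per_frame):
    """Hours needed to render a clip offline at a given cost per frame."""
    frames = clip_seconds * playback_fps
    return frames * seconds_per_frame / 3600


def slowdown_vs_real_time(playback_fps, seconds_per_frame):
    """How many times slower than real time the renderer runs."""
    return seconds_per_frame * playback_fps


# Assumed example: a 60-second clip played back at 30 frames per second.
for spf in (10, 30):
    print(spf, render_time_hours(60, 30, spf), slowdown_vs_real_time(30, spf))
# prints:
# 10 5.0 300
# 30 15.0 900
```

Under these assumptions, one minute of footage takes between 5 and 15 hours to render, which is 300 to 900 times slower than real time.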
However, the market is aware that such technology exists, and so a distinct demand for it exists, too. As a result, many improvements and optimizations have been implemented for NeRF during the last two years. The most discussed is Nvidia’s recent solution, Instant NeRF, introduced in March 2022. This approach considerably sped up training for static scenes. With it, training takes not days but somewhere between several seconds and several minutes, and it’s possible to render several dozen frames per second.
However, one issue remains unresolved: How to render dynamic scenes. Also, to commoditize the technology and make it appealing and available to the broader market, it still needs to be improved and made usable on less specialized equipment, like personal laptops and workstations.
The next big thing: Combining generative transformers and NeRF
The Transformer once boosted the development of NLP and multimodal representations and made it possible to create 2D images from text descriptions. It could just as quickly boost the development of NeRFs and make them more commoditized and widespread. Imagine that you could turn a text description into three-dimensional objects, which could then be combined into full-scale dynamic scenes. This may sound fantastical, but it’s a totally realistic engineering task for the near future. Solving it could create a so-called “imagination machine” capable of turning any text description into a complete and dynamic 3D narrative, making it possible for the user to move around or interact with the virtual space. It sounds very much like the metaverse, doesn’t it?
However, before neural rendering becomes useful in the metaverse of the future, there are real tasks for it today. These include rendering scenes for games and films, creating photorealistic 3D avatars, and transferring objects to digital maps for so-called photo tourism, in which you can get inside a three-dimensional space of any object for a fully immersive experience. Later, after the technology is optimized and commoditized, neural 3D rendering may become just as common and accessible to everyone as the photo and video filters and masks in the smartphone apps we use today.
Olas Petriv is CTO and co-founder at Reface.