How do you teach computers to understand language — not just transcribe human speech, but actually comprehend what someone is saying? It’s one of the grand challenges of AI, and we still don’t really know the best way to tackle the problem. Facebook’s AI research lab, FAIR, has one idea: teach AIs to understand language by getting them to guide virtual tourists around New York City.
FAIR is releasing what it calls Talk the Walk, a dataset designed to be used by other researchers. It comprises three elements: small maps of New York City neighborhoods (each a couple of blocks wide), 360-degree photos of the same locations, and sample dialogues of humans guiding one another around these neighborhoods. Basically, it’s everything you might need to teach an AI to tackle this task itself.
This may all sound a little odd as a method of training AI, but FAIR is tapping into an established field of research known as “grounded language learning” or “embodied learning.” This theory says that the only way we can teach AI to understand language like humans is to get them to learn like we do — in the real world.
Speaking to The Verge, FAIR researcher Douwe Kiela compares current training methods to giving someone a dictionary of a foreign language and expecting them to teach themselves. “With natural language processing, what we tend to do is take a large corpus like Wikipedia and [get AI] to look for statistical patterns, which is very different to how humans learn,” says Kiela. “Humans learn language efficiently because we can relate our experiences to the world around us.”
Of course, small slices of New York City are not representative of the whole world. But the idea is that if we can get AI to succeed at this particular task, the techniques researchers use will be applicable elsewhere. This is an established way to drive progress in AI, and notable datasets (like ImageNet) are often credited with pushing the whole field forward.
FAIR’s researchers suggest that teams try to teach two AI agents to navigate their virtual New York maps. One agent would be a “tourist” who can see the 360-degree photos but not the map, and the other a “guide” who can see the map but not the photos. The agents then have to talk to one another to establish the tourist’s location and help them navigate to another point on the map. The tourist would look for nearby landmarks, like restaurants, bars, and coffee shops, and the guide would give them directions.
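To make the split of information concrete, here is a minimal Python sketch of the setup described above. It is a toy under loud assumptions, not FAIR's code or data format: the 2x2 map, the `Tourist` and `Guide` classes, and the `localize` function are all hypothetical, the "photos" are reduced to a single landmark label per corner, and the "dialogue" is reduced to the tourist reporting that label. The real task involves 360-degree images and free-form natural language.

```python
# Hypothetical 2x2 toy map: each (x, y) corner holds one landmark type.
TOY_MAP = {
    (0, 0): "coffee shop",
    (1, 0): "bar",
    (0, 1): "restaurant",
    (1, 1): "coffee shop",
}

MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}


class Tourist:
    """Sees only what is at its current corner, never the map as a whole."""

    def __init__(self, world, position):
        self._world = world
        self.position = position

    def describe(self):
        # Stand-in for looking at a 360-degree photo and naming a landmark.
        return self._world[self.position]

    def walk(self, direction):
        # Returns True if the step stayed on the map, False if it was blocked.
        dx, dy = MOVES[direction]
        target = (self.position[0] + dx, self.position[1] + dy)
        if target in self._world:
            self.position = target
            return True
        return False


class Guide:
    """Sees only the map, never the tourist's surroundings or position."""

    def __init__(self, city_map):
        self.city_map = city_map
        self.candidates = set(city_map)  # every corner is possible at first

    def hear(self, landmark):
        # Keep only the corners consistent with what the tourist reported.
        self.candidates = {p for p in self.candidates
                           if self.city_map[p] == landmark}

    def suggest_move(self):
        # Pick the first direction that is walkable from some candidate.
        for name, (dx, dy) in MOVES.items():
            if any((x + dx, y + dy) in self.city_map
                   for x, y in self.candidates):
                return name
        return "north"

    def note_move(self, direction, moved):
        # Update beliefs using whether the tourist's step succeeded.
        dx, dy = MOVES[direction]
        shifted = {(x, y): (x + dx, y + dy) for x, y in self.candidates}
        if moved:
            self.candidates = {t for t in shifted.values()
                               if t in self.city_map}
        else:
            self.candidates = {p for p, t in shifted.items()
                               if t not in self.city_map}


def localize(tourist, guide, max_turns=5):
    """Alternate reports and moves until the guide has one candidate left."""
    for _ in range(max_turns):
        guide.hear(tourist.describe())
        if len(guide.candidates) == 1:
            return next(iter(guide.candidates))
        direction = guide.suggest_move()
        guide.note_move(direction, tourist.walk(direction))
    return None
```

Calling `localize(Tourist(TOY_MAP, (0, 0)), Guide(TOY_MAP))` has the guide narrow its candidates from what the tourist reports until only one corner remains. The point of the sketch is only the division of knowledge: neither agent can solve the problem alone, so they have to exchange information.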
Think of Talk the Walk like one of those early fantasy adventure games where you’re faced with a dungeon corridor, and you have to make a choice like “go north” or “go south” or “turn around.” But instead of exploring a dungeon to find treasure, you’re stuck in NYC’s Financial District looking for a hairdresser called Snip Dogg.
The researchers at FAIR say they haven’t been able to create AI agents that can tackle this problem yet. (Why? “Because it’s super hard!” says Kiela.) But they expect teams to start building bots that can guide virtual tourists sort of competently in the next few years. FAIR has established baseline results for a sub-task known as “localization,” which means getting the tourist AI to convey to the guide AI where they are on the map.
The overall Talk the Walk task is challenging because it combines so many different elements of AI perception and language. Agents need to be able to recognize their surroundings, convey that information, and then interact with the world. “The end goal is to have AI assistants that understand humans better because they understand the world better,” says Kiela. “That’s something that’s applicable to Facebook and to any company in the world.”