Humans don’t have just one sense; we possess at least nine, and we need them all to navigate and understand the world. What’s more, the human brain is adept at using these senses in combination, such as detecting that a large truck is nearby both by hearing its characteristic sound and by feeling the intense vibration. Now, new research is exploring ways to make artificial intelligence utilize and combine different types of sensory inputs, giving it a more sophisticated awareness of real-world events than object recognition, text recognition or audio recognition can provide alone.
Two new papers from MIT and Google are blazing a trail toward building AI systems that can see, hear and read holistically, according to a story in Quartz.
This effort to align inputs from different senses represents a new direction for AI. Most cognitive engines focus on one specific sense, such as transcription algorithms that operate strictly in the realm of sound, or object-recognition software that deals only with visual phenomena. This new research allows AI systems to link and align the knowledge they gather across senses, accelerating their ability to learn.
“It doesn’t matter if you see a car or hear an engine, you instantly recognize the same concept. The information in our brain is aligned naturally,” said MIT researcher Yusuf Aytar, in comments to NBC News.
The MIT team is cultivating this kind of cross-sensory learning by showing a neural network video frames paired with particular sounds. Using audio recognition and object recognition, the network began to predict which objects were associated with which sounds.
The team then had the network ingest captioned images depicting similar situations, so that the algorithm could use text recognition to link words with the objects and their actions. Again, the network was able to identify the objects and match them to the relevant words.
The system was then able to match audio to text, even though it had never been specifically trained to know which words matched which sounds. This indicated that the network had “built a more objective idea of what it was seeing, hearing, or reading, one that didn’t entirely rely on the medium it used to learn the information,” Quartz noted.
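To make the idea concrete, here is a minimal sketch of that kind of cross-modal alignment. It assumes a contrastive objective over precomputed feature vectors in a shared embedding space; the encoder names, dimensions and loss are illustrative assumptions, not the published MIT or Google implementations. The key point it demonstrates is that audio and text are each paired only with images during training, yet end up comparable to one another.

```python
# Illustrative sketch (not the published MIT/Google code): three encoders mapped
# into one shared embedding space, trained only on (frame, sound) and (image, caption) pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 128  # size of the shared embedding space (arbitrary for this sketch)

def make_encoder(input_dim: int) -> nn.Module:
    # Stand-in encoder: a small MLP over precomputed features. Real systems would use
    # CNNs for frames and spectrograms, and a text model for captions.
    return nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, EMBED_DIM))

image_enc = make_encoder(2048)  # pooled image features (assumed dimension)
audio_enc = make_encoder(1024)  # spectrogram features (assumed dimension)
text_enc = make_encoder(300)    # averaged word vectors (assumed dimension)

def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # Pull matching pairs (row i of a with row i of b) together, push mismatches apart.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

params = list(image_enc.parameters()) + list(audio_enc.parameters()) + list(text_enc.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

# Dummy batches standing in for (video frame, sound) and (image, caption) pairs.
frames, sounds = torch.randn(8, 2048), torch.randn(8, 1024)
images, captions = torch.randn(8, 2048), torch.randn(8, 300)

for step in range(100):
    # Audio and text are each aligned to images, but never directly to each other.
    loss = (contrastive_loss(image_enc(frames), audio_enc(sounds)) +
            contrastive_loss(image_enc(images), text_enc(captions)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Audio-to-text matching can then emerge because both modalities share the image-anchored space.
with torch.no_grad():
    sound_vec = F.normalize(audio_enc(sounds[:1]), dim=-1)
    caption_vecs = F.normalize(text_enc(captions), dim=-1)
    print("closest caption to the sound:", int((sound_vec @ caption_vecs.t()).argmax()))
```

In this toy setup the images act as the common anchor, which is one simple way a network could match sounds to words without ever seeing that pairing during training.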
The neural network can also extrapolate across inputs, for example by visualizing what it hears or reads, to make new connections and gain a deeper understanding of the world.
Google is taking a similar approach, with the added capability of translating text.
With the development of technologies that give AI a multi-sensory view of the world, future machines may perceive the world in the same multifaceted way that humans do.
Stephan Cunningham is vice president, product management at Veritone. Working in concert with core internal teams including industry-specific general managers and engineering as well as directly with clients and prospects, he leads the disciplines and business processes which govern the Veritone Platform.