Each day seems to bring a new AI cognitive engine designed to process some form of communication, such as speech transcription or text analytics. But what about the aspects of human interaction that are expressed wordlessly, through the subtle and often unconscious language of gestures? Google now has a technology for that: a gesture recognition algorithm that observes human activities to learn certain actions and interpret their meaning.
The algorithm is currently being trained to detect common motions, such as closing a door, answering a phone or giving a kiss, as reported by The Sun. Training involves having the algorithm watch a set of YouTube videos that portray activities people frequently perform.
Google said its gesture recognition technology will help it understand “what humans are doing, what they might do next and what they are trying to achieve,” according to The Sun.
Experts speculate that the gesture recognition technology could help Google advertisers better target consumers based on what they are watching. For example, if a viewer is enjoying a kickboxing event on YouTube, the system could present martial arts-related ads.
Video surveillance could also benefit from this innovation, with systems that observe human behavior and predict what a person’s next action might be.
“Despite exciting breakthroughs made over the past years in classifying and finding objects in images, recognizing human actions still remains a big challenge,” Google stated in a blog post. “This is due to the fact that actions are, by nature, less well-defined than objects in videos, making it difficult to construct a finely labeled action video dataset.”
Google noted that no dataset previously existed that depicted complex scenes with multiple people, each potentially performing different actions. To address this gap, the company has released AVA, which stands for “Atomic Visual Actions.” AVA provides multiple action labels for each person shown in extended video sequences.
“AVA consists of URLs for publicly available videos from YouTube, annotated with a set of 80 atomic actions (e.g. “walk”, “kick (an object)”, “shake hands”) that are spatio-temporally localized, resulting in 57.6k video segments, 96k labeled humans performing actions, and a total of 210k action labels,” Google stated.
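For readers who want a concrete feel for those labels, the released annotations are distributed as CSV files that tie each action label to a video, a timestamp and a bounding box around the person performing it. The short sketch below shows one way such rows could be grouped per person and moment in time; the column layout (video_id, middle_frame_timestamp, box coordinates, action_id, person_id) and the file name are assumptions for illustration, not code from Google.

```python
import csv
from collections import defaultdict

# Sketch of reading AVA-style annotations, assuming a header-less CSV layout of
# video_id, middle_frame_timestamp, x1, y1, x2, y2, action_id, person_id,
# with box coordinates normalized to [0, 1] and one row per (person, action) label.

def load_ava_annotations(csv_path):
    """Group action labels by (video_id, timestamp, person_id)."""
    labels = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            video_id, timestamp, x1, y1, x2, y2, action_id, person_id = row
            key = (video_id, float(timestamp), int(person_id))
            labels[key].append({
                "box": (float(x1), float(y1), float(x2), float(y2)),
                "action_id": int(action_id),
            })
    return labels

if __name__ == "__main__":
    # Hypothetical file name; the real annotation files are linked from Google's AVA site.
    annotations = load_ava_annotations("ava_train.csv")
    print(f"{len(annotations)} labeled (video, time, person) keys")
```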
AVA was created using video from films and television shows featuring professional actors. Google analyzed a 15-minute clip from each video and partitioned it into 3-second segments depicting distinct activities.
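To make those numbers concrete, the small sketch below works out how a 15-minute clip breaks down into consecutive, non-overlapping 3-second segments; the helper function is illustrative only and is not part of Google's annotation tooling.

```python
# Segmentation arithmetic for the scheme described above: a 15-minute clip
# split into consecutive, non-overlapping 3-second segments.

CLIP_MINUTES = 15
SEGMENT_SECONDS = 3

def segment_boundaries(clip_start_sec=0, clip_minutes=CLIP_MINUTES,
                       segment_seconds=SEGMENT_SECONDS):
    """Return (start, end) times in seconds for each segment of the clip."""
    total_seconds = clip_minutes * 60
    return [
        (clip_start_sec + t, clip_start_sec + t + segment_seconds)
        for t in range(0, total_seconds, segment_seconds)
    ]

segments = segment_boundaries()
print(len(segments))   # 300 segments in a 15-minute clip
print(segments[:2])    # [(0, 3), (3, 6)]
```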
“We hope that the release of AVA will help improve the development of human action recognition systems, and provide opportunities to model complex activities based on labels with fine spatio-temporal granularity at the level of individual person’s actions,” Google stated.
Tyler Schulze is vice president, strategy & development at Veritone. He serves as general manager for developer partnerships, cognitive engine ecosystem, and media ingestion for the Veritone platform. Learn more about our platform and join the Veritone developer ecosystem today.