One reason text-based forms of communication are popular is their searchability: search tools can easily comb through written words to find specific keywords and phrases. For spoken words, however, this kind of automated searching was not possible until now. A new app launched in February makes searching voice conversations as easy as searching emails and texts.
Called Otter, the app is a voice recorder that incorporates automatic transcription. In contrast to voice assistants like Alexa or Google Assistant that are designed to support brief interactions with users, Otter was created specifically to handle lengthy communications, such as teleconferences.
“The existing technologies are not good enough for human-to-human conversations,” said Sam Liang, CEO and founder of Otter’s creator, AISense Inc., as quoted by TechCrunch. “Google’s voice API has been trained to optimize voice search,” he said, noting that when people talk to voice assistants, usually only one person is speaking, and that person tends to talk more slowly and carefully than in ordinary conversation. Users of voice assistants also ask brief questions and don’t engage in long conversations.
“Human meetings are much more complicated,” Liang said. “It usually involves at least two people, and the people could talk for an hour. It’s a long-form conversation.”
To handle these long-form, multi-speaker conversations, AISense built its own AI technology stack from the ground up, including the speech recognition. It did so because speech recognition application programming interfaces (APIs) offered by other companies frequently aren’t suited to conversations involving multiple speakers.
The development of Otter was made possible by recent advances in AI technology.
“Four years ago, there were tremendous advances in deep learning and A.I., and suddenly, the accuracy became much higher,” Liang noted. “It also requires a lot of CPU power, GPU power, and a lot of storage…these became much more affordable today compared to five or ten years ago.”
AISense has already licensed its transcription technology to a company called Zoom Video Communications. Zoom, which provides cloud-based video and web conferencing services, recently added a new feature called Recording Transcripts. Recording Transcripts transcribes the speech from conference recordings into text.
The text is searchable: users can look up a keyword in the transcript and then jump to the point in the recording where it was spoken, which lets them hear the context surrounding the term.
The transcription feature can even identify the names of individual speakers in the conference.
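The keyword-to-recording lookup described above can be illustrated with a minimal sketch. It is not Otter's or Zoom's actual implementation; it simply assumes a transcript is stored as a list of segments, each with a speaker name, a start time in seconds, and the transcribed text. The names `Segment`, `search_transcript`, and `format_timestamp`, as well as the sample data, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed utterance (hypothetical structure for illustration)."""
    speaker: str
    start_seconds: float
    text: str


def search_transcript(segments, keyword):
    """Return the segments whose text contains the keyword, ignoring case."""
    needle = keyword.casefold()
    return [seg for seg in segments if needle in seg.text.casefold()]


def format_timestamp(seconds):
    """Convert a time offset in seconds to an HH:MM:SS string."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Made-up sample transcript, used only to exercise the functions.
transcript = [
    Segment("Alice", 12.0, "Let's start with the quarterly budget."),
    Segment("Bob", 95.5, "I think the launch date should move."),
    Segment("Alice", 3725.0, "Back to the budget: can we approve it today?"),
]

for hit in search_transcript(transcript, "budget"):
    # A player could seek to hit.start_seconds to replay the context.
    print(f"[{format_timestamp(hit.start_seconds)}] {hit.speaker}: {hit.text}")
```

Keeping the start time alongside each piece of text is what makes the jump back into the recording possible: the search runs over the text, and the match carries the position to seek to.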
Tyler Schulze is vice president, strategy & development at Veritone. He serves as general manager for developer partnerships, cognitive engine ecosystem, and media ingestion for the Veritone platform.