Baidu Text to Speech Recognition 06.2.17

Baidu’s Text-to-Speech Technology Does Perfect Voice Impressions

The capability to detect and emulate slight differences in accents and voice characteristics is a fundamental human trait that allows people to understand social identity and fit in with different groups. However, systems like digital assistants often come up short when it comes to understanding and conveying these subtleties. Enter Chinese internet giant Baidu Inc., which says the latest version of its Deep Voice text-to-speech technology can imitate hundreds of voices with 100 percent accuracy.

This accomplishment may lead to the creation of voice-command systems that adjust to their users’ manner of speaking, resulting in more natural interactions.

In a research paper released in May, Baidu described how its Deep Voice 2 text-to-speech technology can listen to hundreds of voices to learn individual speaking styles. After listening to each speaker for less than 30 minutes, Deep Voice 2 can recreate that speaker’s style perfectly, as demonstrated on the company’s website.

In comparison, the first iteration of Deep Voice, released in February, needed 20 hours of training to achieve similar results. Baidu said Deep Voice 2 learns to generate speech by identifying common attributes among different voices. Each voice corresponds to a single vector of 50 numbers that summarizes how to imitate the speaker. In contrast to other text-to-speech technologies, Deep Voice 2 can learn these attributes on its own, without requiring guidance on how to distinguish the voices.
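To make the idea of one 50-number vector per voice more concrete, here is a minimal sketch in Python. It is not Baidu’s implementation: the `SpeakerTable` class, the `condition_frames` function, and the two-value text features are all hypothetical names and data invented for illustration. The only detail taken from the description above is the vector size of 50. The sketch shows one common way such a vector could be used, by attaching it to every step of the text features so that a shared model can tell speakers apart.

```python
import random

EMBEDDING_SIZE = 50  # from the article: each voice is summarized by 50 numbers


class SpeakerTable:
    """Hypothetical lookup table holding one vector per speaker."""

    def __init__(self, num_speakers, size=EMBEDDING_SIZE, seed=0):
        rng = random.Random(seed)
        # Vectors start as small random values. In a real system, training
        # would adjust them; no training is performed in this sketch.
        self.vectors = [
            [rng.uniform(-0.1, 0.1) for _ in range(size)]
            for _ in range(num_speakers)
        ]

    def lookup(self, speaker_id):
        """Return the vector stored for the given speaker index."""
        return self.vectors[speaker_id]


def condition_frames(text_frames, speaker_vector):
    """Append the speaker vector to every frame of text features."""
    return [frame + speaker_vector for frame in text_frames]


table = SpeakerTable(num_speakers=3)
text_frames = [[0.2, 0.7], [0.5, 0.1]]  # made-up features for two time steps
conditioned = condition_frames(text_frames, table.lookup(1))
print(len(conditioned), len(conditioned[0]))  # 2 frames, each 2 + 50 values
```

Because the same text features can be paired with any row of the table, switching voices in this sketch only means looking up a different vector. The rest of the model would stay shared across speakers.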

Advances such as Baidu’s latest version of Deep Voice will play a role in the rapid growth of the digital assistant market. The global installed base of devices supporting AI digital assistants amounted to 3.5 billion in 2016, according to the market research firm Ovum. By 2021, Ovum forecasts, this total will expand to 7.5 billion, exceeding the population of the planet. These devices include smartphones, tablets, wearables, smart-home systems and televisions.

Asia will represent a major growth area for AI-capable voice devices, Ovum predicts. While much of the installed base is now in North America, nearly 50 percent of such devices will be in Asia and Oceania by 2021. Baidu’s home market of China is expected to generate particularly strong growth.

“With an active installed base close to 1.2 billion devices in 2021, digital assistants of Chinese origin are set to be as powerful as Apple’s Siri or Samsung’s Bixby,” Ovum stated in a press release. “They already accounted for close to 43 million devices in 2016, led by companies such as Baidu and iFlytek.”

Tyler Schulze is vice president, strategy & development at Veritone. He serves as general manager for developer partnerships, cognitive engine ecosystem, and media ingestion for the Veritone platform.