As the third installment in our Deepfake Voice series, we will focus on text-to-speech synthesis and how artificial intelligence (AI) has evolved to make accurate, realistic voice outputs. Briefly touched on in our previous blog on voice cloning, text-to-speech is an approach that has historically been more limited than what voice cloning technology can do today.
To understand the differences, we’ll cover in this blog:
- A quick history of text-to-speech synthesis
- The different techniques used for text-to-speech
- A new era in TTS: deep learning and AI
- Breaking down the process of AI-powered synthetic speech
- What is the best TTS voice, and how can you create one?
A quick history of text-to-speech synthesis
The idea of a speech synthesis machine dates back to the 1700s, with development continuing into the 19th and 20th centuries. Advancements in speech synthesizers in the 1920s paved the way for the development of the first text-to-speech system.
The first complete text-to-speech system for English was built in 1968 at the Electrotechnical Laboratory in Japan. Noriko Umeda and her colleagues made a system that could produce intelligible speech outputs, but it was far from the authentic-sounding voices that future technology advancements would help achieve.
Throughout the 70s and 80s, advancements in the technology continued. First premiering in 1976, the Kurzweil Reading Machine for the Blind was a reading aid that combined an optical scanner with speech synthesis, providing meaningful assistance to the visually impaired.
And just before the end of the decade, specialists at MIT developed a more sophisticated text-to-speech system, one that would serve as the foundation for the speech synthesis systems of many organizations.
The different techniques used for text-to-speech
Two basic techniques are commonly used in text-to-speech synthesis:
- Concatenative synthesis: this technique links together short samples of recorded speech, called units, into a chain. These units are then sequenced to generate user-defined patterns of sound.
- Formant synthesis: formants are the characteristic resonant frequencies our vocal tracts produce; by generating and combining them, one can replicate speech sounds. This technique is most often used to replicate vowel sounds.
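To make the concatenative approach concrete, here is a minimal, illustrative sketch in Python. It is not production TTS code: the `make_unit` function is a hypothetical stand-in for a recorded speech unit (a short sine burst instead of real audio), and `concatenate` chains units with a brief crossfade to soften the joins, which is the basic idea behind unit concatenation.

```python
import numpy as np

SAMPLE_RATE = 16000

def make_unit(freq_hz, dur_s=0.1):
    """Stand-in for a recorded unit: a short sine burst, not real speech."""
    t = np.linspace(0, dur_s, int(SAMPLE_RATE * dur_s), endpoint=False)
    return np.sin(2 * np.pi * freq_hz * t)

# A toy "unit inventory" mapping symbols to waveforms.
units = {"k": make_unit(300), "ae": make_unit(220), "t": make_unit(350)}

def concatenate(symbols, crossfade=0.01):
    """Chain units, blending a short linear crossfade at each join."""
    n_fade = int(SAMPLE_RATE * crossfade)
    fade_in = np.linspace(0.0, 1.0, n_fade)
    out = units[symbols[0]].copy()
    for sym in symbols[1:]:
        nxt = units[sym]
        out[-n_fade:] = out[-n_fade:] * (1 - fade_in) + nxt[:n_fade] * fade_in
        out = np.concatenate([out, nxt[n_fade:]])
    return out

wave = concatenate(["k", "ae", "t"])  # a toy rendering of "cat"
```

The audible seams at each join, even with crossfading, are exactly the artifact that makes simple concatenative voices sound robotic.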
The drawback of these approaches is that they produce the artificial-sounding voices we are trying to avoid. But further technological advancements would make authenticity possible, bringing us to the TTS technology we have today.
A new era in TTS: deep learning and AI
Thanks to advancements in AI, machine learning, and deep learning, TTS technology has advanced to the point where we can interact with it on a daily basis. You’ve probably experienced the technology if you’ve worked with virtual assistants or bots.
You might have used it with your smart device as well when you’ve asked it to read something back to you. As a result, our interactions with technology are moving closer to human communication. But to make the technology sound less robotic and more human-like, we, ironically, need AI.
AI text-to-speech is often called neural text-to-speech; it leverages neural networks and machine learning to create synthesized speech from text. While older techniques like concatenative synthesis strung prerecorded units together, that approach could not account for the multitude of variations in natural speech, limiting overall quality.
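To give a flavor of the neural approach, here is a heavily simplified sketch: characters are mapped to learned embedding vectors, which a network projects into acoustic frames (mel-spectrogram-like features) that a vocoder would later turn into audio. The weights below are randomly initialized stand-ins for trained parameters, and the single linear layer is a toy substitute for the deep sequence models real systems use.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = "abcdefghijklmnopqrstuvwxyz "
EMB_DIM, N_MELS = 16, 80

# Random stand-ins for parameters a real model would learn from hours of speech.
embeddings = rng.normal(size=(len(VOCAB), EMB_DIM))
W = rng.normal(size=(EMB_DIM, N_MELS))

def text_to_mel(text):
    """Look up an embedding per character, project each to an acoustic frame."""
    ids = [VOCAB.index(c) for c in text.lower() if c in VOCAB]
    return np.tanh(embeddings[ids] @ W)  # shape: (time, n_mels)

mel = text_to_mel("hello world")
```

Because the mapping is learned end to end from data rather than assembled from fixed recordings, a trained model can smoothly capture variations in prosody that concatenation cannot.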
Breaking down the process of AI-powered synthetic speech
Here’s how AI speech synthesis works in a voice assistant. First, the speech engine takes the audio input and recognizes the sound waves produced by a human voice, translating them into language data; this step is called automatic speech recognition (ASR). The system then analyzes that data to understand the meaning of the words, a step known as natural-language understanding (NLU).
AI has advanced to the point where it can understand human communication well enough to determine an appropriate response, which it learns to do by analyzing large volumes of human speech. Once the engine has formed the answer in text form, the TTS stage translates it back into speech: after analyzing the context of the text response, it produces the necessary speech sounds, or phonemes.
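The steps above can be sketched as a chain of function stubs. Every stage here is a hypothetical toy stand-in (a canned transcript, a keyword-match "understanding" step, a two-word phoneme lexicon), not a real ASR, NLU, or TTS engine; the point is only the shape of the pipeline.

```python
def speech_to_text(audio: bytes) -> str:
    """ASR: sound waves -> text. A real system runs an acoustic model here."""
    return "what time is it"  # canned transcript for the sketch

def understand(text: str) -> dict:
    """NLU: text -> structured meaning (intent)."""
    return {"intent": "ask_time"} if "time" in text else {"intent": "unknown"}

def generate_response(meaning: dict) -> str:
    """Response generation: meaning -> answer text."""
    return "It is three" if meaning["intent"] == "ask_time" else "Sorry?"

def text_to_phonemes(text: str) -> list:
    """TTS front end: text -> phonemes (toy lookup, not real grapheme-to-phoneme)."""
    lexicon = {"it": ["IH", "T"], "is": ["IH", "Z"]}
    return [p for w in text.lower().split() for p in lexicon.get(w, ["?"])]

def assistant(audio: bytes) -> list:
    """Full loop: speech in -> phonemes out (a vocoder would render these as audio)."""
    return text_to_phonemes(generate_response(understand(speech_to_text(audio))))

phonemes = assistant(b"...")
```

In a real assistant each stub is replaced by a trained model, but the data flow, from audio to text to meaning to response to phonemes, is the same.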
Using AI-powered TTS over older methodologies enables a more accurate sequence of phonemes, modeling how they vary across different frequency bands.
For this reason, the technology yields the following benefits:
- More natural-sounding voices that accurately capture qualities like intonation
- Voices with realistic accents
- More human-sounding output that improves tools for learning new languages
- Assistance for the visually impaired, and restored voices for people who have lost theirs for medical reasons
What is the best TTS voice, and how can you create one?
There are several areas you should be concerned about when conducting your search for the best TTS voice. You’ll want to determine:
- How accurate does it sound, especially if you have language needs beyond English?
- How easy is it to use, especially if you have a large amount of content to convert into a TTS voice?
- Does it cost more to use additional voices or different languages?
Veritone MARVEL.ai leverages the latest TTS methodologies to create authentic-sounding voices. With a user-friendly interface, you can easily draw on a library of professional voices to turn text content into natural-sounding speech.
We also offer a service to enable the creation of custom voices. Working with our professional services team, you can create a hyper-realistic version of your voice that you will have complete control over. We put ethical use at the forefront of everything we do so that our clients have the confidence that their voice is protected and will only be used with their express permission.