03.16.23

Step-by-Step Guide: How to Create a Synthetic Voice Using AI Technology

ASHLEY BAILEY DIRECTOR OF PRODUCT MARKETING, AI VOICE AND EMERGING SYNTHETIC TECHNOLOGIES

So far in this blog series, we’ve covered deepfake voice, voice cloning, text-to-speech AI, synthetic voice, and deepfake voice fraud. While we have covered the topic of synthetic voice and its potential use cases, we will uncover exactly how to create a synthetic voice in this sixth and final blog.

We will cover the following topics:

Sourcing training data for synthetic voice creation
How does the AI voice modeling process work?
How do you create an AI voice using standard output methods?
Navigating licensing and compliance

Collecting and preparing training data for synthetic voice creation

Training data is the first essential item you need to start synthetic voice generation. There are two ways to source this data. The first applies if you already have hours of audio of your voice available: the voice solutions team can use those recordings to create the necessary training data. Veritone followed this approach quite recently with Bryan Barletta of Sounds Profitable.

Naturally, if you work in podcasting, you’ll have this audio available. But not everyone has hours of their own voice on hand. That leads to the second way: if you do not have a library of prerecorded clips to source your voice data, you must get into a studio and record.

This task is the most significant commitment on the client side. Without clean audio recordings of one’s voice, there’s no way to successfully train the AI models to capture all the intricate details unique to a person’s speech. The recording process can take a few hours or more, depending on the scope of the project and the schedule.

The voice solutions team will provide a comprehensive list of phrases designed to capture all the characteristics of one’s voice. This list usually contains thousands of phrases, though it typically won’t exceed 4,000. The goal is to capture as much data about someone’s unique voice as possible: the more data you capture, the more accurate the voice clone.

Understanding the AI voice modeling process for custom synthetic voices

We covered this briefly in the third blog in the series, understanding text-to-speech AI, but you might be wondering how, more technically, AI professionals use this training data to create a custom synthetic voice. The most common method of producing an artificially created voice is called concatenative synthesis. To concatenate is to link things in a series or put them in an order in which each depends on the others.

Concatenative synthesis searches for phonemes, the distinct units of sound in a specific language, and strings together these pieces of recorded speech to produce synthesized speech. There are three subtypes of concatenative synthesis, which we’ll discuss later.

But generally, this approach does not yield the best voice quality because of how it pulls these units together. The system draws from an audio database of segmented waveforms, and the variation between those segments is enough to make the result sound unrealistic to the human ear. This changes with neural networks.

A neural network takes an ordered set of phonemes and transforms them into a set of spectrograms. A spectrogram is a visual rendering of the spectrum of frequency bands of a signal as it changes over time.

The neural network uses these and selects the appropriate spectrograms with the frequency bands that more accurately articulate acoustic features the human brain uses in understanding and systematically organizing speech. A neural vocoder then translates these spectrograms into speech waveforms, delivering a natural-sounding replication of a voice.
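To make the idea of a spectrogram more concrete, here is a minimal sketch in plain Python. It is not how a production voice model computes its features; the function name `spectrogram`, the frame size, the hop length, and the test tone are all assumptions chosen for illustration. It slices a signal into short overlapping frames and measures how much energy each frame holds in each frequency band.

```python
import cmath
import math

def spectrogram(signal, frame_size=64, hop=32):
    """Return one list of frequency-bin magnitudes per overlapping frame."""
    # A Hann window tapers each frame so its edges do not smear the spectrum.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        # Naive discrete Fourier transform, kept simple rather than fast.
        for k in range(frame_size // 2 + 1):
            total = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames

# Hypothetical input: a pure 1,000 Hz tone sampled at 8,000 Hz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]

spec = spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64  # each bin spans sample_rate / frame_size Hz
print(len(spec), "frames;", "strongest band near", peak_hz, "Hz")
```

Real systems typically use a fast Fourier transform and a perceptual frequency scale, but the output has the same shape: time along one axis, frequency bands along the other.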

Generating a custom synthetic voice using standard output methods

There are two modalities for creating voice content. Text to speech (TTS) uses text to generate synthetic speech, often with predeveloped voices, whereas speech to speech (STS) uses audio from a person’s voice to create a custom voice. In addition, you can transform a voice into different genders, dialects, and languages if you have the necessary data.

A TTS system functions with a front end and a back end. The front end performs two tasks. The first is text normalization: converting raw text that includes symbols, abbreviations, and numbers into written words.

This process is also called text preprocessing or tokenization. The second task is assigning phonetic transcriptions to words. The output of the front end is a symbolic representation of the phonetic transcription and prosody.
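The two front-end tasks can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the front end of any real TTS product: the abbreviation table, the number range (0 to 99), and the small pronunciation lexicon with ARPAbet-style symbols are all hypothetical, and prosody is left out entirely.

```python
# Hypothetical lookup tables; a real front end would use far larger resources.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
LEXICON = {  # toy pronunciation dictionary
    "doctor": ["D", "AA", "K", "T", "ER"],
    "smith": ["S", "M", "IH", "TH"],
    "is": ["IH", "Z"],
    "forty": ["F", "AO", "R", "T", "IY"],
    "two": ["T", "UW"],
}

def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only handles 0 to 99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]

def normalize(text):
    """Task 1: turn symbols, abbreviations, and numbers into written words."""
    words = []
    for token in text.lower().replace("%", " percent").split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        core = token.strip(".,!?;:")
        if core.isdigit():
            words.extend(number_to_words(int(core)).split())
        elif core:
            words.append(core)
    return words

def to_phonemes(words):
    """Task 2: assign a phonetic transcription to each word."""
    phonemes = []
    for word in words:
        phonemes.extend(LEXICON.get(word, ["<unk>"]))  # flag unknown words
    return phonemes

words = normalize("Dr. Smith is 42% sure.")
print(words)  # ['doctor', 'smith', 'is', 'forty', 'two', 'percent', 'sure']
print(to_phonemes(normalize("Dr. Smith is 42")))
```

Words missing from the toy lexicon come back as `<unk>`; production systems usually fall back to a trained grapheme-to-phoneme model instead.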

Speech synthesis then happens on the back end after receiving the output from the front end. There are three main subtypes of the most popular method used for speech synthesis, concatenative synthesis, which we’ve already explained. These subtypes include:

Domain-specific synthesis: strings together prerecorded words or phrases to form complete remarks. This method is commonly used for simple, repetitive use cases such as weather reports or when traveling. However, it’s limited only to the words and phrases that it has been programmed to use.
Unit selection synthesis: pulls from an extensive database of prerecorded speech audio clips and breaks these recordings down into individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. These units are indexed and later put back together as the system determines the best sequence for the target phrase.
Diphone synthesis: leverages a database containing all the diphones that occur in a specific language. To give you an idea, Spanish has around 800 diphones, whereas German has around 2,500. When the units are pulled, prosody is superimposed on them using digital signal processing techniques. The output does not sound as good as unit selection but is better than domain-specific synthesis.
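As a rough illustration of the diphone approach described above, the following Python sketch splits a phoneme sequence into diphones, looks each one up in a database, and joins the clips with a short crossfade. Everything here is an assumption made for the example: the `_` silence marker, the `fake_unit` stand-in that generates a sine wave instead of loading recorded audio, and the crossfade length. It also skips the prosody step entirely.

```python
import math

def to_diphones(phonemes):
    """Pair each phoneme with its neighbour; '_' marks silence at the edges."""
    padded = ["_"] + phonemes + ["_"]
    return [a + "-" + b for a, b in zip(padded, padded[1:])]

def fake_unit(name, length=40, sample_rate=8000):
    """Stand-in for a recorded diphone clip: a sine wave keyed to the name."""
    freq = 100 + sum(ord(c) for c in name) % 400
    return [math.sin(2 * math.pi * freq * n / sample_rate) for n in range(length)]

def concatenate(units, fade=8):
    """Join clips end to end, crossfading `fade` samples at every seam."""
    out = list(units[0])
    for unit in units[1:]:
        for i in range(fade):
            w = (i + 1) / (fade + 1)  # weight ramps toward the incoming clip
            out[-fade + i] = out[-fade + i] * (1 - w) + unit[i] * w
        out.extend(unit[fade:])
    return out

# Hypothetical target: the phonemes of "hello" in ARPAbet-style notation.
target = ["HH", "AH", "L", "OW"]
diphones = to_diphones(target)
database = {name: fake_unit(name) for name in diphones}  # a real one holds recordings

waveform = concatenate([database[name] for name in diphones])
print(diphones)       # ['_-HH', 'HH-AH', 'AH-L', 'L-OW', 'OW-_']
print(len(waveform))  # 168 samples: 40 + 4 * (40 - 8)
```

The crossfade only smooths the seams. It does nothing about mismatched pitch or timing between clips, which is part of why concatenated output can still sound unnatural.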

To learn all you need to know about deepfake voice, visit our previous blog on the technology.

Navigating licensing and compliance for custom synthetic voices

As we covered in our last blog, deepfake voice fraud has started to surface in the past few years. Theoretically, if people have the know-how, the technical expertise, and the technological backing, they can find ways to replicate anyone’s voice illicitly. However, AI professionals can quickly discredit the authenticity of an illicit clone by using standards Veritone and others are putting into place, such as inaudible watermarks.

Veritone Voice, the application that acts as the driving force for our voice as a service solution, has built-in protective measures. Owners can also issue content claims against improper uses of their voice in the system. Combining our vast experience in content licensing with some of the largest rights holders, advertisers, and brands in the world, we not only have the capability to create synthetic voices but also guide people through how third parties would use their voices.

We hope this blog series has provided you with solid, foundational knowledge on this subject. As the technology evolves, along with new frontiers like the metaverse, deepfake technology will only continue to make the headlines and solidify its place as a mechanism for ethical content creation and monetization.

Talk to us today to learn more about our voice as a service solution.
