This post is part of the series In-Depth
Other posts in this series:
- In-Depth: Understanding the Cortex A53 on Mobile SoCs
- In-Depth: Why phone manufacturers continue to eliminate the headphone jack despite all the backlash
- In-Depth: How Google talks to you and what WaveNet is all about (Current)
When a computer talks back to you, it can seem almost magical. However, a lot of research goes into converting text-based answers into spoken ones. In the early days, audio output was assembled from recordings of natural human voices. But given the sheer number of words in human languages, it became clear that we need to teach the computer how to speak properly, rather than record everything.
Background: The Text-to-Voice Technology
There is a lot of history here. I’d like to keep this slightly technical, so let me introduce you to a few basics. The first consumer text-to-speech devices appeared in the late 1970s: the Speak & Spell toys. Texas Instruments began researching speech synthesis in 1976, and this toy was the result of $25,000 in funding. It was originally intended to teach children how to spell. The devices used the TI TMS5100 chip, which implemented a 10th-order linear predictive coding (LPC) model.
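To make the LPC idea concrete, here is a minimal sketch of what a p-th order linear predictor does: it approximates each audio sample as a weighted sum of the p previous samples. The coefficients below are illustrative (chosen so a pure sine is predicted exactly), not the actual TMS5100 parameters.

```python
import numpy as np

def lpc_predict(samples, coeffs):
    """Predict each sample as a weighted sum of the previous p samples.

    samples: 1-D array of audio amplitudes
    coeffs:  p predictor coefficients (illustrative, not TMS5100 values)
    """
    p = len(coeffs)
    predicted = np.zeros_like(samples, dtype=float)
    for n in range(p, len(samples)):
        # y[n] ~ sum_k a[k] * y[n-k]  -- the core LPC approximation
        predicted[n] = np.dot(coeffs, samples[n - p:n][::-1])
    return predicted

# Toy signal: a pure sine is perfectly predictable by a 2nd-order LPC,
# since sin(w*n) = 2*cos(w)*sin(w*(n-1)) - sin(w*(n-2))
w = 2 * np.pi * 0.05
signal = np.sin(w * np.arange(200))
pred = lpc_predict(signal, np.array([2 * np.cos(w), -1.0]))
error = np.max(np.abs(signal[2:] - pred[2:]))
print(f"max prediction error: {error:.2e}")
```

Real speech is far less predictable than a sine wave, which is why a 10th-order model was only a rough approximation and the output sounded robotic.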
This was a very rough approximation of natural human speech, so the sound quality was poor. Still, the core goals of speech output are simple: it should sound natural and be easily understood by the listener.
Synthesising: The Approach
There are two broad ways in which we can approach speech synthesis. They are:
- Concatenative Synthesis
- Formant Synthesis
In concatenative synthesis, a large database of pre-recorded audio samples is used. Various techniques select the relevant bits from these samples and join them into a natural-sounding sentence. This gives you the most natural-sounding output, but there is a caveat: the recorded data is mostly flat, so the result lacks voice modulation.
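A toy sketch of the concatenative idea, assuming we already have a database of pre-recorded unit waveforms (here just short numeric arrays standing in for real audio). Real systems pick the best-matching units from huge databases and smooth the joins; this sketch just glues them end-to-end.

```python
import numpy as np

# Hypothetical pre-recorded units; the keys and values are illustrative
unit_db = {
    "h-e": np.array([0.1, 0.2, 0.1]),
    "e-l": np.array([0.0, -0.1, -0.2]),
    "l-o": np.array([0.3, 0.2, 0.0]),
}

def concatenate_units(unit_names, db):
    """Naively join pre-recorded units end-to-end.

    Real systems select the best-matching units and smooth the joins,
    which is also why the output can still sound flat and unmodulated.
    """
    return np.concatenate([db[name] for name in unit_names])

wave = concatenate_units(["h-e", "e-l", "l-o"], unit_db)
print(len(wave))  # 9 samples: three 3-sample units joined together
```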
Unlike concatenation, formant synthesis creates bits of sound using additive synthesis and physical modelling synthesis. In additive synthesis, different tones are created by adding various sine waves together. Physical modelling uses algorithms to model how a particular set of sounds is physically produced. It is normally used to generate the sounds of musical instruments, but the techniques overlap with speech.
Training voice models
We need to dive into a little linguistics in order to understand how the audio is recorded.
As per Wikipedia, a phone is any distinct speech sound or gesture. In contrast, a phoneme is a speech sound that, if swapped with another phoneme in a given language, would change the meaning of a word. For example, a single phoneme is the difference between someone asking you to kiss them or kill them.
Phones are absolute and not specific to any language, whereas phonemes can only be discussed with reference to a specific language. This helps a lot when training models for many different languages, because each language has a limited phoneme inventory. Spanish, for instance, has only around two dozen phonemes, which makes its speech comparatively easy to synthesize.
The audio team records a bunch of phonemes first, the fundamental units of sound. To make the joins between them sound smoother, Google also uses diphones: units that span the transition from one phone to the next.
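The diphone idea can be sketched in a few lines: given a phoneme sequence, the diphones are simply the adjacent pairs. (The transcription of "hello" below is rough and illustrative.)

```python
def diphones(phonemes):
    """List the diphones (adjacent phoneme pairs) in a phoneme sequence.

    A diphone covers the transition between two phones, which is where
    most coarticulation happens -- recording these transitions is what
    makes concatenated speech sound smoother at the joins.
    """
    return [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]

# "hello" as a rough, illustrative phoneme sequence
print(diphones(["h", "e", "l", "ou"]))  # ['h-e', 'e-l', 'l-ou']
```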
The Google Sound
An amazing video by Nat and Lo highlights how Google started re-training its voice almost a year and a half ago:
So how does Google train its voice models?
The big G uses concatenative synthesis to build its voice samples. A real person (who most likely has to sign a lot of NDAs) is chosen, and not only common phrases but also the individual phonetic units are recorded. This means the audio team records all the sounds needed to build English words.
There are multiple voices for different languages, which means multiple people record them. Every language has a different number of phonemes, so the amount of recording time varies.
What happens next is that these somewhat robotic-sounding audio samples are further refined using WaveNet.
The WaveNet neural network architecture directly generates a raw audio waveform, showing excellent results in text-to-speech and general audio generation. After an audio preprocessing step, the input waveform is quantized to a fixed integer range (256 levels in the original paper). The integer amplitudes are then one-hot encoded to produce a tensor of shape (num_samples, num_channels).
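This preprocessing can be sketched as follows. The WaveNet paper uses mu-law companding to squeeze audio into 256 levels before one-hot encoding; the implementation below is a simplified illustration of that idea, not the tensorflow-wavenet code itself.

```python
import numpy as np

def mu_law_quantize(audio, channels=256):
    """Quantize audio in [-1, 1] to integer levels via mu-law companding,
    as described in the WaveNet paper (mu = channels - 1 = 255)."""
    mu = channels - 1
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Map [-1, 1] to integers in [0, mu]
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int32)

def one_hot(quantized, channels=256):
    """One-hot encode: shape (num_samples,) -> (num_samples, channels)."""
    out = np.zeros((len(quantized), channels), dtype=np.float32)
    out[np.arange(len(quantized)), quantized] = 1.0
    return out

audio = np.sin(np.linspace(0, 2 * np.pi, 100))  # toy waveform in [-1, 1]
encoded = one_hot(mu_law_quantize(audio))
print(encoded.shape)  # (100, 256): num_samples x num_channels
```

Mu-law spends more of the 256 levels on quiet amplitudes, matching how we perceive loudness, which is why it beats plain linear quantization here.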
What Google essentially does is, they transform the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and feed it into WaveNet. This means the network’s predictions are conditioned not only on the previous audio samples but also on the text that they want it to say.
Here’s a sample of random audio output generated with tensorflow-wavenet:
WaveNet is structured differently. It is a fully convolutional neural network in which the convolutional layers have various dilation factors, allowing the receptive field to grow exponentially with depth and cover thousands of timesteps. Training started with the voice datasets Google already had; eventually the network should be able to generate audio on its own.
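The exponential growth of the receptive field is easy to verify with a little arithmetic: each dilated causal convolution with filter width 2 adds `dilation` extra samples of context, and the paper stacks layers whose dilations double from 1 up to 512.

```python
def receptive_field(dilations, filter_width=2):
    """Receptive field of a stack of dilated causal convolutions:
    each layer adds (filter_width - 1) * dilation samples of context,
    on top of the single sample the input itself occupies."""
    return sum((filter_width - 1) * d for d in dilations) + 1

# One WaveNet-style stack: dilations double from 1 to 512
stack = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
print(receptive_field(stack))        # 1024: ten layers cover 1024 timesteps
```

Repeating this stack a few times, as the paper does, multiplies the coverage further while keeping the layer count small, which is what lets WaveNet model raw audio at thousands of samples per second.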
The research papers say that WaveNet generation is still slower than real time, but apparently Google managed to optimize it for speed in this particular case. Some of my guesses are:
- dynamic programming/caching
- running on Google's new TPUs in the cloud
- better training of smaller models
- model compression and hyperparameter sweeps
- some low-level optimizations known only to Google
WaveNets have a lot of possible use cases. Some of these are:
- pleasant voice frontends for automatic translation systems
- highly accurate speech synthesis for film and animation
- audiobooks and music
The bottom line is, there are many use cases of AI being able to generate audio that sounds very human-like. These are exciting times indeed.