Our team was tasked with reproducing the results of Tacotron 2, Google’s artificial neural network for speech synthesis. This is the story of the thorny path we went through during the project. At the very end of the article, we share a few examples of text-to-speech (TTS) conversion using Tacotron 2.
The task of computer speech synthesis has long been the focus of scientists and engineers. However, classic approaches do not synthesize speech indistinguishable from that of a human. That is why, here, as in many other areas, deep learning has come to the rescue.
To start with, let’s look at the classic methods of synthesizing speech from text. Basically, there are two of them: concatenative and parametric.
The concatenative method is based on pre-recorded short audio sequences, which are then combined to create coherent speech. The result is very clean and clear but is absolutely devoid of emotional and intonational components, so it sounds unnatural. This happens because it is impossible to record all possible words uttered with all possible combinations of emotion and prosody.
The use of concatenative TTS is limited by the large amount of data and the enormous development time it requires. Therefore, a statistical parametric speech synthesis method, which explores the very nature of the data, was developed. It generates speech by combining parameters such as frequency, amplitude spectrum, etc.
Parametric synthesis consists of two stages:
These manually configured parameters, along with the linguistic features, are passed to the vocoder model, which performs many complex transformations to generate a sound wave. In doing so, the vocoder estimates speech parameters such as phase, prosody, tone, and others.
If we can approximate the parameters that define speech on each of its samples, then we can create a parametric model. Parametric synthesis requires less data and effort than concatenative systems.
Theoretically, parametric synthesis is simple, but in practice it produces many artifacts: muffled speech with a buzzing undertone that sounds unnatural. This happens because at each stage of the synthesis we hand-pick features that we expect to yield realistic speech. However, these features are chosen based on our own understanding of speech, and human knowledge is far from comprehensive, so they will not necessarily be the best fit for every situation. This is where deep learning takes the stage.
Deep neural networks are powerful tools that can approximate an arbitrarily complex function, that is, map an input space X to an output space Y. In the context of our task, X and Y are text and audio recordings of speech, respectively.
The input data is text. The output is a Mel-spectrogram, a low-level representation obtained by applying a short-time Fourier transform to a discrete audio signal and mapping the resulting frequencies onto the mel scale. The spectrograms obtained in this way still need to be normalized by compressing the dynamic range, which reduces the natural ratio between the loudest and the quietest sounds in the recording. In our experiments, spectrograms scaled to the [-4; 4] range showed the best results.
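The dynamic range compression and rescaling can be illustrated with a short NumPy sketch. It is not the exact pipeline used in the project: the function name, the -100 dB floor, and the 1e-5 clipping constant are assumptions chosen for illustration, and only the [-4; 4] target range comes from the text.

```python
import numpy as np

def normalize_mel(mel_magnitude, min_db=-100.0, max_abs=4.0):
    """Log-compress a magnitude spectrogram and rescale it to [-max_abs, max_abs].

    min_db is an assumed noise floor, not a value taken from the article.
    """
    # Log compression: amplitude -> decibels; the maximum avoids log(0).
    mel_db = 20.0 * np.log10(np.maximum(mel_magnitude, 1e-5))
    # Map [min_db, 0] dB linearly to [0, 1] and clip everything outside.
    scaled = np.clip((mel_db - min_db) / (-min_db), 0.0, 1.0)
    # Map [0, 1] to the symmetric target range.
    return 2.0 * max_abs * scaled - max_abs

# Fake magnitudes standing in for an 80-channel Mel-spectrogram.
mel = np.abs(np.random.default_rng(0).normal(size=(80, 120)))
norm = normalize_mel(mel)
```

With these assumed constants, a magnitude of 1.0 (0 dB) maps to 4 and anything at or below the floor maps to -4.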
We chose LJSpeech as the training dataset. It contains 13,100 audio clips of 2-10 seconds each, with a text transcript for every recording. Using the transformations described above, the audio is encoded into Mel-spectrograms. The text is tokenized and turned into a sequence of integers. Additionally, we normalized all texts, spelled out all numbers as words (“5” -> “five”), and expanded abbreviations (“Mrs. Robinson” -> “Misses Robinson”). After preprocessing, we obtained sets of integer sequences and Mel-spectrograms stored as .npy files.
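The text side of this preprocessing can be sketched in plain Python. The lookup tables and the symbol set below are hypothetical and deliberately tiny; a production cleaner needs a proper number-to-words routine and a much longer abbreviation list.

```python
import re

# Hypothetical, minimal lookup tables for illustration only.
ABBREVIATIONS = {"mrs.": "misses", "mr.": "mister", "dr.": "doctor"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
# Index 0 is reserved for padding, so symbol ids start at 1.
SYMBOLS = "abcdefghijklmnopqrstuvwxyz '.,!?-"
SYMBOL_TO_ID = {ch: i + 1 for i, ch in enumerate(SYMBOLS)}

def normalize(text):
    text = text.lower()
    for abbr, full in ABBREVIATIONS.items():
        # \b keeps "dr." from matching inside a longer word.
        text = re.sub(r"\b" + re.escape(abbr), full, text)
    # Simplification: each digit is spelled out separately ("42" -> "four two").
    text = re.sub(r"\d", lambda m: " " + DIGITS[int(m.group())] + " ", text)
    return re.sub(r"\s+", " ", text).strip()

def text_to_sequence(text):
    # Characters outside the assumed symbol set are silently dropped.
    return [SYMBOL_TO_ID[ch] for ch in normalize(text) if ch in SYMBOL_TO_ID]

cleaned = normalize("Mrs. Robinson has 5 cats")
ids = text_to_sequence("Mrs. Robinson has 5 cats")
```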
To match all the dimensions in the tensor batches at the training stage, we padded the shorter sequences. For text sequences, 0 was reserved for padding; for spectrograms, we padded with frames whose values were slightly lower than the minimum spectrogram value we had defined. This is recommended so that padding can be distinguished from noise and silence.
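A minimal sketch of such batch padding, assuming NumPy arrays, spectrograms normalized to a minimum of -4, and a margin of 0.1 below that minimum (the margin and the function name are illustrative assumptions):

```python
import numpy as np

TEXT_PAD = 0             # index reserved for padding in token sequences
MEL_MIN = -4.0           # lowest value of the normalized spectrograms
MEL_PAD = MEL_MIN - 0.1  # assumed margin: "slightly lower" than the minimum

def pad_batch(token_seqs, mels):
    """token_seqs: list of 1-D int arrays; mels: list of (channels, T) arrays."""
    max_len = max(len(s) for s in token_seqs)
    max_frames = max(m.shape[1] for m in mels)
    n_channels = mels[0].shape[0]
    # Pre-fill with the padding values, then copy the real data on top.
    tokens = np.full((len(token_seqs), max_len), TEXT_PAD, dtype=np.int64)
    frames = np.full((len(mels), n_channels, max_frames), MEL_PAD, dtype=np.float32)
    for i, (seq, mel) in enumerate(zip(token_seqs, mels)):
        tokens[i, :len(seq)] = seq
        frames[i, :, :mel.shape[1]] = mel
    return tokens, frames

tokens, frames = pad_batch(
    [np.array([3, 1, 2]), np.array([5])],
    [np.zeros((80, 4)), np.zeros((80, 2))],
)
```

Because the padding value lies outside the range of real frames, padded positions can later be masked out of the loss without confusing them with silence.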
Now we have data representing text and audio in a form suitable for processing by a neural network. Let’s consider the architecture of the feature prediction net, which we will call Tacotron 2 after the central element of the entire synthesis system.
Tacotron 2 is not one network but two: the feature prediction net and the WaveNet neural vocoder. The feature prediction net is considered the main network, while the WaveNet vocoder plays a supplementary role.
Tacotron 2 has a sequence-to-sequence architecture. It consists of an encoder, which creates an internal representation of the input signal (symbolic tokens), and a decoder, which turns this representation into a Mel-spectrogram. Another very important element of the network is the PostNet, which is designed to improve the spectrogram generated by the decoder.
Let’s examine the network blocks and their modules in detail.
The first layer of the encoder is the embedding layer, which maps the sequence of natural numbers that represent symbols to 512-dimensional vectors. The embedding vectors are then fed into a block of 3 one-dimensional convolutional layers. Each layer has 512 filters of length 5. This is a good filter size because it covers a given character together with its two previous and two next neighbors. Each convolutional layer is followed by batch normalization and ReLU activation.
The tensors produced by the convolutional block are fed into a bidirectional LSTM layer with 256 units in each direction. The forward and backward results are concatenated. The decoder has a recurrent architecture, so the output of the previous step (one frame of the spectrogram) is used at each subsequent step.
Another crucial element of the system is the soft attention mechanism, a relatively new and popular technique. At each decoding step, attention forms the context vector and updates the attention weights using the previous hidden state of the decoder, the encoder outputs, and the previous attention weights.
Attention defines the part of the encoder data that should be used at the current decoder step.
At each decoder step i, the context vector $c_i$ (the attended encoder outputs) is computed as the sum of the encoder outputs $h_j$ weighted by the attention weights $\alpha_{ij}$:
$$c_i = \sum_{j} \alpha_{ij} h_j$$
where $\alpha_{ij}$ are the attention weights, calculated by:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}$$
where $e_{ij}$ is the energy. Since we use a hybrid attention mechanism (both location-based and content-based), the energy is calculated as:
$$e_{ij} = v_a^{\top} \tanh(W s_{i-1} + V h_j + U f_{i,j} + b)$$
where:
$s_{i-1}$: previous hidden state of the decoder’s LSTM network,
$\alpha_{i-1}$: previous attention weights,
$h_j$: j-th hidden encoder state,
$W$, $V$, $U$, $v_a$ and $b$: trained parameters,
$f_{i,j}$: location features calculated by the formula:
$$f_i = F * \alpha_{i-1}$$
where $F$ is a convolution operation and
$\alpha_{i-1}$ are the previous attention weights.
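A single step of this hybrid attention can be sketched in NumPy as follows. The parameter names mirror the symbols defined above; the toy dimensions, the random initialization, and the dictionary layout are assumptions made for illustration and do not reflect the actual model sizes.

```python
import numpy as np

def location_sensitive_attention(s_prev, h, alpha_prev, p):
    """One attention step.

    s_prev: (d_dec,) previous decoder state; h: (T, d_enc) encoder outputs;
    alpha_prev: (T,) previous attention weights; p: dict of parameters.
    """
    # Location features f_i = F * alpha_{i-1}: one 1-D convolution per filter.
    f = np.stack([np.convolve(alpha_prev, k, mode="same") for k in p["F"]], axis=1)
    # Energies e_ij = v_a^T tanh(W s_{i-1} + V h_j + U f_ij + b).
    hidden = np.tanh(p["W"] @ s_prev + h @ p["V"].T + f @ p["U"].T + p["b"])
    e = hidden @ p["v_a"]
    # Softmax over encoder positions (shifted for numerical stability).
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Context vector c_i = sum_j alpha_ij h_j.
    return alpha @ h, alpha

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att, n_filters, k_size = 12, 8, 6, 5, 4, 3  # toy sizes
p = {
    "W": rng.normal(size=(d_att, d_dec)),
    "V": rng.normal(size=(d_att, d_enc)),
    "U": rng.normal(size=(d_att, n_filters)),
    "b": np.zeros(d_att),
    "v_a": rng.normal(size=d_att),
    "F": rng.normal(size=(n_filters, k_size)),
}
h = rng.normal(size=(T, d_enc))
# On the first step, the previous state and weights are zero-valued.
context, alpha = location_sensitive_attention(np.zeros(d_dec), h, np.zeros(T), p)
```

The returned weights sum to one over the encoder positions, and the context vector has the same dimension as a single encoder output.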
Some of the modules use information from the previous decoder step. On the first step, however, this information consists of zero-valued tensors, which is a common approach when building recurrent structures.
First, the decoder output from the previous step is fed into a small PreNet module, a stack of two fully connected layers of 256 neurons each, alternating with dropout layers with a rate of 0.5. A distinctive feature of this module is that dropout is applied not only during training but also at inference.
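The always-on dropout can be illustrated with a small NumPy sketch. The ReLU activation, the inverted-dropout scaling, and the random weights are assumptions; the layer sizes and the dropout rate follow the description above.

```python
import numpy as np

def prenet(x, weights, rng, rate=0.5):
    """Two fully connected ReLU layers; dropout is applied unconditionally."""
    for W, b in weights:
        x = np.maximum(x @ W + b, 0.0)
        # There is no "training" flag: a fresh mask is sampled at inference too.
        mask = rng.random(x.shape) >= rate
        x = x * mask / (1.0 - rate)  # inverted dropout keeps the expected scale
    return x

rng = np.random.default_rng(0)
weights = [
    (rng.normal(size=(80, 256)) * 0.1, np.zeros(256)),
    (rng.normal(size=(256, 256)) * 0.1, np.zeros(256)),
]
frame = rng.normal(size=80)  # stands in for the previous spectrogram frame
out1 = prenet(frame, weights, rng)
out2 = prenet(frame, weights, rng)
```

Calling the module twice on the same frame gives different outputs, which is exactly the noise that stays in place at inference.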
The PreNet output, concatenated with the context vector produced by the attention mechanism, is fed to the input of a unidirectional two-layer LSTM network with 1024 neurons in each layer.
Then the LSTM output, concatenated with the same (or possibly a different) context vector, is fed into a fully connected layer with 80 neurons, which corresponds to the number of spectrogram channels. This final decoder layer forms the predicted spectrogram frame by frame. Its output serves as the PreNet input at the next decoder time step.
Why did we mention that the context vector might be different? One possible approach is to recalculate the context vector once the hidden state of the LSTM network is obtained. However, in our experiments, this approach did not pay off.
In addition to the projection onto the 80-neuron fully connected layer, the LSTM output concatenated with the context vector is fed into a fully connected layer with a single neuron followed by a sigmoid activation. This is the stop token prediction layer: it predicts the probability that the frame created at the current decoder step is the final one. It allows the model to generate spectrograms of arbitrary rather than fixed length at inference, where it determines the number of decoder steps. It can be viewed as a binary classifier.
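How the stop token drives the length of the generated spectrogram can be sketched with a plain Python inference loop. The `decoder_step` callable, the 0.5 threshold, and the step cap are hypothetical; a dummy decoder stands in for the real network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def generate(decoder_step, threshold=0.5, max_steps=1000):
    """Run the decoder until the stop-token probability crosses the threshold.

    decoder_step is a hypothetical callable: previous frame -> (frame, stop_logit).
    """
    frame = [0.0] * 80          # zero-valued input on the first step
    frames = []
    for _ in range(max_steps):  # hard cap in case the stop token never fires
        frame, stop_logit = decoder_step(frame)
        frames.append(frame)
        if sigmoid(stop_logit) > threshold:
            break
    return frames

def make_dummy_decoder(stop_at):
    """Fake decoder that signals the end of the utterance at step `stop_at`."""
    state = {"step": 0}
    def step(prev_frame):
        state["step"] += 1
        stop_logit = 10.0 if state["step"] >= stop_at else -10.0
        return [0.0] * 80, stop_logit
    return step

frames = generate(make_dummy_decoder(stop_at=5))
capped = generate(make_dummy_decoder(stop_at=10**9), max_steps=7)
```

The step cap is a practical safeguard: if attention fails and the stop token never fires, generation still terminates.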
The decoder output at each step is a predicted spectrogram frame. However, this is not the end. To improve the quality of the spectrogram, it is passed through the PostNet module, a stack of 5 one-dimensional convolutional layers with 512 filters each and a filter size of 5. Each layer except the last is followed by batch normalization and a hyperbolic tangent (tanh) activation. To return to the spectrogram dimension, we pass the PostNet output through a fully connected layer with 80 neurons and add the result to the initial decoder output. This yields the Mel-spectrogram generated from the text.
All convolutional modules are regularized with dropout layers with a rate of 0.5, and recurrent layers with the newer zoneout method with a rate of 0.1. The idea is quite simple: instead of passing the hidden state and cell state obtained at the current step on to the next LSTM step, we replace part of their values with the values from the previous step. This is done both during training and at inference. Note that zoneout is applied only to the hidden state that is passed on to the next LSTM step; the output of the LSTM cell at the current step remains unchanged.
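The state replacement behind zoneout can be sketched in a few lines of NumPy. This is a minimal illustration with a per-unit random mask; the function name and the toy state vectors are assumptions.

```python
import numpy as np

def zoneout(prev_state, new_state, rate, rng):
    """Keep each unit of the previous state with probability `rate`."""
    keep_prev = rng.random(prev_state.shape) < rate
    # Units selected by the mask are "zoned out": they skip this update.
    return np.where(keep_prev, prev_state, new_state)

rng = np.random.default_rng(0)
h_prev = np.zeros(1024)  # state from the previous LSTM step
h_new = np.ones(1024)    # state computed at the current step
h_next = zoneout(h_prev, h_new, rate=0.1, rng=rng)
kept_fraction = float((h_next == 0.0).mean())
```

With a rate of 0.1, roughly one unit in ten carries its previous value forward, while the rest take the newly computed state.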
We chose PyTorch as the deep learning framework, although at the time we implemented the network it was still in pre-release. PyTorch is a very powerful tool for building and training artificial neural networks. We had experience with TensorFlow and Keras, but because we needed to implement non-standard custom structures, we went with PyTorch. This does not mean that one framework is better than another; the choice depends on various factors.
The network was trained with the backpropagation algorithm, using Adam as the optimizer. The model was trained on a single GeForce GTX 1080 Ti GPU with 11 GB of memory.
When working with such a large model, it is important to monitor how training is going. TensorBoard proved to be a convenient tool for this. We tracked the loss values on both training and validation iterations. In addition, we displayed the target spectrograms, the spectrograms predicted at the training and validation stages, and the alignment, that is, the attention weights accumulated over all decoder steps.
At first, the attention wasn’t very informative:
However, after all the modules started working properly, we got something like this:
At each decoder step, we try to decode one frame of the spectrogram. However, it is not obvious which part of the encoder information should be used at each step. One might assume that the correspondence is direct. For example, if we have an input text sequence of 200 characters and a corresponding spectrogram of 800 frames, there would be 4 frames per character. However, speech generated from such a spectrogram would sound unnatural. We say some words faster and others slower, sometimes we pause, and sometimes we do not. It is impossible to foresee all possible contexts. That is why attention is a key element of the entire system: it establishes the correspondence between the decoder step and the information from the encoder. In this way we obtain the information needed to generate a specific frame. The greater the attention weight, the more “attention” should be “paid” to the respective part of the encoder data when generating a spectrogram frame.
At the training stage, it is important to assess not only the visual quality of the spectrograms and attention but also the generated audio. However, using WaveNet as a vocoder during training is an impermissible luxury in terms of time. Therefore, we used the Griffin-Lim algorithm, which partially restores the signal from its short-time Fourier transform magnitudes. Why partially? When a signal is converted into a spectrogram, the phase information is lost. Still, the quality of the audio obtained this way is enough to understand whether we are moving in the right direction.
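The idea behind Griffin-Lim can be sketched with NumPy alone: start from random phase and alternate between the time domain and the spectrogram, restoring the known magnitudes at every iteration. This is a simplified illustration that works on a linear magnitude spectrogram (a Mel-spectrogram would first have to be mapped back to linear frequencies); the window, FFT size, hop length, and iteration count are assumptions.

```python
import numpy as np

def stft(x, n_fft, hop, window):
    # Frame the signal and take the FFT of each windowed frame.
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # (n_frames, n_fft // 2 + 1)

def istft(spec, n_fft, hop, window):
    # Weighted overlap-add, normalized by the summed squared window.
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    length = n_fft + hop * (len(spec) - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        x[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_fft=256, hop=64, n_iter=30, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    window = np.hanning(n_fft)
    rng = np.random.default_rng(seed)
    # Start from random phase, since the spectrogram stores none.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop, window)
        # Keep the phase of the re-analyzed signal, restore the known magnitude.
        phase = np.exp(1j * np.angle(stft(signal, n_fft, hop, window)))
    return istft(magnitude * phase, n_fft, hop, window)

# Toy example: a 440 Hz tone sampled at 8 kHz.
t = np.arange(4096) / 8000.0
wave = np.sin(2 * np.pi * 440.0 * t)
mag = np.abs(stft(wave, 256, 64, np.hanning(256)))
restored = griffin_lim(mag)
```

The restored signal matches the target magnitudes only approximately and its phase is an estimate, which is why the result is good enough for monitoring training but not for final audio quality.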
We have accumulated some thoughts on developing text-to-speech synthesis so far. Some of them are general, others are more specific. Here are our tips for those who are considering Tacotron 2 as a text-to-speech solution for their projects.
Thank you for reading to the very end. As a bonus, here are examples of speech generated with the Tacotron 2 architecture from text that is not contained in the training set: