Tacotron 2: Human-like Speech Synthesis From Text By AI


Our team set out to reproduce the results of Tacotron 2, Google’s artificial neural network for speech synthesis. This is the story of the thorny path we took during the project. At the very end of the article, we share a few examples of text-to-speech (TTS) conversion using Tacotron 2.


  1. Conventional Text-to-Speech Approaches
  2. Applying Deep Learning for Data Pre-Processing
  3. Tacotron 2 Architecture Explained
  4. The Workflow with Tacotron 2
  5. Visualization of Tacotron 2 Processing
  6. Key Takeaways

Conventional Text-to-Speech Approaches

The task of computer speech synthesis has long been the focus of scientists and engineers. However, classic approaches do not synthesize speech indistinguishable from that of a human. That is why, here, as in many other areas, deep learning has come to the rescue.

To start with, let’s look at the classic methods of synthesizing speech from text. Basically, there are two of them:

  • Concatenative speech synthesis
  • Parametric speech synthesis

Concatenative Speech Synthesis

This method is based on pre-recorded short audio fragments, which are then combined to create coherent speech. The result is very clean and clear, but it is devoid of emotional and tonal variation, so it sounds unnatural. The reason is that it is impossible to record all possible words uttered with all possible combinations of emotions and prosody.

Parametric Speech Synthesis

The use of the concatenative TTS method is limited by the large amount of data and the enormous development time it requires. Therefore, a statistical parametric speech synthesis method, which explores the very nature of the data, was developed. It generates speech by combining parameters such as frequency and amplitude spectrum.

Parametric synthesis consists of two stages:

  1. Generation of linguistic features (phonemes, duration) that are extracted from the text.
  2. Generation of features that represent the corresponding speech signal (cepstrum, frequency, linear spectrogram, Mel-spectrogram).

These manually configured parameters, along with the linguistic features, are passed to the vocoder model, which performs many complex transformations to generate a sound wave. In doing so, the vocoder estimates speech parameters such as phase, prosody, tone, and others.

If we can approximate the parameters that define speech on each of its samples, then we can create a parametric model. Parametric synthesis requires less data and effort than concatenative systems.

In theory, parametric synthesis is simple, but in practice it produces many artifacts: muffled speech with a buzzing undertone that sounds unnatural. This happens because at each stage of the synthesis we hand-pick the features that we believe lead to realistic speech. These features reflect our own understanding of speech, and since human knowledge is not comprehensive, they will not necessarily be the best fit for every situation. This is where deep learning takes the stage.

Applying Deep Learning for Data Pre-Processing

Deep neural networks are powerful tools that can approximate an arbitrarily complex function, that is, map an input space X to an output space Y. In the context of our task, X and Y are text and audio recordings of speech, respectively.

The input data is text. The output is a Mel-spectrogram: a low-level acoustic representation obtained by applying a short-time Fourier transform to a discrete audio signal and mapping the resulting frequencies onto the Mel scale. The spectrograms obtained in this way still need to be normalized by compressing the dynamic range, which reduces the natural ratio between the loudest and the quietest sounds in the recording. In our experiments, spectrograms scaled to the [-4; 4] range gave the best results.

Mel-spectrogram of speech audio signal reduced to the range [-4; 4]
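The dynamic range compression described above can be sketched as follows. This is only an illustration: the article does not state the exact formula, so the decibel conversion and the -100 dB noise floor (`MIN_LEVEL_DB`) are assumptions; only the target range [-4; 4] comes from the text.

```python
import numpy as np

MIN_LEVEL_DB = -100.0  # assumed noise floor, not stated in the article
MAX_ABS = 4.0          # target range is [-MAX_ABS, MAX_ABS]


def normalize_mel(mel_magnitude):
    """Compress the dynamic range of a magnitude Mel-spectrogram to [-4, 4]."""
    # Convert amplitudes to decibels; the small floor avoids log(0).
    db = 20.0 * np.log10(np.maximum(mel_magnitude, 1e-5))
    # Clip everything outside the assumed [MIN_LEVEL_DB, 0] dB range.
    db = np.clip(db, MIN_LEVEL_DB, 0.0)
    # Map [MIN_LEVEL_DB, 0] linearly onto [-MAX_ABS, MAX_ABS].
    return 2.0 * MAX_ABS * (db - MIN_LEVEL_DB) / (-MIN_LEVEL_DB) - MAX_ABS


def denormalize_mel(normalized):
    """Invert normalize_mel (up to the clipping) back to linear amplitude."""
    db = (normalized + MAX_ABS) * (-MIN_LEVEL_DB) / (2.0 * MAX_ABS) + MIN_LEVEL_DB
    return np.power(10.0, db / 20.0)
```

The inverse function is needed later, when a predicted spectrogram has to be turned back into audio.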

We chose LJSpeech as the training dataset. It contains 13,100 audio clips, 1-10 seconds each, with a text transcript for each clip. Using the transformations described above, the audio is encoded into Mel-spectrograms. The text is tokenized and turned into a sequence of integers. In addition, we normalized all texts, spelled out all numbers as words (“5” -> “five”), and expanded abbreviations (“Mrs. Robinson” -> “Misses Robinson”). After preprocessing, we had sets of numerical sequences and Mel-spectrograms stored as .npy files.
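A minimal sketch of this text preprocessing is shown below. The lookup tables, the symbol set, and the digit-by-digit spelling are hypothetical simplifications chosen for illustration; a real pipeline needs proper number expansion and a much larger abbreviation list.

```python
import re

# Hypothetical lookup tables; a real pipeline would need far larger ones.
ABBREVIATIONS = {"mrs.": "misses", "mr.": "mister", "dr.": "doctor"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

# Index 0 is reserved for padding, so real symbols start at 1.
SYMBOLS = "abcdefghijklmnopqrstuvwxyz !',-.?"
SYMBOL_TO_ID = {ch: i + 1 for i, ch in enumerate(SYMBOLS)}


def normalize_text(text):
    """Lowercase, expand abbreviations, and spell out digits one by one."""
    text = text.lower()
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Spell every digit separately: "42" becomes "four two".
    text = re.sub(r"\d", lambda m: " " + DIGITS[int(m.group())] + " ", text)
    return re.sub(r"\s+", " ", text).strip()


def text_to_sequence(text):
    """Turn normalized text into a list of integer symbol ids."""
    return [SYMBOL_TO_ID[ch] for ch in normalize_text(text) if ch in SYMBOL_TO_ID]


print(normalize_text("Mrs. Robinson has 5 cats"))
# misses robinson has five cats
```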

To make all sequences in a batch the same size during training, we padded the shorter ones. For text sequences, the value 0 was reserved for padding. Spectrograms were padded with frames whose values were slightly lower than the minimum spectrogram value we had defined. This makes it possible to tell padding apart from noise and silence.
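The padding scheme could look roughly like this. The padding value -4.1 is an assumption: the article only says it is slightly below the spectrogram minimum, which is -4 after normalization. The (n_mels, frames) layout is also just a choice made for this sketch.

```python
import numpy as np

TEXT_PAD = 0      # id reserved for padding in token sequences
MEL_PAD = -4.1    # assumed value, slightly below the spectrogram minimum of -4


def pad_batch(token_seqs, mels):
    """Pad token sequences and (n_mels, frames) spectrograms to common lengths."""
    max_text = max(len(seq) for seq in token_seqs)
    max_frames = max(mel.shape[1] for mel in mels)
    n_mels = mels[0].shape[0]

    tokens = np.full((len(token_seqs), max_text), TEXT_PAD, dtype=np.int64)
    targets = np.full((len(mels), n_mels, max_frames), MEL_PAD, dtype=np.float32)
    for i, (seq, mel) in enumerate(zip(token_seqs, mels)):
        tokens[i, :len(seq)] = seq            # real tokens, padding stays 0
        targets[i, :, :mel.shape[1]] = mel    # real frames, padding stays -4.1
    return tokens, targets
```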

Now we have data representing text and audio that are suitable for processing by a neural network. Let’s consider the architecture of the feature prediction net, which we will call Tacotron 2, named after the central element of the entire synthesis system.

Tacotron 2 Architecture Explained

Tacotron 2 is not one network but two: a feature prediction network and the WaveNet neural vocoder. The feature prediction network is considered the main network, while the WaveNet vocoder plays a supplementary role.

Tacotron 2 has a sequence-to-sequence architecture. It consists of an encoder, which creates an internal representation of the input signal (symbolic tokens), and a decoder, which turns this representation into a Mel-spectrogram. Another very important element of the network is the PostNet, which is designed to improve the spectrogram generated by the decoder.

Tacotron 2 network architecture

Let’s examine the network blocks and their modules in detail.

The first layer of the encoder is an embedding layer, which maps the sequence of natural numbers representing the symbols to 512-dimensional vectors. The embedding vectors are then fed into a block of three one-dimensional convolutional layers. Each layer has 512 filters of length 5. This is a good filter size because each filter covers a given character together with its two preceding and two following neighbors. Each convolutional layer is followed by batch normalization and a ReLU activation.

The tensors produced by the convolutional block are passed to a bidirectional LSTM layer with 256 units in each direction. The forward and backward results are concatenated. The decoder has a recurrent architecture, so the output of the previous step (one spectrogram frame) is used as input at each subsequent step.

Another crucial element of the system is the soft attention mechanism, a relatively new and popular technique. At each decoding step, attention forms the context vector and updates the attention weights, using the following data:

  • the previous hidden state of the decoder’s RNN, projected through a fully connected layer,
  • the encoder outputs, projected through a fully connected layer,
  • the cumulative attention weights (accumulated at each decoder time step).

Attention defines the part of the encoder data that should be used at the current decoder step.

Tacotron 2’s attention mechanism

At each decoder step $i$, the context vector $c_i$ is computed (“attended encoder outputs” in the image above) as the sum of the encoder outputs $h_j$ weighted by the attention weights $\alpha_{ij}$:

$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$$

where $T$ is the number of encoder steps and $\alpha_{ij}$ are the attention weights, calculated by:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$

where $e_{ij}$ is the energy. Since in our case we use a hybrid attention mechanism (both location-based and content-based), the energy is calculated as:

$$e_{ij} = v_a^{\top} \tanh(W s_{i-1} + V h_j + U f_{i,j} + b)$$

where:

$s_{i-1}$ is the previous hidden state of the decoder’s LSTM network,

$\alpha_{i-1}$ are the previous attention weights,

$h_j$ is the j-th hidden encoder state,

$W$, $V$, $U$, $v_a$ and $b$ are trained parameters,

$f_{i,j}$ are the location features, calculated by:

$$f_i = F * \alpha_{i-1}$$

where $F$ is a convolution operation and $\alpha_{i-1}$ are the previous attention weights.

Some of the modules use information from the previous step of the decoder. But on the first step, the information will be zero-value tensors, which is a common approach in creating recurrent structures.
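To make the attention computation more concrete, here is a single attention step written in plain NumPy rather than PyTorch. It is a sketch under assumptions: the parameter names mirror the trained parameters listed above, while all dimensions, the number of location filters, and the random values are made up for illustration. The first step starts from zero-valued tensors, as described.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_step(s_prev, alpha_prev, h, params):
    """One step of hybrid (content + location) attention.

    s_prev: (d_dec,) previous decoder state, alpha_prev: (T,) previous
    attention weights, h: (T, d_enc) encoder outputs.
    """
    W, V, U, v_a, b, F = (params[k] for k in ("W", "V", "U", "v_a", "b", "F"))
    # Location features: convolve the previous weights with every filter in F.
    f = np.stack([np.convolve(alpha_prev, kernel, mode="same") for kernel in F], axis=1)
    # Energies: v_a^T tanh(W s_prev + V h_j + U f_j + b) for every encoder step j.
    energies = np.tanh(s_prev @ W + h @ V + f @ U + b) @ v_a
    alpha = softmax(energies)   # attention weights over the encoder steps
    context = alpha @ h         # weighted sum of the encoder outputs
    return context, alpha


# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att, n_filters, kernel_size = 12, 8, 6, 5, 4, 3
params = {
    "W": rng.normal(size=(d_dec, d_att)),
    "V": rng.normal(size=(d_enc, d_att)),
    "U": rng.normal(size=(n_filters, d_att)),
    "v_a": rng.normal(size=d_att),
    "b": np.zeros(d_att),
    "F": rng.normal(size=(n_filters, kernel_size)),
}
h = rng.normal(size=(T, d_enc))

# On the first decoder step, the previous state and weights are zero tensors.
context, alpha = attention_step(np.zeros(d_dec), np.zeros(T), h, params)
```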

The Workflow with Tacotron 2

First, the decoder output from the previous step is passed into a small PreNet module, a stack of two fully connected layers of 256 neurons each, alternating with dropout layers with a rate of 0.5. A distinctive feature of this module is that dropout is applied not only during training but also at inference.
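The PreNet with its always-on dropout can be illustrated with a small NumPy sketch. The ReLU activations are an assumption taken from the original Tacotron 2 paper (the text above does not name the activation), and the random weights stand in for trained ones.

```python
import numpy as np


def prenet(frame, weights, rng, rate=0.5):
    """Two ReLU layers, each followed by dropout that is never switched off."""
    x = frame
    for W, b in weights:
        x = np.maximum(x @ W + b, 0.0)        # fully connected layer + ReLU
        keep = rng.random(x.shape) >= rate    # fresh dropout mask on every call
        x = x * keep / (1.0 - rate)           # inverted dropout scaling
    return x


rng = np.random.default_rng(0)
weights = [
    (rng.normal(size=(80, 256)) * 0.1, np.zeros(256)),   # 80 Mel channels in
    (rng.normal(size=(256, 256)) * 0.1, np.zeros(256)),
]
frame = rng.normal(size=80)
out_a = prenet(frame, weights, rng)
out_b = prenet(frame, weights, rng)   # same input, different dropout mask
```

Because there is no training/inference switch, two calls with the same frame give different outputs.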

The PreNet output, concatenated with the context vector produced by the attention mechanism, is fed into a unidirectional two-layer LSTM network with 1024 neurons in each layer.

Then the LSTM output, concatenated with the same (or possibly a different) context vector, is passed into a fully connected layer with 80 neurons, which corresponds to the number of spectrogram channels. This final decoder layer forms the predicted spectrogram frame by frame. Its output is fed through the PreNet as the input to the next decoder time step.

Why did we mention that the context vector might be different? One possible approach is to recalculate the context vector once the new hidden state of the LSTM network is available. In our experiments, however, this approach did not pay off.

In addition to the projection onto the 80-neuron fully connected layer, the concatenation of the LSTM output and the context vector is passed into a fully connected layer with a single neuron followed by a sigmoid activation. This is the stop token prediction layer. It predicts the probability that the frame produced at the current decoder step is the final one, so it can be viewed as a binary classifier. Thanks to this layer, the model can generate spectrograms of arbitrary rather than fixed length at inference, because it determines the number of decoder steps.
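The role of the stop token at inference can be sketched as a simple loop. Here `decoder_step` is a hypothetical stand-in for one step of the real decoder, and the 0.5 threshold and the step cap are assumptions, not values from the article.

```python
import math


def sigmoid(x):
    # Numerically stable form of 1 / (1 + exp(-x)).
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def decode_until_stop(decoder_step, max_steps=1000, threshold=0.5):
    """Run a decoder step function until the stop token fires.

    decoder_step is a hypothetical callable: it takes the step index and
    returns (frame, stop_logit) for that step.
    """
    frames = []
    for step in range(max_steps):              # hard cap if the token never fires
        frame, stop_logit = decoder_step(step)
        frames.append(frame)
        if sigmoid(stop_logit) > threshold:    # probability that this frame is last
            break
    return frames


# Toy decoder: the stop logit grows with every step and turns positive at step 6.
frames = decode_until_stop(lambda step: ([0.0] * 80, step - 5.0))
print(len(frames))
# 7
```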

The decoder outputs from all steps together form the predicted spectrogram. However, this is not the end. To improve its quality, the spectrogram is passed through the PostNet module, a stack of five one-dimensional convolutional layers with 512 filters each and a filter size of 5. Each layer except the last is followed by batch normalization and a hyperbolic tangent (tanh) activation. To return to the spectrogram dimension, we pass the PostNet output through a fully connected layer with 80 neurons and add the result to the initial decoder output. This gives us the Mel-spectrogram generated from the text.

All convolutional modules are regularized with dropout at a rate of 0.5, and the recurrent layers with the newer zoneout method at a rate of 0.1. The idea is quite simple: instead of passing the hidden state and cell state obtained at the current step on to the next LSTM step unchanged, we replace part of the values with those from the previous step. This is done both during training and at inference. Note that zoneout is applied only to the state that is passed on to the next LSTM step; the output of the LSTM cell at the current step remains unchanged.
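A minimal NumPy sketch of zoneout, assuming a per-unit random mask with the rate of 0.1 mentioned above; the state vectors are toy values chosen so that the effect is easy to see.

```python
import numpy as np


def zoneout(prev_state, new_state, rng, rate=0.1):
    """Keep a random `rate` fraction of units at their previous values."""
    keep_old = rng.random(new_state.shape) < rate
    return np.where(keep_old, prev_state, new_state)


rng = np.random.default_rng(0)
h_prev, h_new = np.zeros(10000), np.ones(10000)
h_next = zoneout(h_prev, h_new, rng)   # state handed to the next LSTM step
output = h_new                         # the output of the current step is untouched
```

Roughly 10% of the entries in `h_next` keep their previous value of 0, while `output` is passed on unchanged.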

We chose PyTorch as our deep learning framework, even though it was still in pre-release at the time we implemented the network. PyTorch is a very powerful tool for building and training artificial neural networks. We had experience with TensorFlow and Keras, but the need to implement non-standard custom structures led us to PyTorch. This does not mean that one framework is better than another; the choice depends on various factors.

The network was trained with the backpropagation algorithm, using Adam as the optimizer. Training ran on a single GeForce GTX 1080 Ti GPU with 11 GB of memory.

Visualization of Tacotron 2 Processing

When working with such a large model, it is important to monitor how training is going. TensorBoard proved to be a convenient tool for this. We tracked the error values on both training and validation iterations. In addition, we displayed the target spectrograms, the spectrograms predicted at the training and validation stages, and the alignment, that is, the attention weights accumulated over all decoder steps.

At first, the attention wasn’t very informative:

Tacotron 2 poorly trained attention weights.

However, once all the modules started working properly, the alignment looked like this:

Tacotron 2 well-trained attention weights.

At each decoder step, we try to decode one frame of the spectrogram. However, it is not obvious which part of the encoder output should be used at each step. One might assume a direct correspondence: for example, if we have an input text of 200 characters and a corresponding spectrogram of 800 frames, there would be 4 frames per character. However, speech generated from such a spectrogram would sound unnatural. We say some words faster and others slower, sometimes we pause, and sometimes we don’t. It is impossible to foresee all possible contexts. That is why attention is a key element of the entire system: it establishes the correspondence between the decoder step and the information from the encoder, which gives us the information needed to generate a specific frame. The greater the attention weight, the more “attention” is “paid” to the respective part of the encoder data when generating a spectrogram frame.

At the training stage, it is important to assess not only the spectrograms and the attention visually but also the generated audio. However, using WaveNet as a vocoder during training is an unaffordable luxury in terms of time. Therefore, we used the Griffin-Lim algorithm, which partially restores the signal from its spectrogram. Why partially? When a signal is converted into a spectrogram, the phase information is lost. Still, the quality of audio obtained in this way is sufficient to tell whether we are moving in the right direction.
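For readers unfamiliar with Griffin-Lim, the sketch below shows the core idea in plain NumPy: start from a random phase and alternate between the time and frequency domains, restoring the known magnitude each time. It works on a linear magnitude spectrogram; a predicted Mel-spectrogram would first have to be denormalized and mapped back to linear frequencies. The hop size, the sample rate, and the iteration count are assumptions, and this is not the implementation used in the project.

```python
import numpy as np

N_FFT, HOP = 1024, 256          # window size and assumed hop size
WINDOW = np.hanning(N_FFT)


def stft(signal):
    """Complex STFT with shape (frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(signal) - N_FFT) // HOP
    frames = np.stack([signal[i * HOP:i * HOP + N_FFT] for i in range(n_frames)])
    return np.fft.rfft(frames * WINDOW, axis=1)


def istft(spec, length):
    """Inverse STFT by windowed overlap-add, normalized by the window energy."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    signal, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        signal[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    # Samples that are barely covered by any window are set to zero.
    covered = norm > 1e-2 * norm.max()
    signal[covered] /= norm[covered]
    signal[~covered] = 0.0
    return signal


def griffin_lim(magnitude, length, n_iter=30, seed=0):
    """Estimate a waveform whose STFT magnitude matches `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from a random phase, since the true phase was discarded.
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        signal = istft(magnitude * angles, length)
        # Keep the phase of the re-analysed signal, restore the known magnitude.
        angles = np.exp(1j * np.angle(stft(signal)))
    return istft(magnitude * angles, length)


t = np.arange(N_FFT + 20 * HOP) / 22050.0     # assumed sample rate
original = np.sin(2 * np.pi * 440.0 * t)
magnitude = np.abs(stft(original))            # the phase is thrown away here
restored = griffin_lim(magnitude, len(original))
```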

Key Takeaways

We have accumulated some thoughts on the development process with text to speech synthesis so far. Some of them are common, others are more specific and unique. Here are our tips for those who consider Tacotron 2 as a text-to-speech solution for their projects.

General Tips on the Workflow with Tacotron 2:

  • Use a version control system that clearly describes all changes. While searching for optimal architecture, changes occur constantly. Make sure you are making checkpoints each time you obtain some satisfactory results.
  • In Tacotron 2 and similar architectures, you should follow the principles of encapsulation: one class per Python module. This approach is rarely found in ML projects, but it helps to structure the code and speeds up both debugging and development.
  • Provide classes with NumPy-style docstrings. This will simplify teamwork.
  • Design the architecture of your model. It will help you get a high-level view of the model, get feedback from colleagues on the architecture, and quickly identify discrepancies.
  • It is better to work with a team. If you work alone, consider gathering colleagues and discussing your project. Curious questions will lead to new thoughts and reveal inaccuracies that could prevent the successful training of the model.
  • Another useful tip concerns data preprocessing: increase the size of the fast Fourier transform window. The default setting is 1024; increase it 4 or even 8 times. This “compresses” the spectrograms and significantly speeds up training. The restored audio will be of lower quality, but in 2-3 hours you will get an alignment of the attention weights. This indicates that the architecture is correct and that you can move on to tests with big data.
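As an illustration of the docstring tip above, a NumPy-style docstring for a module such as the PreNet might look like this. The class body is a hypothetical skeleton, not the project’s actual code.

```python
class PreNet:
    """Two fully connected layers with dropout applied to decoder input frames.

    Parameters
    ----------
    in_dim : int
        Size of one spectrogram frame (number of Mel channels).
    hidden_dim : int, optional
        Number of units in each fully connected layer.
    dropout_rate : float, optional
        Dropout probability, kept active during inference as well.

    Attributes
    ----------
    layer_sizes : list of tuple of int
        (input, output) sizes of the two layers.
    """

    def __init__(self, in_dim, hidden_dim=256, dropout_rate=0.5):
        self.dropout_rate = dropout_rate
        self.layer_sizes = [(in_dim, hidden_dim), (hidden_dim, hidden_dim)]
```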

Building and Training of Text-to-Speech Models Based on Tacotron 2 Architecture:

  • Forming batches from sequences of similar length (rather than randomly) speeds up training and improves the quality of the generated spectrograms.
  • Use modern weight initialization algorithms. For example, we used Xavier uniform initialization. If a module needs both batch normalization and an activation function, apply them in that order: normalization first, then the activation. For example, if you apply ReLU first, you immediately lose all the negative values that should take part in normalizing the data of a given batch.
  • After a certain training step, switch to a dynamic learning rate. This helps reduce the error and increase the quality of the generated spectrograms.
  • If you have created a model but fail to train it on batches from the entire dataset, try overfitting it on a single batch. If you succeed, you will get an alignment, and the audio recreated from the generated spectrograms will contain speech (or at least something similar to it). This will confirm that the overall architecture is correct and only small details are missing.
  • Speaking of these details: errors in the construction of the model can vary. For example, in our initial experiments we made a common mistake: an incorrect activation function after a fully connected layer. Therefore, always ask yourself why you use a particular activation function in a specific layer. It is useful to decompose everything into separate modules, which makes it easier to inspect each element of the model.
  • When working with RNNs, we tried passing the final hidden and cell states on as the initial states for the next training iteration. This approach did not pay off. It could give the network some hidden view of the entire dataset, but that was unnecessary for our task. A much more interesting and appropriate approach may be to train the initial hidden state of the LSTM layers in the same way as ordinary weights.
  • While working with seq2seq models, you will face the problem of different sequence lengths within a batch. It is easily solved by padding: reserved characters for the encoder input, or frames with specific values for the decoder. However, you must then apply the error function to the predicted and real spectrograms properly. We used a mask in the error function so that the error is computed only over real frames, excluding the padding.
  • We also have a specific recommendation for PyTorch: use the torch.nn.LSTM class rather than torch.nn.LSTMCell where possible. torch.nn.LSTM processes a whole sequence in one call backed by the cuDNN library, whereas LSTMCell has to be stepped through time in a Python loop. This can speed up the system significantly.
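Two of the tips above, length-based batching and the masked error function, can be sketched in NumPy as follows. The mean squared error is an assumption (the text only speaks of an “error function”), and the tensor layout (batch, n_mels, frames) is chosen just for this example.

```python
import numpy as np


def length_sorted_batches(lengths, batch_size):
    """Group sample indices so that each batch holds sequences of similar length."""
    order = np.argsort(lengths)
    return [order[i:i + batch_size].tolist() for i in range(0, len(order), batch_size)]


def masked_mse(predicted, target, frame_lengths):
    """Mean squared error over real frames only.

    predicted, target: (batch, n_mels, max_frames); frame_lengths: (batch,).
    """
    max_frames = target.shape[2]
    # mask[b, t] is True for real frames and False for padding.
    mask = np.arange(max_frames)[None, :] < np.asarray(frame_lengths)[:, None]
    squared_error = (predicted - target) ** 2 * mask[:, None, :]
    # Divide by the number of real values, not by the padded tensor size.
    return squared_error.sum() / (mask.sum() * target.shape[1])


print(length_sorted_batches([5, 2, 9, 3], batch_size=2))
# [[1, 3], [0, 2]]
```

In practice, the order of the batches themselves would still be shuffled between epochs so that the model does not always see short sequences first.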

Thank you for reading until the very end. As a bonus, here are a few examples of speech generated with the Tacotron 2 architecture from text that is not contained in the training set: