Neural Network Speech Synthesis using the Tacotron 2 Architecture, or “Get Alignment or Die Tryin”
Our team was tasked with reproducing the results of Tacotron 2, Google’s artificial neural network for speech synthesis. This is the story of the thorny path we walked during the project.
Computer speech synthesis has long been a focus of scientists and engineers. However, classical methods cannot synthesize speech that is indistinguishable from a human voice. That is why here, as in many other areas, deep learning has come to the rescue.
To start with, let’s have a look at the classical methods of speech synthesis.
Concatenative speech synthesis
This method relies on pre-recorded short audio segments, which are then combined to form coherent speech. The result is clean and clear but lacks emotional and intonational nuance, so it sounds unnatural. This is because it is impossible to record every possible word uttered with every possible combination of emotion and prosody. Concatenative systems require huge databases and hard-coded rules for combining segments into words. Developing a reliable system takes a lot of time.
Parametric speech synthesis
The use of concatenative TTS is limited by its large data requirements and long development time. Therefore, a statistical method that models the nature of the data itself was developed. It generates speech by combining parameters such as frequency, amplitude spectrum, etc.
Parametric synthesis consists of two stages:
First, linguistic features (for example, phonemes, duration, etc.) are extracted from the text.
Then, features that represent the corresponding speech signal are extracted: cepstrum, frequency, linear spectrogram, Mel spectrogram. A vocoder (a system that generates a waveform) is driven by these features.
These manually configured parameters, along with the linguistic features, are passed to the vocoder model, which performs many complex transformations to generate a sound wave. In doing so, the vocoder estimates speech parameters such as phase, prosody, intonation, and more.
If we can approximate the parameters that define speech on each of its samples, then we can create a parametric model. Parametric synthesis requires much less data and hard work than concatenative systems.
Theoretically, parametric synthesis is simple, but in practice, there are many artifacts that lead to the production of muffled speech with a “buzzing” sidetone, which doesn’t sound like a natural sound.
The reason is that at each stage of the synthesis we hand-pick some features and hope they yield realistic speech. However, the selected features reflect our own understanding of speech, and human knowledge is far from complete, so they will not necessarily be the best possible choice. This is where deep learning takes the stage in all its glory.
Deep neural networks are a powerful tool that can, in theory, approximate an arbitrarily complex function, that is, map an input space X to an output space Y. In the context of our task, these are, respectively, text and an audio recording of speech.
To begin with, let’s define what we have as input and what we want to get at the output. The input is text, and the output is a Mel-spectrogram, a low-level representation obtained by applying a short-time Fourier transform to a discrete audio signal and mapping the resulting frequencies onto the Mel scale. The spectrograms obtained in this way still need to be normalized by compressing the dynamic range, which reduces the natural ratio between the loudest and the quietest sounds in the recording. In our experiments, spectrograms scaled to the [-4; 4] range worked best.
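A minimal sketch of the normalization step, assuming the spectrogram holds linear amplitudes, that the range is compressed by converting to decibels, and that everything below an assumed floor of -100 dB is clipped. The floor and the names `normalize_mel`, `MIN_LEVEL_DB` and `MAX_ABS_VALUE` are illustrative assumptions; only the target range [-4; 4] comes from the text above.

```python
import numpy as np

MIN_LEVEL_DB = -100.0  # assumed noise floor; not stated in the article
MAX_ABS_VALUE = 4.0    # target range [-4, 4] from the article


def normalize_mel(mel_amplitude):
    """Compress the dynamic range of a mel spectrogram and scale it to [-4, 4]."""
    # Log compression: amplitude -> decibels (the small floor avoids log(0)).
    mel_db = 20.0 * np.log10(np.maximum(mel_amplitude, 1e-5))
    # Map [MIN_LEVEL_DB, 0] dB linearly onto [0, 1], clipping anything outside.
    scaled = np.clip((mel_db - MIN_LEVEL_DB) / -MIN_LEVEL_DB, 0.0, 1.0)
    # Stretch [0, 1] onto the symmetric range [-4, 4].
    return 2.0 * MAX_ABS_VALUE * scaled - MAX_ABS_VALUE


mel = np.array([[1.0, 0.1], [0.001, 1e-5]])  # toy 2x2 "spectrogram"
normalized = normalize_mel(mel)
```

With these assumptions the loudest value maps to 4 and anything at or below the floor maps to -4.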
Figure 1: Mel-spectrogram of speech audio signal reduced to the range [-4; 4]
We chose LJSpeech as the training data set. It contains 13,100 audio clips of 2-10 seconds each, along with a text file transcribing the English speech in each recording. Using the transformations described above, the audio is encoded into Mel-spectrograms. The text is tokenized and turned into a sequence of integers. Note that all texts are normalized: numbers are written out as words and abbreviations are expanded (for example, “Mrs. Robinson” becomes “Missis Robinson”). After preprocessing, we get sets of numpy arrays of integer sequences and Mel-spectrograms saved as npy files on disk.
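The text side of this preprocessing can be sketched as follows. The character set, the abbreviation table and the function names are hypothetical, and number expansion is left out for brevity; a real normalizer covers far more cases.

```python
PAD_ID = 0  # index 0 is reserved for padding
SYMBOLS = "abcdefghijklmnopqrstuvwxyz !',.?-"  # assumed character set
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(SYMBOLS)}

# Hypothetical abbreviation table used only for this illustration.
ABBREVIATIONS = {"mrs.": "missis", "mr.": "mister", "dr.": "doctor"}


def normalize_text(text):
    """Lowercase the text and expand a few known abbreviations."""
    words = text.lower().split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)


def text_to_sequence(text):
    """Turn normalized text into a list of integer token ids (unknown chars are dropped)."""
    return [CHAR_TO_ID[ch] for ch in normalize_text(text) if ch in CHAR_TO_ID]


ids = text_to_sequence("Mrs. Robinson")
```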
To make all dimensions match within a batch tensor during training, we pad the shorter sequences. For text sequences, 0 is reserved for padding; for spectrograms, we use frames whose values are slightly lower than the minimum spectrogram value we defined. This is recommended so that padding can be told apart from noise and silence.
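A minimal sketch of such padding, assuming spectrograms are stored as (frames, channels) arrays and using -4.1 as an assumed padding value just below the [-4; 4] range. The function names and the exact padding value are illustrative.

```python
import numpy as np

TEXT_PAD = 0    # reserved token id for text padding
MEL_PAD = -4.1  # assumed value, slightly below the spectrogram minimum of -4


def pad_texts(sequences):
    """Right-pad integer sequences with TEXT_PAD to a common length."""
    max_len = max(len(s) for s in sequences)
    batch = np.full((len(sequences), max_len), TEXT_PAD, dtype=np.int64)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq
    return batch


def pad_mels(mels):
    """Right-pad (frames, channels) spectrograms with MEL_PAD frames."""
    max_frames = max(m.shape[0] for m in mels)
    n_channels = mels[0].shape[1]
    batch = np.full((len(mels), max_frames, n_channels), MEL_PAD, dtype=np.float32)
    lengths = np.array([m.shape[0] for m in mels])
    for i, mel in enumerate(mels):
        batch[i, :mel.shape[0]] = mel
    return batch, lengths


texts = pad_texts([[5, 3, 9], [7, 2]])
mels, mel_lengths = pad_mels([np.zeros((4, 80)), np.zeros((2, 80))])
```

Returning the true lengths alongside the padded batch makes it possible to ignore the padded frames later, for example in the error function.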
Now we have data representing texts and audio that are suitable for processing by an artificial neural network. Let’s consider the architecture of Feature prediction net, which we will call Tacotron2 by the name of the central element of the entire synthesis system.
Tacotron 2 is not one network, but two: Feature prediction net and NN-vocoder WaveNet. The original article, as well as our own vision of the work, makes it possible to consider the Feature prediction net as a leading network, while the WaveNet vocoder plays the role of a peripheral system.
Tacotron 2 is a sequence-to-sequence architecture. It consists of an encoder, which creates an internal representation of the input signal (character tokens), and a decoder, which turns this representation into a Mel-spectrogram. Another very important element of the network is the so-called PostNet, designed to improve the spectrogram generated by the decoder.
Figure 2: Tacotron 2 network architecture.
Let us examine the network blocks and their modules in more detail.
The first layer of the encoder is the Embedding layer. It maps a sequence of natural numbers representing characters to multidimensional (512-dimensional) vectors. The embedding vectors are then fed into a block of three one-dimensional convolutional layers, each with 512 filters of length 5. This is a good filter size in this context because it covers a given character together with its two preceding and two following neighbors. Each convolutional layer is followed by batch normalization and ReLU activation.
The tensors obtained after the convolutional block are fed to bidirectional LSTM layers, 256 neurons each. The forward and backward results are concatenated. The decoder has a recurrent architecture, that is, at each subsequent step, the output (one frame of the spectrogram) from the previous step is used.
Another important, if not the key element of this system, is the mechanism of soft attention, a relatively new and increasingly popular technique. At each decoder step, attention uses the following to form the context vector and update the attention weight:
the projection of the previous hidden state of the decoder’s RNN through a fully connected layer;
the projection of the encoder outputs through a fully connected layer;
the cumulative attention weights (accumulated over the decoder’s time steps).
The idea of attention should be understood as: “what part of the encoder data should be used at the current decoder step.”
Figure 3: Attention mechanism
At each decoder step i, the context vector $c_i$ is computed (in the figure above it is designated as “attended encoder outputs”) as the sum of the encoder outputs $h_j$ weighted by the attention weights $\alpha_{ij}$:
$$c_i = \sum_{j} \alpha_{ij} h_j$$
where $\alpha_{ij}$ are the attention weights, calculated by the formula:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}$$
where $e_{ij}$ is the so-called “energy,” the calculation formula of which depends on the type of attention mechanism you use (in our case it is a hybrid type using both location-based attention and content-based attention). The energy is calculated by the formula:
$$e_{ij} = v_a^{\top} \tanh(W s_{i-1} + V h_j + U f_{i,j} + b)$$
$s_{i-1}$ is the previous hidden state of the decoder’s LSTM network,
$\alpha_{i-1}$ are the previous attention weights,
$h_j$ is the j-th hidden encoder state,
$W$, $V$, $U$, $v_a$ and $b$ are trained parameters,
$f_{i,j}$ are the location features, calculated by the formula:
$$f_i = F * \alpha_{i-1}$$
where $F$ is a convolution operation.
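The formulas above can be sketched in NumPy as a single decoder step. All sizes, the random parameters and the name `attention_step` are assumptions for illustration; in a real model W, V, U, v_a, b and the filters F are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim = 6, 8, 10, 16  # toy sizes, chosen for illustration
n_filters, kernel = 4, 3                     # assumed shape of the location convolution F

# Parameters W, V, U, v_a, b and the convolution filters F (random here, trained in practice).
W = rng.normal(size=(att_dim, dec_dim))
V = rng.normal(size=(att_dim, enc_dim))
U = rng.normal(size=(att_dim, n_filters))
v_a = rng.normal(size=att_dim)
b = np.zeros(att_dim)
F = rng.normal(size=(n_filters, kernel))


def attention_step(s_prev, h, alpha_prev):
    """One decoder step: returns the context vector c_i and the new weights alpha_i."""
    # f_i = F * alpha_{i-1}: one "same"-padded 1-D convolution per filter -> (T, n_filters).
    f = np.stack([np.convolve(alpha_prev, k, mode="same") for k in F], axis=1)
    # e_ij = v_a^T tanh(W s_{i-1} + V h_j + U f_ij + b), computed for all j at once.
    energies = np.tanh(W @ s_prev + h @ V.T + f @ U.T + b) @ v_a
    # alpha_ij = softmax over encoder positions j (shifted for numerical stability).
    exp = np.exp(energies - energies.max())
    alpha = exp / exp.sum()
    # c_i = sum_j alpha_ij * h_j
    return alpha @ h, alpha


h = rng.normal(size=(T, enc_dim))  # encoder outputs
alpha0 = np.zeros(T)               # zero-valued weights for the first decoder step
context, alpha1 = attention_step(np.zeros(dec_dim), h, alpha0)
```

The weights returned by each step form one column of the alignment plot, and they are fed back as `alpha_prev` on the next step.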
Some of the modules described below use information from the previous step of the decoder. But if this is the first step, then the information will be zero-value tensors, which is a common practice in creating recurrent structures.
Algorithm of work
First, the decoder output from the previous step is fed into a small PreNet module, a stack of two fully connected layers of 256 neurons each, alternating with dropout layers with a rate of 0.5. A distinctive feature of this module is that dropout is applied not only during training but also at inference.
The PreNet output, concatenated with the context vector produced by the attention mechanism, is fed into a unidirectional two-layer LSTM network with 1024 neurons in each layer.
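A minimal sketch of a PreNet with always-on dropout. The ReLU activations, the random weights and the function names are assumptions made for this illustration; the point is only that the dropout call has no train/eval switch.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(80, 256))   # 80 spectrogram channels -> 256 units
W2 = rng.normal(scale=0.1, size=(256, 256))  # 256 -> 256 units


def dropout(x, rate=0.5):
    """Inverted dropout that is applied unconditionally (no train/eval switch)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)


def prenet(frame):
    """Two fully connected layers (ReLU assumed), each followed by always-on dropout."""
    x = dropout(np.maximum(frame @ W1, 0.0))
    return dropout(np.maximum(x @ W2, 0.0))


frame = np.ones(80)
out_a, out_b = prenet(frame), prenet(frame)  # same input, different dropout masks
```

Because the masks are resampled on every call, two passes over the same frame give different outputs even outside training.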
Then the concatenation of the output data of LSTM-layers with the same (and possibly other) context vector is fed into a fully connected layer with 80 neurons, which corresponds to the number of spectrogram channels. This final decoder layer forms the predicted spectrogram frame by frame. Its output is fed as input to the next time step of the decoder in PreNet.
Why did we mention in the previous paragraph that the context vector might be a different one? One possible approach is to recompute the context vector after the hidden state of the LSTM network has been obtained at the current step. However, in our experiments this approach did not pay off.
In addition to the projection onto the 80-neuron fully connected layer, the concatenation of the LSTM outputs with the context vector is fed into a fully connected layer with a single neuron followed by sigmoid activation. This is the “stop token prediction” layer. It predicts the probability that the frame created at the current decoder step is the final one. This layer makes it possible to generate a spectrogram of arbitrary rather than fixed length at inference, that is, it determines the number of decoder steps. It can be thought of as a binary classifier.
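How the stop token bounds the decoding loop at inference can be sketched like this. The threshold of 0.5, the step limit and the callable `step_fn` are assumptions; `step_fn` stands in for one full decoder step.

```python
import math

STOP_THRESHOLD = 0.5      # assumed decision threshold
MAX_DECODER_STEPS = 1000  # assumed safety limit in case the stop token never fires


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode(step_fn):
    """Run decoder steps until the stop-token probability crosses the threshold.

    `step_fn(i)` is a hypothetical callable returning (frame, stop_logit) for step i.
    """
    frames = []
    for i in range(MAX_DECODER_STEPS):
        frame, stop_logit = step_fn(i)
        frames.append(frame)
        if sigmoid(stop_logit) > STOP_THRESHOLD:
            break
    return frames


# Toy decoder: the stop logit grows with the step index and turns positive at step 5.
frames = decode(lambda i: ([0.0] * 80, i - 4.5))
```

The step limit is a practical guard: if attention fails, the stop probability may never rise and the loop would otherwise run forever.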
The decoder output from each step is a predicted spectrogram frame. However, this is not all. To improve the spectrogram quality, it is passed through the PostNet module, a stack of five one-dimensional convolutional layers with 512 filters each and a filter size of 5. Each layer except the last is followed by batch normalization and tanh (hyperbolic tangent) activation. To return to the spectrogram dimension, we pass the PostNet output through a fully connected layer with 80 neurons and add the result to the initial output of the decoder. The result is the Mel-spectrogram generated from the text.
All convolutional modules are regularized with dropout layers with a rate of 0.5, and recurrent layers with the newer Zoneout method with a rate of 0.1. The idea is quite simple: instead of passing the hidden state and cell state obtained at the current step on to the next LSTM step unchanged, we replace part of the values with the values from the previous step. This is done both during training and at inference. Note that Zoneout is applied only to the hidden state that is passed on to the next LSTM step, while the output of the LSTM cell at the current step remains unchanged.
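The core of Zoneout fits in a few lines. This is a sketch of the random-mask variant described above, with toy state vectors; the function name is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)


def zoneout(h_prev, h_new, rate=0.1):
    """Keep each unit's previous value with probability `rate`, else take the new one."""
    keep_prev = rng.random(h_new.shape) < rate
    return np.where(keep_prev, h_prev, h_new)


h_prev = np.zeros(1024)  # state from the previous LSTM step
h_new = np.ones(1024)    # state computed at the current step
h_next = zoneout(h_prev, h_new)  # roughly 10% of the units keep their old value
```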
We chose PyTorch as our deep learning framework. Although it was still in pre-release at the time we implemented the network, it was already a very powerful tool for building and training artificial neural networks. We had also used other frameworks in our work, such as TensorFlow and Keras. However, Keras was ruled out because we needed to implement non-standard custom structures, and compared with TensorFlow, a PyTorch model does not feel torn out of the Python language. That said, we do not claim that one of them is better and the other worse; the choice of framework may depend on various factors.
The network is trained with backpropagation. Adam is used as the optimizer. The error function is the Mean Squared Error between the target and the predicted spectrograms, computed both before and after PostNet, plus the Binary Cross Entropy between the actual and predicted values of the stop token prediction layer. The resulting error is the simple sum of these three terms. The model was trained on a single GeForce GTX 1080 Ti GPU with 11 GB of memory.
When working with such a large model, it is important to see how training is going, and TensorBoard proved to be a convenient tool for this. We tracked the error on both training and validation iterations. In addition, we displayed target spectrograms, spectrograms predicted during training, spectrograms predicted during validation, and the alignment, that is, the attention weights accumulated over all decoder steps.
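The three-term error can be written out as a short NumPy sketch. The function names and the toy tensors are assumptions; padding is ignored here for simplicity.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error over all elements."""
    return np.mean((pred - target) ** 2)


def bce(prob, target, eps=1e-7):
    """Binary cross entropy between predicted probabilities and 0/1 targets."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return -np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))


def tacotron_loss(mel_before, mel_after, stop_prob, mel_target, stop_target):
    """Simple sum of the three terms: MSE before PostNet, MSE after PostNet, stop-token BCE."""
    return mse(mel_before, mel_target) + mse(mel_after, mel_target) + bce(stop_prob, stop_target)


target = np.zeros((3, 80))               # 3 frames, 80 channels
stop_target = np.array([0.0, 0.0, 1.0])  # only the last frame is final
loss = tacotron_loss(target + 1.0, target + 0.5, np.array([0.5, 0.5, 0.5]), target, stop_target)
```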
It is possible that at first, your attention won’t be too informative:
Figure 4: Poorly trained attention weights.
However, after all your modules start working properly, you will finally get something like this:
Figure 5: Well-trained attention weights.
What does this chart mean? At each decoder step, we are trying to decode one frame of the spectrogram. However, it is not clear in advance which part of the encoder output should be used at each step. One could assume a direct correspondence: for example, if we have an input text sequence of 200 characters and a corresponding spectrogram of 800 frames, then there would be 4 frames per character. However, speech generated from such a spectrogram would be completely devoid of naturalness. We say some words faster and others slower; sometimes we pause and sometimes we don’t, and it is impossible to account for every possible context by hand. That is why attention is a key element of the entire system: it establishes the correspondence between the decoder step and the encoder information needed to generate a specific frame. The greater the attention weight, the more “attention should be paid” to the corresponding part of the encoder data when generating a spectrogram frame.
During training, it is also useful to generate audio rather than only assessing spectrograms and attention visually. However, those who have worked with WaveNet will agree that using it as a vocoder during training would be an impermissible luxury in terms of time. Therefore, we recommend the Griffin-Lim algorithm, which partially restores the signal from its short-time Fourier transform magnitudes. Why partially? Because when converting a signal into a spectrogram, we lose the phase information. Nevertheless, the quality of the resulting audio is quite enough to understand in which direction you are moving.
Below we share some thoughts on organizing the development process, presented as tips. Some of them are quite general, others are more specific.
On the organization of the workflow:
Use a version control system that clearly describes all changes. When searching for optimal architecture, changes occur constantly. And having received some satisfactory intermediate result, be sure to make yourself a checkpoint so that you can make subsequent changes boldly.
From our point of view, in such architectures one should adhere to the principles of encapsulation: one class, one Python module. This approach is rarely found in ML projects, but it will help you structure the code and speed up debugging and development. Both in the code and in your mental picture of the architecture, divide it into blocks, blocks into modules, and modules into layers. If a module has code that performs a specific role, put it into a method of the module class. These are truisms, but they bear repeating.
Provide classes with the numpy-style documentation. This will greatly simplify the work of both you and colleagues who will read your code.
Always draw the architecture of your model. Firstly, it will help you to comprehend it, and secondly, a look from the outside on the architecture and on the hyperparameters of the model will allow you to quickly identify discrepancies in your approach.
It is better to work as a team. If you work alone, still gather colleagues and discuss your work. They can ask you a question that will lead you to some thoughts and will point to a certain inaccuracy that does not allow you to successfully train the model.
Another useful trick is related to data preprocessing. Suppose you decide to test some hypothesis and make the appropriate changes to the model. Restarting full training, especially before the weekend, is risky: the approach may be wrong from the start, and you will waste time. What to do then? Increase the size of the fast Fourier transform window. The default setting is 1024; increase it 4 or even 8 times. This “compresses” the spectrograms by the corresponding factor and significantly speeds up training. The restored audio will be of lower quality, but audio quality is not your goal at this point. In 2-3 hours you can already get an alignment of the attention weights (as shown in the figure above), which indicates that the approach is architecturally sound and can be tried on the full-size data.
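A rough back-of-the-envelope check of how much shorter the spectrograms become. It assumes that the hop between frames is a fixed fraction (one quarter) of the window, so that enlarging the window also enlarges the hop; the hop ratio and the function name are assumptions, while 22050 Hz is the LJSpeech sampling rate.

```python
SAMPLE_RATE = 22050  # LJSpeech sampling rate


def n_frames(duration_s, n_fft, hop_ratio=0.25):
    """Approximate frame count, assuming the hop is a fixed fraction of the window."""
    hop = int(n_fft * hop_ratio)
    return 1 + int(duration_s * SAMPLE_RATE) // hop


# Frame counts for a 5-second clip at the default window and at 4x and 8x windows.
for n_fft in (1024, 4096, 8192):
    print(n_fft, n_frames(5.0, n_fft))
```

Fewer frames mean fewer decoder steps per utterance, which is where the training speed-up comes from.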
Building and training models:
We assumed that forming batches by sequence length rather than at random would speed up training and improve the quality of the generated spectrograms. The assumption is logical: the more useful signal (and the less padding) is fed to the network during training, the better. However, this approach did not pay off; in our experiments, we could not train the network this way. This is probably due to the loss of randomness in the selection of training instances.
Use modern network initialization algorithms that give well-conditioned initial states. For example, in our experiments we used Xavier uniform weight initialization. If your module needs both batch normalization and an activation function, apply them in exactly this order: normalization first, then activation. After all, if we applied, say, ReLU first, we would immediately lose all the negative signal that should take part in normalizing the data of a given batch.
Starting from a certain training step, use a dynamic learning rate. It really helps to reduce the error and improve the quality of the generated spectrograms.
If you have built a model and failed to train it on batches from the entire data set, it is useful to try to overfit it on a single batch. If you succeed, you will get an alignment, and audio reconstructed from the generated spectrograms will contain speech (or at least something resembling it). This confirms that your architecture is correct overall and only small details are missing.
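Both points can be illustrated with a small NumPy sketch: Xavier uniform initialization draws weights from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), and normalization is applied before the activation. The layer sizes and the simplified batch normalization without learned scale and shift are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def xavier_uniform(fan_in, fan_out):
    """Sample weights from U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))


def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch axis (no learned scale/shift here)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


W = xavier_uniform(512, 512)
x = rng.normal(size=(32, 512))  # toy batch of 32 vectors
# Order discussed above: linear layer -> batch normalization -> activation.
y = np.maximum(batch_norm(x @ W), 0.0)
```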
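One possible form of such a schedule is a constant rate followed by exponential decay. All numbers below (base rate, floor, start step, decay period and factor) are assumptions for illustration, not the values used in the experiments described here.

```python
def learning_rate(step, base_lr=1e-3, min_lr=1e-5, start_decay=50_000,
                  decay_steps=50_000, decay_rate=0.5):
    """Constant rate until `start_decay`, then exponential decay floored at `min_lr`."""
    if step < start_decay:
        return base_lr
    lr = base_lr * decay_rate ** ((step - start_decay) / decay_steps)
    return max(lr, min_lr)


# The rate stays at base_lr, halves every 50,000 steps after that, and never drops below min_lr.
schedule = [learning_rate(s) for s in (0, 50_000, 100_000, 10_000_000)]
```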
Speaking of these details. Errors in the construction of the model can be very different. For example, in the initial experiments, we got a classic error – an incorrect activation function after a fully connected layer. Therefore, always ask yourself why you want to use one or another activation function in a specific layer. Here it is useful to decompose everything into separate modules, so it will be easier to inspect each element of the model.
When working with RNNs, we tried passing the hidden states and cell states on as the initial states for the next training iteration. However, this approach did not pay off. Yes, it gives the network some hidden view of the entire data set, but is that necessary in the context of this task? A much more interesting and appropriate approach may be to train the initial hidden state of the LSTM layers in the same way as ordinary weight parameters.
When working with seq2seq models, you face the problem of different sequence lengths within a batch. It is solved very simply by padding: reserved characters in the case of encoder input data, or frames with specific values in the case of the decoder. But how should the error function be applied to the predicted and real spectrograms? In our experiments, using a mask in the error function worked well, so that the error is computed only on the useful signal (excluding padding).
We also have a recommendation specific to the PyTorch framework. Although the LSTM layer in the decoder effectively works as an LSTM cell, receiving only one sequence element at each decoder step, we recommend using the class torch.nn.LSTM rather than torch.nn.LSTMCell. The reason is that the LSTM backend is implemented in C in the cuDNN library, while LSTMCell is implemented in Python. This trick significantly increases the speed of the system.
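A masked error function can be sketched as follows, assuming spectrogram batches of shape (batch, frames, channels) and a list of true lengths per sample. The function name and the toy data are illustrative.

```python
import numpy as np


def masked_mse(pred, target, lengths):
    """Mean squared error computed only over real (non-padding) frames."""
    max_frames = target.shape[1]
    # mask[b, t] is True while t is inside the true length of sample b.
    mask = np.arange(max_frames)[None, :] < np.asarray(lengths)[:, None]
    squared = (pred - target) ** 2 * mask[:, :, None]
    # Divide by the number of real elements, not by the padded tensor size.
    return squared.sum() / (mask.sum() * target.shape[2])


target = np.zeros((2, 4, 80))
pred = np.zeros((2, 4, 80))
pred[1, 2:] = 10.0  # large error, but only inside the padding of sample 1
loss = masked_mse(pred, target, lengths=[4, 2])
```

Here the error is zero even though the padded frames differ wildly, because they are masked out before averaging.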
At the end of the article we will share examples of speech generation from texts that were not contained in the training set: