Google’s Tacotron 2 simplifies AI speech synthesis. It is the latest, improved version of Google’s text-to-speech system, and it uses neural networks to generate human-like speech from text.
Researchers at Google have long been working on methods for generating speech from text, i.e. text-to-speech (TTS) conversion. Tacotron and WaveNet were the starting points of this research, which led to Tacotron 2 after many further improvements. These neural networks are trained using only speech examples and the corresponding text transcripts.
WORKING & FEATURES
In brief, Tacotron 2 is a sequence-to-sequence model, optimized for TTS, that maps a sequence of letters to a sequence of features that encode the audio.
These features are an 80-dimensional audio spectrogram, with frames computed every 12.5 milliseconds.
These frames capture the pronunciation of words as well as various subtleties of human speech, including volume, speed and intonation (pitch and tone).
In the final step, these features are converted to a 24 kHz waveform using a WaveNet-like architecture that acts as a vocoder.
The generated audio is human-like, and the system is trained directly from data, without any complex feature engineering.
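To make the numbers above concrete, the following is a minimal sketch of the frame arithmetic only, not Google's implementation. It assumes the spectrogram frames and the output waveform share the same 24 kHz sample rate; the helper names (hop_length_samples, spectrogram_shape, samples_for_frames) are hypothetical and chosen for illustration.

```python
import math

SAMPLE_RATE_HZ = 24_000   # output waveform rate stated in the text
FRAME_SHIFT_MS = 12.5     # one spectrogram frame every 12.5 ms
N_FEATURES = 80           # dimensions per spectrogram frame


def hop_length_samples(sample_rate_hz=SAMPLE_RATE_HZ, frame_shift_ms=FRAME_SHIFT_MS):
    """Waveform samples covered by one frame step (assumes a shared sample rate)."""
    return int(round(sample_rate_hz * frame_shift_ms / 1000.0))


def spectrogram_shape(duration_s):
    """(frames, features) shape of the spectrogram for a clip of the given length."""
    n_frames = math.ceil(duration_s * 1000.0 / FRAME_SHIFT_MS)
    return (n_frames, N_FEATURES)


def samples_for_frames(n_frames):
    """Waveform samples the vocoder has to produce for n_frames spectrogram frames."""
    return n_frames * hop_length_samples()


if __name__ == "__main__":
    print(hop_length_samples())       # samples per frame step
    print(spectrogram_shape(2.0))     # shape for a two-second clip
    print(samples_for_frames(160))    # samples for 160 frames
```

Under these assumptions, each frame step covers 300 samples, so a two-second clip corresponds to 160 frames of 80 values each, which the vocoder expands into 48,000 waveform samples.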
The detailed architecture of Tacotron 2 is as follows. The lower half of the image shows the sequence-to-sequence model that maps a sequence of letters to a spectrogram.
Google has also provided audio samples of Tacotron 2 that demonstrate this state-of-the-art TTS system. In terms of naturalness of speech, the audio comes close to professional recordings. Although the system is thorough in its job, it still has difficulty pronouncing complex words such as "decorum", and it sometimes generates random noise. Moreover, expressing emotion according to the meaning of the text, i.e. knowing when to sound happy or sad, is still being worked on.