📄Reading Note: "A Survey on Neural Speech Synthesis"
Tan, Xu, et al. "A Survey on Neural Speech Synthesis." arXiv preprint arXiv:2106.15561 (2021).
Text to speech (TTS), also known as speech synthesis, aims to synthesize intelligible and natural speech from text [1].
Neural TTS adopts deep neural networks (DNNs) as the model backbone for speech synthesis (trend: end-to-end modeling, with high voice quality in terms of intelligibility and naturalness, and fewer requirements on human pre-processing and feature development).
Characters: the raw format of text;
Linguistic features: obtained through text analysis, containing rich context information about pronunciation and prosody (phonemes are one of the most important elements of linguistic features and are often used alone to represent text in neural-based TTS models);
Acoustic features: abstract representations of the speech waveform (mel-spectrograms or linear spectrograms are usually used as acoustic features in neural end-to-end TTS, which are then converted into waveform using neural vocoders);
Waveform: the final format of speech.
Possible data flows from text to waveform include:
character -> linguistic features -> acoustic features -> waveform;
character -> phoneme -> acoustic features -> waveform;
character -> linguistic features -> waveform;
character -> phoneme -> waveform;
character -> waveform.
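As a rough illustration of the most common flow above (character -> phoneme -> acoustic features -> waveform), the sketch below wires the stages together; text_frontend, acoustic_model, and vocoder are hypothetical placeholders for real modules (e.g., a G2P frontend, an acoustic model such as FastSpeech 2, and a neural vocoder such as HiFi-GAN), not code from the survey.

```python
# Minimal sketch of a typical two-stage neural TTS pipeline plus a text frontend.
# All three components are hypothetical callables standing in for real models.
def synthesize(text, text_frontend, acoustic_model, vocoder):
    phonemes = text_frontend(text)  # text normalization + G2P: characters -> phonemes
    mel = acoustic_model(phonemes)  # acoustic model: phonemes -> mel-spectrogram frames
    waveform = vocoder(mel)         # vocoder: mel-spectrogram -> waveform samples
    return waveform
```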
Text analysis, also called frontend in TTS, transforms input text into linguistic features that contain rich information about pronunciation and prosody to ease speech synthesis.
Previously, typical tasks for text analysis in statistical parametric speech synthesis include:
Text normalization: convert raw written text (non-standard words) into spoken-form words, making the words easy to pronounce for TTS models (e.g., "1989" -> "nineteen eighty nine");
Word segmentation: detect word boundaries in raw text (necessary for character-based languages such as Chinese), ensuring the accuracy of the three later steps;
Part-of-speech (POS) tagging: tag the POS of each word (such as noun, verb, and preposition), which is important for the two later steps;
Prosody "韵律" prediction: rely on tagging systems (vary for different languages) to label each kind of prosody, such as rhythm, stress, and intonation of speech, corresponding to the variations in syllable duration, loudness and pitch;
Grapheme-to-phoneme (G2P) conversion: convert characters (graphemes) into pronunciations (phonemes) to ease speech synthesis (e.g., "speech" -> "s p iy ch").
After text analysis, we can further construct linguistic features, usually by aggregating results from different levels including the phoneme, syllable, word, phrase, and sentence levels [2], and then take them as input to the later parts of the TTS pipeline.
Although text analysis receives less attention in neural TTS, text normalization is still needed to obtain standard words from raw character input, and G2P conversion is further needed to obtain phonemes from the standard words.
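As a toy illustration of these two remaining frontend steps, the sketch below hard-codes one normalization rule and a four-word pronunciation lexicon; both the rule and the lexicon are made-up examples for demonstration, not part of any real TTS frontend.

```python
# Toy text normalization + G2P, for illustration only.
LEXICON = {  # tiny made-up pronunciation dictionary (ARPAbet-style phonemes)
    "speech": ["S", "P", "IY1", "CH"],
    "nineteen": ["N", "AY1", "N", "T", "IY1", "N"],
    "eighty": ["EY1", "T", "IY0"],
    "nine": ["N", "AY1", "N"],
}

def normalize(text):
    # spell out the non-standard word "1989" as spoken-form words
    return text.replace("1989", "nineteen eighty nine")

def g2p(text):
    phonemes = []
    for word in normalize(text).lower().split():
        phonemes.extend(LEXICON.get(word, list(word)))  # fall back to characters
    return phonemes

print(g2p("speech 1989"))
# ['S', 'P', 'IY1', 'CH', 'N', 'AY1', 'N', 'T', 'IY1', 'N', 'EY1', 'T', 'IY0', 'N', 'AY1', 'N']
```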
Acoustic models generate acoustic features from linguistic features or directly from phonemes or characters; the acoustic features are then converted into waveform using vocoders.
The choice of acoustic features largely determines the type of TTS pipeline. Current choices include mel-cepstral coefficients (MCC), mel-generalized coefficients (MGC), band aperiodicity (BAP), fundamental frequency (F0), voiced/unvoiced flag (V/UV), bark-frequency cepstral coefficients (BFCC), and, most widely used in neural end-to-end TTS, mel-spectrograms.
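A minimal sketch of extracting a log-mel-spectrogram as the acoustic feature, using librosa; the file name is a placeholder, and the frame/hop sizes and 80 mel bands are common choices in neural TTS rather than values prescribed by the survey.

```python
import librosa
import numpy as np

# load a waveform at 22.05 kHz (placeholder path)
wav, sr = librosa.load("example.wav", sr=22050)
# short-time Fourier analysis followed by an 80-band mel filterbank
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
# log compression, as commonly used for TTS acoustic features
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
print(log_mel.shape)  # (80, num_frames)
```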
--- TBC ---
Autoregressive vocoders: take linguistic features or mel-spectrograms as input and generate the waveform autoregressively (sample by sample), which typically makes inference slow;
Flow-based vocoders: based on normalizing flows, a kind of generative model that transforms a probability density through a sequence of invertible mappings using the change-of-variables rule [3] (written out after this list); during sampling, data is generated from a standard probability distribution through the inverse of these transforms;
GAN-based vocoders: consist of a generator for data generation and a discriminator that judges the authenticity of the generated data;
Diffusion-based vocoders: formulate the mapping between the data and latent distributions with a diffusion process and a reverse process, achieving very high voice quality but slow inference speed due to the long iterative process.
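For reference, the change-of-variables rule mentioned for flow-based vocoders [3] can be written as follows (with data $x$, invertible mapping $z = f(x)$, and base density $p_Z$; this is the standard formulation, not notation taken from the survey):

$$\log p_X(x) = \log p_Z(f(x)) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|$$

During sampling, one draws $z$ from the base distribution (e.g., a standard Gaussian) and computes $x = f^{-1}(z)$.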
Fully end-to-end TTS models generate speech waveform from character/phoneme sequence directly.
Advantages:
less human annotation and feature development (e.g., alignment information between text and speech);
joint and end-to-end optimization, avoiding error propagation in cascaded models;
reduced training, development, and deployment cost.
Challenges:
the different modalities between text and speech waveform;
the huge length mismatch between character/phoneme sequence and waveform sequence.
Trends (typically towards fully end-to-end):
simplify the text analysis module and linguistic features (e.g., at most retaining text normalization and G2P conversion);
simplify acoustic features from the complicated MGC, BAP and F0 to mel-spectrograms;
replace two or three modules with a single end-to-end model (e.g., acoustic models + vocoders -> single vocoder).
In neural TTS, robustness issues such as word skipping, word repeating, and attention collapse often happen in acoustic models when generating the mel-spectrogram sequence from the character/phoneme sequence.
The causes of these robustness issues fall into two categories:
The difficulty in learning the alignments between characters/phonemes and mel-spectrograms;
The exposure bias and error propagation problems incurred in autoregressive generation (the model is trained on ground-truth previous frames but conditions on its own, possibly erroneous, predictions at inference, so errors accumulate along the sequence).
Solutions for cause 1:
enhancing the robustness of the attention mechanism;
removing attention and instead predicting duration explicitly to bridge the length mismatch between text and speech (see the length-regulator sketch after these solutions).
Solutions for cause 2:
improving autoregressive generation to alleviate the exposure bias and error propagation;
removing autoregressive generation and instead using non-autoregressive generation.
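To make the duration-based solution concrete, a FastSpeech-style length regulator expands each phoneme-level hidden state by its predicted duration so that the expanded sequence matches the length of the mel-spectrogram; the sketch below assumes durations are already predicted and uses NumPy only for illustration.

```python
import numpy as np

def length_regulate(phoneme_hidden, durations):
    """Expand phoneme-level hidden states to frame level.

    phoneme_hidden: (num_phonemes, hidden_dim) encoder outputs
    durations:      (num_phonemes,) predicted number of mel frames per phoneme
    returns:        (sum(durations), hidden_dim) frame-level sequence
    """
    return np.repeat(phoneme_hidden, durations, axis=0)

# toy example: 3 phonemes with hidden size 4 and durations 2/3/1 -> 6 frames
hidden = np.random.randn(3, 4)
frames = length_regulate(hidden, np.array([2, 3, 1]))
print(frames.shape)  # (6, 4)
```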
Vocoders do not face severe robustness issues, since the acoustic features and the waveform are already aligned frame-wise (i.e., each frame of acoustic features corresponds to a fixed number (the hop size) of waveform samples).
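For example, with a 22.05 kHz sampling rate and a hop size of 256 samples (a common configuration, not one fixed by the survey), one second of audio contains 22,050 samples and corresponds to 22050 / 256 ≈ 86 mel-spectrogram frames.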
[1] Paul Taylor. Text-to-speech synthesis. Cambridge University Press, 2009.
[2] Keiichi Tokuda, Yoshihiko Nankaku, Tomoki Toda, Heiga Zen, Junichi Yamagishi, and Keiichiro Oura. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5):1234–1252, 2013.
[3] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.