The Microsoft AI team introduces NaturalSpeech 2, an advanced TTS system that uses latent diffusion models for robust zero-shot voice synthesis and more expressive prosody

TL;DR:

  • Microsoft introduces NaturalSpeech 2, an advanced TTS system with latent diffusion models for robust zero-shot voice synthesis and expressive prosodies.
  • TTS systems have made significant advancements in naturalness and intelligibility but lack diversity in capturing speaker identities and styles.
  • NaturalSpeech 2 uses continuous vectors instead of discrete tokens, which shortens sequences and preserves fine-grained detail for accurate speech reconstruction.
  • It replaces autoregressive models with diffusion models, improving stability and zero-shot capacity.
  • Speech prompting mechanisms facilitate in-context learning, enhancing zero-shot capacity and contextually appropriate speech synthesis.
  • NaturalSpeech 2 relies on a single acoustic model, which lets it extend to styles beyond speech, such as singing.
  • It outperforms previous TTS systems in generating natural speech with similar prosody to prompts and ground-truth speech.
  • NaturalSpeech 2 achieves comparable or better naturalness than ground-truth speech on test sets.
  • It can generate singing voices with novel timbre using short prompts, unlocking true zero-shot singing synthesis.
  • Future research will focus on accelerating the diffusion model and enabling mixed speaking/singing capabilities.

Main AI News:

The field of text-to-speech (TTS) aims to produce high-quality, varied speech that convincingly emulates human conversation. Prosody, speaker identity, and vocal style all contribute to the richness of human speech.

Thanks to the progress made in neural networks and deep learning, TTS systems have made remarkable advancements in terms of intelligibility and naturalness. In fact, certain systems, like NaturalSpeech, have achieved a level of voice quality comparable to that of humans, as demonstrated by benchmarking datasets recorded in professional studios.

Despite these strides, the lack of diversity in available data has hindered the ability to capture the full spectrum of speaker identities, prosodies, and styles present in human speech. Training TTS models on extensive corpora lets them learn these variations and, through few-shot or zero-shot techniques, apply them to a wide range of unseen scenarios. Modern large-scale TTS systems commonly quantize continuous speech waveforms into discrete tokens and model these tokens with autoregressive language models.

Microsoft researchers have now unveiled NaturalSpeech 2, a TTS system that uses latent diffusion models to achieve expressive prosody, robustness, and, most notably, strong zero-shot voice synthesis. The first step is to train a neural audio codec.

The codec encoder transforms a speech waveform into a sequence of latent vectors, and the codec decoder reconstructs the waveform from them. A diffusion model then generates these latent vectors, conditioned on prior vectors obtained from a phoneme encoder, a duration predictor, and a pitch predictor.
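
As a rough illustration of how the conditioning vectors from the phoneme encoder, duration predictor, and pitch predictor could be lined up with the codec's latent frames, here is a minimal Python sketch. It assumes a FastSpeech-style expansion, in which each phoneme vector is repeated for its predicted number of frames. The function names, the toy values, and the choice to append pitch as an extra dimension are all assumptions made for this example, not details taken from the paper.

```python
def expand_by_duration(phoneme_vectors, durations):
    """Repeat each phoneme-level vector once per predicted frame."""
    if len(phoneme_vectors) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for vector, n_frames in zip(phoneme_vectors, durations):
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


def build_prior(phoneme_vectors, durations, pitch):
    """Frame-level conditioning: expanded phoneme vectors plus one pitch value per frame."""
    frames = expand_by_duration(phoneme_vectors, durations)
    if len(frames) != len(pitch):
        raise ValueError("need exactly one pitch value per frame")
    # Appending pitch as an extra dimension is a simplification chosen for this sketch.
    return [frame + [f0] for frame, f0 in zip(frames, pitch)]


# Hypothetical outputs of the phoneme encoder, duration predictor, and pitch predictor.
phoneme_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
durations = [2, 1, 3]                                 # frames per phoneme
pitch = [110.0, 112.0, 0.0, 180.0, 185.0, 190.0]      # one value per frame

prior = build_prior(phoneme_vectors, durations, pitch)
print(len(prior))  # one conditioning vector per latent frame to be generated
```

The sum of the durations fixes how many latent vectors the diffusion model has to produce, so the output length is known before generation starts.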

The introduction of NaturalSpeech 2 reflects Microsoft’s continued push to extend what voice synthesis can do. With its zero-shot capability and nuanced prosody, the system could change the way we interact with synthesized voices. As the field evolves, further advances should bring synthetic speech closer to seamless use in daily life.

In their paper, the researchers discuss several key design decisions behind NaturalSpeech 2. These decisions include:

1. Continuous Vectors Instead of Discrete Tokens: Previous TTS systems often employed residual quantizers to ensure speech quality, which produces several tokens per speech frame and therefore long discrete token sequences. These long sequences placed a heavy burden on the acoustic model. To address this, the team opted for continuous vectors instead of discrete tokens. This choice shortens the sequence length and preserves more fine-grained information for accurate speech reconstruction.

2. Diffusion Models over Autoregressive Models: NaturalSpeech 2 replaces autoregressive models with diffusion models. This shift allows for improved reliability and stability, as well as enhanced zero-shot capacity. By employing diffusion models, the system can learn to generate speech in a contextually coherent manner, capturing the characteristics of the speech prompt.

3. Speech Prompting Mechanisms for In-Context Learning: The researchers developed speech prompting mechanisms to facilitate in-context learning within the diffusion model and pitch/duration predictors. These mechanisms enhance the zero-shot capacity of NaturalSpeech 2 by encouraging the diffusion models to adhere to the specific characteristics of the speech prompt, enabling accurate and contextually appropriate speech synthesis.

4. Single Acoustic Model for Improved Stability: Unlike its autoregressive predecessors, NaturalSpeech 2 relies on a single acoustic model, the diffusion model. This eliminates the need for a two-stage token prediction process and enables the system to extend its capabilities beyond speech synthesis. With the combination of duration/pitch prediction and non-autoregressive generation, NaturalSpeech 2 can be applied to diverse styles, including singing voices.
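
To make the sequence-length argument in the first design decision concrete, the following Python sketch runs a toy residual quantizer and counts the resulting tokens. It is a simplified illustration under stated assumptions: real codecs quantize vectors rather than single numbers, and the three codebooks and the frame values below are invented for the example.

```python
def rvq_encode(value, codebooks):
    """Residual quantization: each stage encodes what the earlier stages left over."""
    residual = value
    tokens = []
    for codebook in codebooks:
        # Pick the codeword closest to the current residual.
        index = min(range(len(codebook)), key=lambda i: abs(residual - codebook[i]))
        tokens.append(index)
        residual -= codebook[index]
    return tokens, residual


# Hypothetical 3-stage codebooks, ordered coarse to fine.
codebooks = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-0.2, -0.1, 0.0, 0.1, 0.2],
    [-0.04, -0.02, 0.0, 0.02, 0.04],
]
frames = [0.83, -0.42, 0.1, 0.57]  # one made-up scalar standing in for each speech frame

token_sequence = []
for frame in frames:
    tokens, _ = rvq_encode(frame, codebooks)
    token_sequence.extend(tokens)  # every frame contributes one token per stage

print(len(frames), "continuous vectors vs", len(token_sequence), "discrete tokens")
```

With R quantizer stages, a token-based model has to handle R times as many sequence elements as a model that works on one continuous vector per frame.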

The researchers scaled NaturalSpeech 2 to 400M model parameters and trained it on 44K hours of speech data. They evaluated it in zero-shot scenarios, where only a speech prompt of a few seconds was provided. In these experiments NaturalSpeech 2 outperformed strong previous TTS systems, generating natural speech whose prosody was similar to both the speech prompt and the ground-truth speech.

Moreover, its naturalness, as measured by CMOS (Comparative Mean Opinion Score), was comparable to or even better than that of the ground-truth speech on LibriTTS and VCTK test sets. Additionally, NaturalSpeech 2 showcased the ability to generate singing voices with novel timbre using short singing prompts or even solely speech prompts, unlocking true zero-shot singing synthesis.
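
For readers unfamiliar with the metric, a CMOS value is the average of side-by-side listener ratings, typically on a scale from -3 to +3. The Python sketch below shows the arithmetic only. The ratings are made up for illustration, and the sign convention (positive means the system under test was preferred over the reference) is an assumption of this example.

```python
from statistics import mean


def cmos(ratings):
    """Average comparative rating on a -3..+3 scale.

    Convention assumed here: positive favors the system under test,
    negative favors the reference, 0 means no preference.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < -3 or r > 3 for r in ratings):
        raise ValueError("ratings must lie in [-3, 3]")
    return mean(ratings)


# Made-up ratings from eight hypothetical listeners.
ratings = [1, 0, -1, 2, 0, 1, 0, -1]
print(cmos(ratings))
```

A score near zero means listeners found the two sources about equally natural, which is the sense in which a system can be called comparable to ground-truth recordings.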

Looking ahead, the research team plans to explore efficient strategies, such as consistency models, to speed up the diffusion model. They also aim to explore large-scale training on both speaking and singing voices, enabling more powerful mixed speaking and singing capabilities. These efforts should further extend NaturalSpeech 2 and help drive the progress of TTS technology as a whole.
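
To see why sampling speed is a concern, note that a diffusion sampler calls its denoising network once per step, and the steps must run in sequence. The Python sketch below uses a generic deterministic DDIM-style update, which is not necessarily the sampler used in NaturalSpeech 2. The noise schedule, the step count, and the "oracle" denoiser (a stand-in that already knows the clean target, so no trained network is needed) are all assumptions for illustration.

```python
import math
import random


def make_alpha_bar(num_steps):
    """Cumulative signal level: 1.0 at t=0 (clean) down to 0.001 at t=num_steps (mostly noise)."""
    return [1.0 - 0.999 * t / num_steps for t in range(num_steps + 1)]


class CountingDenoiser:
    """Stand-in for a trained network: an oracle that knows the clean target and counts calls."""

    def __init__(self, target):
        self.target = target
        self.calls = 0

    def predict_noise(self, x_t, alpha_bar_t):
        self.calls += 1
        scale = math.sqrt(1.0 - alpha_bar_t)
        return [(x - math.sqrt(alpha_bar_t) * x0) / scale for x, x0 in zip(x_t, self.target)]


def sample(denoiser, length, num_steps, rng):
    """Deterministic DDIM-style sampling: one denoiser call per step, run in sequence."""
    alpha_bar = make_alpha_bar(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # start from Gaussian noise
    for t in range(num_steps, 0, -1):
        a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
        eps = denoiser.predict_noise(x, a_t)
        # Estimate the clean latent, then move to the noise level of step t - 1.
        x0_pred = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e for x0, e in zip(x0_pred, eps)]
    return x


target = [0.2, -0.7, 1.1]  # made-up "clean" latent values
denoiser = CountingDenoiser(target)
latent = sample(denoiser, len(target), num_steps=50, rng=random.Random(0))
print(denoiser.calls)  # number of sequential network evaluations
```

In a real system each call is a full forward pass conditioned on the text and the speech prompt, so reducing the number of steps, which is what consistency models aim for, translates directly into faster synthesis.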

Conclusion:

The introduction of NaturalSpeech 2, with its advanced latent diffusion models, is a significant step forward in the field of text-to-speech technology. With its improved zero-shot capacity and expressive prosodies, NaturalSpeech 2 has the potential to revolutionize the way we interact with synthesized voices. As this technology continues to advance, it opens up new possibilities for applications in industries such as customer service, virtual assistants, and media production.

The development of NaturalSpeech 2 also highlights the importance of diversity in data collection and its impact on the accuracy and naturalness of TTS systems. It is exciting to see the progress being made in this field, and we can expect to see continued advancements that will shape the future of TTS technology and its applications in the market.
