Meta Unveils Voicebox AI: The DALL-E of Text-to-Speech Technology

TL;DR:

  • Meta introduces Voicebox, a generative text-to-speech model.
  • Voicebox uses a non-autoregressive flow-matching model to generate audio clips based on input text.
  • Trained on 50,000+ hours of unfiltered audio spanning six languages.
  • Generates conversational speech; recognition models trained on its synthetic output perform almost as well as those trained on real speech.
  • Capable of infilling speech segments and actively editing audio clips.
  • Voicebox employs a novel zero-shot text-to-speech training method called Flow Matching.
  • Outperforms existing TTS models in intelligibility and audio similarity.
  • Not currently available to the public because of concerns about potential misuse.
  • Voicebox’s future applications include prosthetics, in-game NPCs, and digital assistants.

Main AI News:

In a bold leap towards the future of celebrity immortality, Meta has introduced Voicebox, a groundbreaking generative text-to-speech model that promises to redefine the spoken word, much as ChatGPT and DALL-E redefined text and image generation. Voicebox works like those generators, except that its output is audio: it produces clips of speech that faithfully render the input text. Meta describes it as “a non-autoregressive flow-matching model trained to infill speech, given audio context and text.” The model was trained on more than 50,000 hours of unfiltered audio, drawing on recorded speech and transcripts from public-domain audiobooks in English, French, Spanish, German, Polish, and Portuguese.

Training on such a diverse dataset enables Voicebox to generate speech that sounds remarkably conversational in any of its six training languages. Research conducted by Meta shows that speech recognition models trained on Voicebox-generated synthetic speech perform nearly as well as those trained on real speech: the synthetic data causes only a 1 percent error rate degradation, compared with the 45 to 70 percent drop-off seen with synthetic speech from existing text-to-speech (TTS) models.

The system was first trained to predict speech segments from the audio context surrounding them and from the transcript of the passage. Having learned to infill speech from context, the model can apply that skill to a variety of speech generation tasks, including generating segments in the middle of an audio recording without having to re-create the entire input.
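
To make the idea concrete, here is a minimal, purely illustrative Python sketch of the infilling task. Voicebox’s code and API are not public, so the `infill` function and the `model.generate` interface below are assumptions used only to show the shape of the problem: a span of audio features is masked out and regenerated conditioned on the surrounding frames and the transcript.

```python
# Illustrative sketch only: Voicebox is not public, so this interface is
# hypothetical and simply mirrors the task description above.
import numpy as np

def infill(model, audio_features, transcript, mask_start, mask_end):
    """Regenerate a masked span of audio features, conditioned on the
    surrounding (unmasked) frames and the full text transcript."""
    masked = audio_features.copy()
    masked[mask_start:mask_end] = 0.0          # hide the span to be regenerated
    context_mask = np.ones(len(audio_features), dtype=bool)
    context_mask[mask_start:mask_end] = False  # True = keep, False = regenerate

    # The (hypothetical) model sees the transcript plus the untouched context
    # frames and synthesizes only the missing frames.
    filled_span = model.generate(text=transcript,
                                 context=masked,
                                 context_mask=context_mask)

    out = audio_features.copy()
    out[mask_start:mask_end] = filled_span     # splice the new span back in
    return out
```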

Furthermore, Voicebox boasts the remarkable ability to actively edit audio clips, effectively eliminating background noise or even replacing misspoken words. Users can identify problematic segments within the speech, such as instances of disruptive noise, crop them, and instruct the model to regenerate the affected sections. It is akin to utilizing image-editing software to enhance and refine photographs, but for audio.
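
In practice, such editing could look like the hedged example below, which reuses the hypothetical `infill` sketch from above. The frame rate and helper names are assumptions for illustration, not part of Meta’s tooling.

```python
# Hypothetical editing workflow built on the infill() sketch above.
FRAMES_PER_SECOND = 100  # assumed 10 ms feature frames

def replace_noisy_span(model, audio_features, transcript, t_start_s, t_end_s):
    """Crop a problematic region (a dog bark, a misspoken word) and let the
    model re-synthesize it so it blends with the surrounding speech."""
    start = int(t_start_s * FRAMES_PER_SECOND)
    end = int(t_end_s * FRAMES_PER_SECOND)
    return infill(model, audio_features, transcript, start, end)

# Example: regenerate the segment between 3.2 s and 4.0 s of a clip.
# cleaned = replace_noisy_span(model, features, "full transcript ...", 3.2, 4.0)
```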

Text-to-speech generators have existed for quite some time, letting devices like your parents’ TomToms give driving directions in the comforting voice of Morgan Freeman. While modern services such as Speechify or ElevenLabs’ Prime Voice AI have improved considerably, they still need substantial amounts of source material to mimic a subject convincingly, and training requires a fresh trove of data for each new voice the model is asked to reproduce.

Voicebox, however, is a game-changer. Thanks to an innovative zero-shot text-to-speech training method dubbed Flow Matching, Meta’s AI system surpasses current state-of-the-art solutions by a wide margin. Voicebox reportedly outperforms existing models in intelligibility, achieving a 1.9 percent word error rate versus 5.9 percent for the prior state of the art, and its “audio similarity” composite score of 0.681 beats the state of the art’s 0.580. On top of that, Voicebox runs up to 20 times faster than today’s best TTS systems.
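
For readers curious about the underlying objective: flow matching comes from earlier generative-modeling work that Voicebox builds on. In its simplest “optimal transport path” form, a network v_theta is trained to predict the straight-line velocity from a noise sample x_0 to a data sample x_1. The expression below is that simplified, generic form, not Voicebox’s exact loss; Voicebox additionally conditions the network on the text and the masked audio context, as detailed in Meta’s paper.

```latex
% Simplified conditional flow-matching objective (optimal-transport path).
% Voicebox's actual loss additionally conditions v_theta on text and masked
% audio context; see Meta's research paper for the exact formulation.
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{\,t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0,I),\; x_1 \sim q_{\mathrm{data}}}
    \big\| \, v_\theta(x_t, t) - (x_1 - x_0) \, \big\|^2,
\qquad x_t = (1-t)\,x_0 + t\,x_1 .
```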

Despite these groundbreaking advancements, the Voicebox model and its source code are not currently available to the public. Meta has chosen not to release them out of concern about potential misuse, even while acknowledging the many exciting use cases for generative speech models. The company has, however, published a series of audio examples and an initial research paper outlining the program’s capabilities. Looking ahead, Meta’s research team envisions Voicebox technology being integrated into prosthetics for patients with vocal cord damage, in-game non-player characters (NPCs), and digital assistants. The future possibilities are truly exciting.

Conclusion:

Meta’s Voicebox represents a significant advancement in text-to-speech technology. Its ability to generate natural-sounding, conversational speech across multiple languages is impressive: speech recognition models trained on its synthetic output perform almost on par with those trained on real speech, and its capacity to actively edit audio clips points to a wide range of applications. The zero-shot training approach built on Flow Matching sets Voicebox apart from existing TTS systems, outperforming them in both intelligibility and audio similarity. Although the model has not been released to the public, the research paper and audio examples offer a glimpse of its capabilities. The work opens up new possibilities in assistive technology, gaming, and virtual assistants, paving the way for exciting market opportunities.
