TL;DR:
- Transformer-based modeling techniques have advanced audio production.
- SoundStorm addresses challenges in generating long audio token sequences.
- It employs attention-based models and non-autoregressive decoding.
- Custom architectures tailored to neural audio codecs are crucial.
- SoundStorm’s hierarchical token structure enables accurate factorization.
- It achieves rapid and efficient audio generation without compromising quality.
- SoundStorm complements AudioLM’s acoustic generator while being significantly faster.
- Combining SoundStorm with SPEAR-TTS enables lifelike conversations.
- SoundStorm synthesizes 30 seconds of dialogue in just 2 seconds on a single TPU-v4.
Main AI News:
The audio production industry has seen remarkable advancements in recent years, thanks to sophisticated Transformer-based sequence-to-sequence modeling techniques. Applied to the discrete representations of audio produced by neural codecs, these techniques have driven significant progress in speech continuation, text-to-speech, and audio and music generation more broadly.
Producing high-quality audio, however, means modeling the tokens of a neural codec at a sufficiently high bitrate, and that bitrate can be bought in only two ways: an exponentially larger codebook or longer token sequences. Exponential codebook growth is impractical, while long token sequences pose computational difficulties for autoregressive models, whose attention and memory costs grow with sequence length.
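To make the trade-off concrete (using illustrative figures rather than the paper's exact configuration): a codec producing 50 frames per second with 8 residual quantizer levels yields 50 × 8 × 30 = 12,000 tokens for just 30 seconds of audio, and the quadratic cost of self-attention over such a flattened sequence, on the order of 10⁸ token pairs, quickly becomes prohibitive for an autoregressive Transformer.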
Researchers at Google have conducted a study focused on addressing these challenges, resulting in SoundStorm, a cutting-edge audio generation technique. A key objective is to design attention-based models that reduce the runtime complexity of self-attention over long audio sequences, since balancing perceived quality against runtime is a central hurdle in audio generation. At least three approaches can be employed, individually or in combination:
- Effective attention mechanisms: Implementing attention mechanisms that optimize the processing of audio token sequences, ensuring high-quality audio generation.
- Non-autoregressive, parallel decoding schemes: Utilizing decoding schemes that allow for parallel processing, enabling efficient generation of audio token sequences without relying on autoregressive models.
- Custom architectures for neural audio codecs: Developing specialized architectures tailored to the unique properties of tokens produced by neural audio codecs, enhancing the overall audio modeling process.
The unique structure of audio token sequences holds particular promise for long-sequence audio modeling. Even so, further work is needed to generate long, high-quality audio segments from the token sequences produced by neural audio codecs, whether generation is unconditional or driven by weak conditioning such as text.
Specifically, SoundStream and EnCodec employ Residual Vector Quantization (RVQ) to quantize compressed audio frames: each quantizer operates on the residual left by the preceding one, and the number of quantizers determines the total bitrate. It is important to note that tokens from the finer RVQ levels contribute less to the perceived quality, so models and decoding strategies must account for this unique input structure during both training and inference.
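To illustrate the mechanism, here is a minimal NumPy sketch of RVQ encoding for a single frame. The function name, dimensions, and random codebooks are illustrative stand-ins, not SoundStream's actual implementation:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Encode one audio frame with residual vector quantization (sketch).

    frame:     (d,) embedding of a compressed audio frame
    codebooks: list of (codebook_size, d) arrays, one per RVQ level
    Returns one token index per level; each quantizer encodes the
    residual left over by the previous one.
    """
    residual = frame
    tokens = []
    for codebook in codebooks:
        # Pick the codeword nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)
        # The next level quantizes what this level failed to capture.
        residual = residual - codebook[idx]
    return tokens

# Example: 8 RVQ levels with 1024 entries each -> 8 tokens per frame,
# so the bitrate scales with the number of levels (8 * 10 bits here).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 64)) for _ in range(8)]
print(rvq_encode(rng.normal(size=64), codebooks))
```

Because each level encodes only what the previous levels missed, the coarse levels carry most of the perceptual information, which is exactly the property SoundStorm's decoding order exploits.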
To address these challenges, the researchers introduce SoundStorm, a rapid and efficient audio creation technique. SoundStorm leverages an architecture designed specifically for the hierarchical structure of audio tokens, along with a parallel, non-autoregressive, confidence-based decoding scheme for residual vector quantized token sequences. This innovative approach effectively resolves the issue of generating long audio token sequences. By incorporating a hierarchical token structure, SoundStorm facilitates accurate factorizations and estimations of the joint distribution of the token sequence.
To predict masked audio tokens generated by SoundStream, SoundStorm employs a bidirectional attention-based Conformer. This Conformer ensures that the internal sequence length for self-attention matches the number of SoundStream frames, irrespective of the number of quantizers in the RVQ.
On the input side, SoundStorm sums the embeddings of the tokens belonging to the same SoundStream frame, and it predicts the masked target tokens with separate heads, one per RVQ level. During inference, all audio tokens start out masked; given the conditioning signal, the masked tokens are then filled in RVQ level by level across multiple iterations, with many tokens within a level predicted in parallel in each iteration.
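The sketch below illustrates both ideas, the per-frame embedding summation and the confidence-based, level-by-level filling, in plain NumPy. The `model` callable is a hypothetical stand-in for the Conformer plus its per-level heads, and the uniform iteration schedule is a simplification; the paper's actual schedule and confidence scoring may differ:

```python
import numpy as np

MASK = -1  # sentinel id for a masked token position

def sum_frame_embeddings(tokens, tables):
    """Collapse RVQ levels into one input vector per SoundStream frame.

    tokens: (n_levels, n_frames) array of token ids; MASK entries are skipped.
    tables: list of (codebook_size, d) embedding tables, one per level.
    The Conformer then attends over n_frames positions, no matter how
    many RVQ levels there are.
    """
    x = np.zeros((tokens.shape[1], tables[0].shape[1]))
    for q in range(tokens.shape[0]):
        visible = tokens[q] != MASK
        x[visible] += tables[q][tokens[q, visible]]
    return x

def decode(model, cond, n_levels, n_frames, n_iters):
    """Fill in masked tokens level by level, many positions per iteration.

    `model(tokens, cond, level)` stands in for the Conformer plus the
    output head of `level`: it returns a predicted token id and a
    confidence score for every frame.
    """
    tokens = np.full((n_levels, n_frames), MASK)
    for level in range(n_levels):
        for it in range(n_iters):
            preds, conf = model(tokens, cond, level)
            still_masked = tokens[level] == MASK
            conf = np.where(still_masked, conf, -np.inf)
            # Commit the most confident predictions in parallel;
            # the rest stay masked for the next iteration.
            n_keep = int(np.ceil(still_masked.sum() / (n_iters - it)))
            keep = np.argsort(-conf)[:n_keep]
            tokens[level, keep] = preds[keep]
    return tokens
```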
To support this inference scheme, the researchers propose a training-time masking strategy that mirrors the inference process. They demonstrate that SoundStorm can replace both stage two and stage three of AudioLM's acoustic generator, producing audio roughly two orders of magnitude faster than AudioLM's hierarchical autoregressive acoustic generator while matching it in quality, speaker identity, and acoustic conditions.
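One plausible reading of that masking strategy is sketched below; the level sampling and the cosine schedule (borrowed from MaskGIT-style masked training) are assumptions about details the article does not spell out:

```python
import numpy as np

def sample_training_mask(n_levels, n_frames, rng):
    """Draw a training mask that mirrors level-by-level inference.

    One RVQ level q is sampled as the level being predicted: levels
    below q stay fully visible as context, a random subset at level q
    is masked (cosine schedule, assumed here), and all finer levels
    are fully masked. Returns a boolean array where True = masked.
    """
    mask = np.zeros((n_levels, n_frames), dtype=bool)
    q = int(rng.integers(n_levels))            # level currently trained on
    ratio = np.cos(rng.uniform() * np.pi / 2)  # fraction of level q to mask
    mask[q] = rng.random(n_frames) < ratio
    mask[q + 1:] = True                        # finer levels fully masked
    return mask

# Example usage with a seeded generator:
mask = sample_training_mask(n_levels=8, n_frames=100,
                            rng=np.random.default_rng(0))
```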
Furthermore, the researchers showcase how SoundStorm, combined with the text-to-semantic modeling stage of SPEAR-TTS, can create high-quality, lifelike conversations with control over the spoken content, speaker voices, and turn-taking. Notably, SoundStorm achieves impressive runtime performance, synthesizing 30 seconds of dialogue in just 2 seconds on a single TPU-v4.
Conclusion:
The development of SoundStorm and its groundbreaking techniques for audio creation marks a significant milestone in the market. The use of Transformer-based modeling and attention-based models, along with the implementation of hierarchical token structures and non-autoregressive decoding, has revolutionized the process of generating high-quality audio.
SoundStorm’s ability to deliver impressive runtime performance while maintaining comparable quality to existing methods presents tremendous opportunities for various industries reliant on audio production. With its potential to significantly reduce production time and enhance the overall user experience, SoundStorm is poised to reshape the audio production market and unlock new possibilities in areas such as speech generation, text-to-speech applications, and conversational AI.