SoundStorm is a model for efficient, non-autoregressive audio generation that produces high-quality audio two orders of magnitude faster than traditional autoregressive generation approaches.
Coupled with the text-to-semantic modeling stage of SPEAR-TTS, SoundStorm can synthesize high-quality, natural dialogues.
SoundStorm-generated audio remains detectable: the dedicated classifier of Borsos et al. (2022) detects it at a rate of 98.5%.
SoundStorm uses bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec.
The quality of SoundStorm's audio is comparable to that of traditional autoregressive generation approaches.
Dialogue synthesis for chatbots and virtual assistants
Audio generation for music and sound effects
Speech synthesis for audiobooks and podcasts
Voice cloning for voice assistants and virtual reality applications
Provide the semantic tokens of AudioLM as input to SoundStorm
Use bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec
Couple SoundStorm with the text-to-semantic modeling stage of SPEAR-TTS for dialogue synthesis
Use SoundStorm for efficient audio generation in various applications
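
The confidence-based parallel decoding step mentioned above can be illustrated with a small, self-contained sketch. This is not SoundStorm's implementation. The names `toy_model` and `parallel_decode`, the `MASK` sentinel, and the cosine masking schedule are all assumptions made for illustration, the "model" returns random probabilities instead of running a bidirectional-attention network conditioned on semantic tokens, and the codec output is simplified to a single token sequence. The sketch only shows the control flow: predict every masked position in parallel, commit the most confident predictions, and repeat on the rest.

```python
import math
import random

MASK = -1  # sentinel marking a position that has not been decoded yet


def toy_model(tokens, vocab_size, rng):
    """Hypothetical stand-in for a bidirectional network.

    Returns one probability distribution over the vocabulary per position.
    A real model would attend to the already decoded tokens and to the
    conditioning signal; this one ignores them and returns random values.
    """
    probs = []
    for _ in tokens:
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs


def parallel_decode(length, vocab_size, num_steps, seed=0):
    """Fill `length` masked positions in at most `num_steps` parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        probs = toy_model(tokens, vocab_size, rng)

        # For each masked position, take the most likely token and use its
        # probability as the confidence score.
        candidates = []
        for i in masked:
            best = max(range(vocab_size), key=lambda v: probs[i][v])
            candidates.append((probs[i][best], i, best))

        # Assumed cosine schedule: how many positions stay masked after
        # this pass. The last pass always decodes everything that is left.
        if step == num_steps:
            keep_masked = 0
        else:
            keep_masked = math.floor(
                length * math.cos(math.pi / 2 * step / num_steps)
            )
        num_to_fix = max(1, len(masked) - keep_masked)

        # Commit only the most confident predictions; the rest stay masked
        # and are predicted again in the next pass.
        candidates.sort(reverse=True)
        for _, i, tok in candidates[:num_to_fix]:
            tokens[i] = tok
    return tokens


if __name__ == "__main__":
    print(parallel_decode(length=16, vocab_size=8, num_steps=4))
```

Because every pass predicts all remaining positions at once, the number of model calls is bounded by `num_steps` rather than by the sequence length, which is where the speed advantage over token-by-token autoregressive decoding comes from.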