Key ideas: a single non-autoregressive transformer, masked generative sequence modeling, a novel rescoring method, and a hybrid version that combines autoregressive and non-autoregressive modeling.
MAGNeT uses a single non-autoregressive transformer to generate high-quality audio, operating directly on multiple streams of audio tokens.
During training, MAGNeT predicts spans of masked tokens; during inference, it gradually constructs the output sequence over several parallel decoding steps.
MAGNeT uses a novel rescoring method to enhance generated audio quality, leveraging an external pre-trained model to rescore and rank predictions.
MAGNeT offers a hybrid version that combines autoregressive and non-autoregressive modeling, allowing for flexible and efficient audio generation.
MAGNeT can be trained with a restricted temporal context, which improves efficiency while preserving generation quality.
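The masked-generation loop described above can be sketched in plain Python. Everything here is an illustrative stand-in: the toy predictor, the cosine unmasking schedule, and the 8-token vocabulary are assumptions, not MAGNeT's actual model or settings.

```python
import math
import random

MASK = -1  # sentinel for a masked position

def toy_model(seq):
    """Stand-in predictor: returns a (token, confidence) pair per position.
    A real model would return a distribution from a trained transformer."""
    return [(pos % 8, random.random()) for pos in range(len(seq))]

def iterative_decode(length, steps=4, seed=0):
    """Start fully masked; each step, commit the most confident predictions
    and re-mask the rest, following a cosine schedule."""
    random.seed(seed)
    seq = [MASK] * length
    for step in range(steps):
        preds = toy_model(seq)
        # Fraction of tokens that should remain masked after this step.
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        keep_masked = int(length * frac)
        # Rank masked positions by confidence; commit the top ones.
        masked = [i for i, t in enumerate(seq) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            seq[i] = preds[i][0]
    return seq
```

On the final step the schedule reaches zero, so every remaining position is committed and the sequence is fully generated.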
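The rescoring idea — using an external pre-trained model to rank candidate generations — can be sketched as a weighted blend of two log-likelihoods. The weight and the toy scorers below are illustrative assumptions, not MAGNeT's exact formulation.

```python
def rescore(candidates, self_logp, ext_logp, w=0.7):
    """Rank candidate sequences by a weighted blend of two scores:
    (1 - w) * self_logp + w * ext_logp, highest first."""
    return sorted(candidates,
                  key=lambda c: (1 - w) * self_logp(c) + w * ext_logp(c),
                  reverse=True)

# Toy scorers: "self" prefers shorter sequences, "external" prefers
# even tokens; a real setup would query two pre-trained models.
self_lp = lambda c: -float(len(c))
ext_lp = lambda c: -float(sum(t % 2 for t in c))
```

Ranking a few candidate token sequences with these scorers then reduces to a single sorted list, from which the top candidate is kept.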
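Restricting temporal context amounts to limiting how far each position can attend. A minimal sketch of such a local attention mask, with an illustrative window size rather than MAGNeT's actual setting:

```python
def local_attention_mask(seq_len, context):
    """Boolean mask: position i may attend only to positions j with
    |i - j| <= context. `context` is an illustrative window size."""
    return [[abs(i - j) <= context for j in range(seq_len)]
            for i in range(seq_len)]
```

Such a mask can be handed to a transformer's attention layers so that each token only sees a fixed local window, reducing the effective context the model must learn over.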
Text-to-music generation
Text-to-audio generation
Music generation
Audio generation
Speech synthesis
Train MAGNeT on a large dataset of audio samples
Use the trained model to generate high-quality audio
Experiment with different hyperparameters and architectures to optimize performance
Use the hybrid version to combine autoregressive and non-autoregressive modeling
Restrict temporal context to improve training efficiency
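The hybrid step above can be sketched as a two-phase loop: a short prefix generated token by token (autoregressive), then the remainder filled in a few parallel masked-decoding rounds (non-autoregressive). The toy random predictors stand in for the trained transformer; lengths and step counts are illustrative.

```python
import random

MASK = -1  # sentinel for a masked position

def hybrid_generate(length, prompt_len=4, steps=3, seed=0):
    """Autoregressive prefix, then non-autoregressive infilling."""
    random.seed(seed)
    seq = []
    # Phase 1 (autoregressive): each token conditions on the prefix so far.
    for _ in range(prompt_len):
        seq.append(random.randrange(8))
    # Phase 2 (non-autoregressive): mask the rest, fill in parallel rounds.
    seq += [MASK] * (length - prompt_len)
    for step in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        # Commit an evenly growing share of positions each round;
        # the final round commits everything that remains.
        n_commit = max(1, len(masked) // (steps - step))
        for i in masked[:n_commit]:
            seq[i] = random.randrange(8)
    return seq
```

The design point is that the prefix fixes a coherent start for the sequence, while the parallel phase amortizes the remaining positions over far fewer forward passes than pure autoregressive decoding.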