DreamTalk is an expressive talking-head framework built from three components: a denoising network, a style-aware lip expert, and a style predictor. Together they produce high-quality audio-driven face motions, yielding photo-realistic talking faces with diverse speaking styles and accurate lip movements.
A diffusion-based denoising network that synthesizes high-quality audio-driven face motions across diverse expressions.
A style-aware lip expert that guides lip-sync while accounting for the speaking style, improving both the expressiveness and the accuracy of lip motions.
A diffusion-based style predictor that infers the target expression directly from the audio, eliminating the need for an expression reference video or text.
Generate photo-realistic talking faces with diverse speaking styles
Achieve accurate lip motions in audio-driven facial animation
Eliminate the need for expensive style reference videos through the style predictor
Harness the power of diffusion models for generating expressive talking heads
Download the code from GitHub along with the released model checkpoints
Install the required dependencies, as sketched below
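A minimal setup sketch, assuming a conda environment; the environment name, Python version, requirements file, and checkpoint filenames below are illustrative, so check the repository README for the exact dependency list and checkpoint download links:

```bash
# Clone the official repository (ali-vilab/dreamtalk on GitHub)
git clone https://github.com/ali-vilab/dreamtalk.git
cd dreamtalk

# Create an isolated environment (name and Python version are illustrative)
conda create -n dreamtalk python=3.7 -y
conda activate dreamtalk

# Install the Python dependencies; the repo README lists the exact packages
pip install -r requirements.txt

# Place the released model weights (e.g. denoising_network.pth, renderer.pth)
# into the checkpoints/ directory, as described in the repo
```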
Configure the inputs for the denoising network, style-aware lip expert, and style predictor
Run the DreamTalk inference pipeline to generate an expressive talking head, as shown below
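Once the environment and checkpoints are in place, generation is a single command. The sketch below assumes the repository's demo script, inference_for_demo_video.py, and its flags; the input paths are placeholders, and the exact script name and options should be verified against the repo README:

```bash
# Generate a talking-head video from a driving audio clip and a source portrait.
# All paths are placeholders; flag names follow the repo's demo script:
#   --wav_path          driving speech audio
#   --style_clip_path   speaking-style reference (3DMM coefficients)
#   --pose_path         head-pose sequence
#   --image_path        source portrait image
#   --cfg_scale         classifier-free guidance strength
#   --max_gen_len       cap on the generated clip length
python inference_for_demo_video.py \
    --wav_path data/audio/acknowledgement_english.m4a \
    --style_clip_path data/style_clip/3DMM/M030_front_neutral_level1_001.mat \
    --pose_path data/pose/RichardShelby_front_neutral_level1_001.mat \
    --image_path data/src_img/uncropped/male_face.png \
    --cfg_scale 1.0 \
    --max_gen_len 30 \
    --output_name demo_output
```

Raising --cfg_scale reportedly intensifies the speaking style, while values near 1.0 favor lip-sync accuracy.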