IEEE Trans Pattern Anal Mach Intell. 2025 Aug 8:PP.
doi: 10.1109/TPAMI.2025.3597267. Online ahead of print.

Dubbing Movies via Hierarchical Phoneme Modeling and Acoustic Diffusion Denoising

Liang Li et al. IEEE Trans Pattern Anal Mach Intell.

Abstract

Given a piece of text, a video clip, and reference audio, the movie dubbing (also known as Visual Voice Cloning, V2C) task aims to generate speech that clones the reference voice and aligns well with the video in both emotion and lip movement, which is more challenging than conventional text-to-speech synthesis. To align the generated speech with the inherent lip motion of the given silent video, most existing works use each video frame to query textual phonemes. However, such an attention operation usually leads to mumbled speech, because different phonemes are fused for the video frames corresponding to a single phoneme (video frames are finer-grained than phonemes). To address this issue, we propose a diffusion-based movie dubbing architecture that improves pronunciation through Hierarchical Phoneme Modeling (HPM) and generates better mel-spectrograms through Acoustic Diffusion Denoising (ADD). We term our model HD-Dubber. Specifically, HPM bridges the visual information and the corresponding speech prosody from three aspects: (1) aligning lip movement with speech duration at the level of each phoneme unit via contrastive learning; (2) conveying facial expression to phoneme-level energy and pitch; and (3) injecting global emotions captured from video scenes into the prosody. ADD, in turn, exploits a denoising diffusion framework to transform a noise signal into a mel-spectrogram via a parameterized Markov chain conditioned on textual phonemes and reference audio. ADD contains two novel denoisers, the Style-adaptive Residual Denoiser (SRD) and the Phoneme-enhanced U-net Denoiser (PUD), which enhance speaker similarity and improve pronunciation quality. Extensive experimental results on three benchmark datasets demonstrate the state-of-the-art performance of the proposed method. The source code and trained models will be made available to the public.
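To make the ADD component more concrete, the following is a minimal sketch of a standard denoising-diffusion (DDPM-style) sampling loop that reverses a parameterized Markov chain from Gaussian noise to a mel-spectrogram, conditioned on phoneme and reference-audio (speaker) embeddings. All names here (denoiser, phoneme_emb, speaker_emb, the noise schedule) are illustrative assumptions, not the authors' released implementation of SRD or PUD.

```python
import torch

def sample_mel(denoiser, phoneme_emb, speaker_emb, mel_shape, num_steps=50):
    """Sketch of DDPM ancestral sampling: start from Gaussian noise and
    iteratively denoise it into a mel-spectrogram, conditioned on textual
    phonemes and a reference (speaker) embedding. Hypothetical interface."""
    betas = torch.linspace(1e-4, 0.02, num_steps)       # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(mel_shape)                           # x_T ~ N(0, I)
    for t in reversed(range(num_steps)):
        t_batch = torch.full((mel_shape[0],), t, dtype=torch.long)
        # The denoiser predicts the noise added at step t, conditioned on
        # phoneme and speaker embeddings (conditioning scheme is assumed).
        eps = denoiser(x, t_batch, phoneme_emb, speaker_emb)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise          # one reverse Markov step
    return x                                             # estimated mel-spectrogram
```

In the paper's architecture, the predicted mel-spectrogram would additionally be shaped by the HPM-derived prosody (duration, energy, pitch, and global emotion); that conditioning is omitted here for brevity.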
