Integrated visual transformer and flash attention for lip-to-speech generation GAN

Qiong Yang et al. Sci Rep. 2024 Feb 24;14(1):4525. doi: 10.1038/s41598-024-55248-6.
Abstract

Lip-to-Speech (LTS) generation is an emerging technology that has attracted wide attention and is evolving rapidly. LTS has a wide range of promising applications, including assisting people with speech impairments and improving speech interaction in virtual assistants and robots. However, the technique faces the following challenges: (1) recognition accuracy for Chinese lip-to-speech generation remains poor; (2) speech exhibits wide variation and is poorly aligned with the corresponding lip movements. Addressing these challenges will advance LTS technology, enhance communication abilities, and improve the quality of life of individuals with disabilities. Current lip-to-speech generation techniques usually employ a GAN architecture but suffer from insufficient joint modelling of local and global lip movements, which results in visual ambiguity and inadequate image representations. To solve these problems, we design Flash Attention GAN (FA-GAN) with the following features: (1) vision and audio are encoded separately, and lip motion is modelled jointly, to improve speech recognition accuracy; (2) a multilevel Swin Transformer is introduced to improve image representation; (3) a hierarchical iterative generator is introduced to improve speech generation; (4) a flash attention mechanism is introduced to improve computational efficiency. Extensive experiments indicate that FA-GAN outperforms existing architectures on both Chinese and English datasets; in particular, its recognition error rate on Chinese is only 43.19%, the lowest among comparable methods.


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
A detailed overview of the overall framework of FA-GAN. Flash attention is introduced into the multimodal attention module proposed in this article, which feeds the hierarchical iterative generator, in order to reduce the computational burden that the generator imposes.
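
The caption above only summarises the data flow. As a rough, hypothetical sketch of how such a pipeline could be organised, the Python/PyTorch code below shows a hierarchical iterative generator in which audio tokens repeatedly attend to visual (lip) tokens and a coarse mel-spectrogram estimate is refined at each stage; the fused attention call can dispatch to a FlashAttention-style kernel on supported hardware. All module names, dimensions, and the number of stages are illustrative assumptions, not details taken from the paper.

# Hypothetical sketch of a hierarchical iterative generator with fused attention.
# Module names, dimensions, and stage count are illustrative, not taken from FA-GAN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionStage(nn.Module):
    """One refinement stage: audio tokens attend to visual (lip) tokens."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        B, Ta, D = audio.shape
        Tv = visual.shape[1]
        h, d = self.heads, D // self.heads
        q = self.to_q(audio).view(B, Ta, h, d).transpose(1, 2)
        k, v = self.to_kv(visual).chunk(2, dim=-1)
        k = k.view(B, Tv, h, d).transpose(1, 2)
        v = v.view(B, Tv, h, d).transpose(1, 2)
        # The fused kernel avoids materialising the full Ta x Tv attention matrix
        # and can dispatch to a FlashAttention-style implementation on supported GPUs.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, Ta, D)
        return self.norm(audio + self.proj(out))

class HierarchicalGenerator(nn.Module):
    """Refines a mel-spectrogram estimate over several coarse-to-fine stages."""
    def __init__(self, dim: int = 256, n_stages: int = 3, n_mels: int = 80):
        super().__init__()
        self.stages = nn.ModuleList(CrossAttentionStage(dim) for _ in range(n_stages))
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, audio_tokens, visual_tokens):
        mels, x = [], audio_tokens
        for stage in self.stages:
            x = stage(x, visual_tokens)
            mels.append(self.to_mel(x))  # one mel prediction per stage
        return mels

One natural use of the per-stage outputs in such a design is to supervise each intermediate mel prediction against the target spectrogram, giving the coarse-to-fine behaviour an iterative generator aims for.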
Figure 2
(a) Architectural diagram depicting the principles of the Swin Transformer used in the framework. (b) Two successive Swin Transformer blocks: window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA).
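
To make the two block types in panel (b) concrete, the following sketch implements plain window attention (W-MSA) and its shifted counterpart (SW-MSA) using a cyclic shift, following the standard Swin Transformer formulation rather than the paper's own code; the window size, feature dimensions, and toy input are arbitrary assumptions.

# Illustrative sketch of window-based (W-MSA) vs. shifted-window (SW-MSA) attention,
# following the standard Swin Transformer idea; not the paper's implementation.
import torch
import torch.nn.functional as F

def window_partition(x, ws):
    """(B, H, W, C) -> (num_windows*B, ws*ws, C) non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_attention(x, ws, shift=0):
    """Self-attention restricted to local windows; shift > 0 gives SW-MSA."""
    B, H, W, C = x.shape
    if shift > 0:
        # Cyclically shift the feature map so the new windows straddle the
        # boundaries of the previous (unshifted) windows.
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    windows = window_partition(x, ws)                       # (nW*B, ws*ws, C)
    attn = F.scaled_dot_product_attention(windows, windows, windows)
    # Reverse the partition (and the shift) to restore the (B, H, W, C) layout.
    nH, nW = H // ws, W // ws
    out = attn.view(B, nH, nW, ws, ws, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
    if shift > 0:
        out = torch.roll(out, shifts=(shift, shift), dims=(1, 2))
    return out

feat = torch.randn(1, 8, 8, 32)            # toy feature map, H = W = 8, C = 32
y1 = window_attention(feat, ws=4)          # W-MSA block
y2 = window_attention(y1, ws=4, shift=2)   # SW-MSA block (shift = ws // 2)

A complete implementation would also mask attention between regions that become adjacent only because of the cyclic shift and would add learned relative position biases; both are omitted here for brevity.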
Figure 3
Overview and modelling of the flash attention principle employed in this article.
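
As background for this caption, the NumPy sketch below illustrates the core idea behind flash attention: keys and values are processed in tiles while a running maximum and softmax denominator are maintained, so the full attention matrix is never materialised. This is a generic single-head illustration of the FlashAttention recurrence, not the FA-GAN implementation; the tile size and tensor shapes are arbitrary.

# Minimal NumPy sketch of the tiled, online-softmax computation behind flash attention.
# Generic illustration of the idea, not the FA-GAN implementation.
import numpy as np

def flash_attention_1head(Q, K, V, tile=64):
    """Q: (Lq, d), K, V: (Lk, d). Equivalent to softmax(Q K^T / sqrt(d)) V,
    computed tile-by-tile over K/V without storing the full Lq x Lk matrix."""
    Lq, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((Lq, V.shape[1]))
    running_max = np.full((Lq, 1), -np.inf)   # running row-wise max of the logits
    running_sum = np.zeros((Lq, 1))           # running softmax denominator

    for start in range(0, K.shape[0], tile):
        k_blk = K[start:start + tile]
        v_blk = V[start:start + tile]
        logits = (Q @ k_blk.T) * scale                     # (Lq, tile)
        blk_max = logits.max(axis=1, keepdims=True)
        new_max = np.maximum(running_max, blk_max)
        # Rescale the previously accumulated numerator and denominator to the new
        # max, then add this tile's contribution (numerically stable online softmax).
        correction = np.exp(running_max - new_max)
        p = np.exp(logits - new_max)
        running_sum = running_sum * correction + p.sum(axis=1, keepdims=True)
        out = out * correction + p @ v_blk
        running_max = new_max

    return out / running_sum

# Sanity check against the naive formulation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
naive = np.exp((Q @ K.T) / np.sqrt(32))
naive = (naive / naive.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_1head(Q, K, V), naive, atol=1e-6)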
