Integrated visual transformer and flash attention for lip-to-speech generation GAN

Qiong Yang et al. Sci Rep. 2024 Feb 24;14(1):4525. doi: 10.1038/s41598-024-55248-6.
Abstract

Lip-to-Speech (LTS) generation is an emerging technology that has attracted wide attention and is evolving rapidly. LTS has a wide range of promising applications, including assisting people with speech impairments and improving speech interaction in virtual assistants and robots. However, the technique faces the following challenges: (1) recognition accuracy for Chinese lip-to-speech generation remains poor; (2) speech exhibits wide variation and is poorly aligned with the corresponding lip movements. Addressing these challenges will advance LTS technology, enhance communication abilities, and improve the quality of life of individuals with disabilities. Current lip-to-speech generation techniques usually employ a GAN architecture but suffer from insufficient joint modelling of local and global lip movements, which results in visual ambiguity and inadequate image representations. To solve these problems, we design Flash Attention GAN (FA-GAN) with the following features: (1) vision and audio are encoded separately, and lip motion is modelled jointly, to improve speech recognition accuracy; (2) a multilevel Swin Transformer is introduced to improve image representation; (3) a hierarchical iterative generator is introduced to improve speech generation; (4) a flash attention mechanism is introduced to improve computational efficiency. Extensive experiments indicate that FA-GAN outperforms existing architectures on both Chinese and English datasets; in particular, its recognition error rate on Chinese is only 43.19%, the lowest among comparable methods.


Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
A detailed overview of the overall framework of FA-GAN. Flash attention is introduced into the multimodal attention module proposed in this article, which feeds the hierarchical iterative generator, in order to reduce the computational burden that the generator imposes.
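
The caption above only summarises the data flow. As a rough, hypothetical sketch of how such a pipeline could be organised, the Python/PyTorch code below shows a hierarchical iterative generator in which audio tokens repeatedly attend to visual (lip) tokens and a coarse mel-spectrogram estimate is refined at each stage; the fused attention call can dispatch to a FlashAttention-style kernel on supported hardware. All module names, dimensions, and the number of stages are illustrative assumptions, not details taken from the paper.

# Hypothetical sketch of a hierarchical iterative generator with fused attention.
# Module names, dimensions, and stage count are illustrative, not taken from FA-GAN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionStage(nn.Module):
    """One refinement stage: audio tokens attend to visual (lip) tokens."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        B, Ta, D = audio.shape
        Tv = visual.shape[1]
        h, d = self.heads, D // self.heads
        q = self.to_q(audio).view(B, Ta, h, d).transpose(1, 2)
        k, v = self.to_kv(visual).chunk(2, dim=-1)
        k = k.view(B, Tv, h, d).transpose(1, 2)
        v = v.view(B, Tv, h, d).transpose(1, 2)
        # The fused kernel avoids materialising the full Ta x Tv attention matrix
        # and can dispatch to a FlashAttention-style implementation on supported GPUs.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, Ta, D)
        return self.norm(audio + self.proj(out))

class HierarchicalGenerator(nn.Module):
    """Refines a mel-spectrogram estimate over several coarse-to-fine stages."""
    def __init__(self, dim: int = 256, n_stages: int = 3, n_mels: int = 80):
        super().__init__()
        self.stages = nn.ModuleList(CrossAttentionStage(dim) for _ in range(n_stages))
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, audio_tokens, visual_tokens):
        mels, x = [], audio_tokens
        for stage in self.stages:
            x = stage(x, visual_tokens)
            mels.append(self.to_mel(x))  # one mel prediction per stage
        return mels

One natural use of the per-stage outputs in such a design is to supervise each intermediate mel prediction against the target spectrogram, giving the coarse-to-fine behaviour an iterative generator aims for.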
Figure 2
(a) Architectural diagram depicting the principles of the Swin Transformer used in the framework. (b) Two successive Swin Transformer blocks: window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA).
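
To make the two block types in panel (b) concrete, the following sketch implements plain window attention (W-MSA) and its shifted counterpart (SW-MSA) using a cyclic shift, following the standard Swin Transformer formulation rather than the paper's own code; the window size, feature dimensions, and toy input are arbitrary assumptions.

# Illustrative sketch of window-based (W-MSA) vs. shifted-window (SW-MSA) attention,
# following the standard Swin Transformer idea; not the paper's implementation.
import torch
import torch.nn.functional as F

def window_partition(x, ws):
    """(B, H, W, C) -> (num_windows*B, ws*ws, C) non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_attention(x, ws, shift=0):
    """Self-attention restricted to local windows; shift > 0 gives SW-MSA."""
    B, H, W, C = x.shape
    if shift > 0:
        # Cyclically shift the feature map so the new windows straddle the
        # boundaries of the previous (unshifted) windows.
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    windows = window_partition(x, ws)                       # (nW*B, ws*ws, C)
    attn = F.scaled_dot_product_attention(windows, windows, windows)
    # Reverse the partition (and the shift) to restore the (B, H, W, C) layout.
    nH, nW = H // ws, W // ws
    out = attn.view(B, nH, nW, ws, ws, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
    if shift > 0:
        out = torch.roll(out, shifts=(shift, shift), dims=(1, 2))
    return out

feat = torch.randn(1, 8, 8, 32)            # toy feature map, H = W = 8, C = 32
y1 = window_attention(feat, ws=4)          # W-MSA block
y2 = window_attention(y1, ws=4, shift=2)   # SW-MSA block (shift = ws // 2)

A complete implementation would also mask attention between regions that become adjacent only because of the cyclic shift and would add learned relative position biases; both are omitted here for brevity.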
Figure 3
Overview and modelling of the flash attention principle employed in this article.
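
As background for this caption, the NumPy sketch below illustrates the core idea behind flash attention: keys and values are processed in tiles while a running maximum and softmax denominator are maintained, so the full attention matrix is never materialised. This is a generic single-head illustration of the FlashAttention recurrence, not the FA-GAN implementation; the tile size and tensor shapes are arbitrary.

# Minimal NumPy sketch of the tiled, online-softmax computation behind flash attention.
# Generic illustration of the idea, not the FA-GAN implementation.
import numpy as np

def flash_attention_1head(Q, K, V, tile=64):
    """Q: (Lq, d), K, V: (Lk, d). Equivalent to softmax(Q K^T / sqrt(d)) V,
    computed tile-by-tile over K/V without storing the full Lq x Lk matrix."""
    Lq, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((Lq, V.shape[1]))
    running_max = np.full((Lq, 1), -np.inf)   # running row-wise max of the logits
    running_sum = np.zeros((Lq, 1))           # running softmax denominator

    for start in range(0, K.shape[0], tile):
        k_blk = K[start:start + tile]
        v_blk = V[start:start + tile]
        logits = (Q @ k_blk.T) * scale                     # (Lq, tile)
        blk_max = logits.max(axis=1, keepdims=True)
        new_max = np.maximum(running_max, blk_max)
        # Rescale the previously accumulated numerator and denominator to the new
        # max, then add this tile's contribution (numerically stable online softmax).
        correction = np.exp(running_max - new_max)
        p = np.exp(logits - new_max)
        running_sum = running_sum * correction + p.sum(axis=1, keepdims=True)
        out = out * correction + p @ v_blk
        running_max = new_max

    return out / running_sum

# Sanity check against the naive formulation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
naive = np.exp((Q @ K.T) / np.sqrt(32))
naive = (naive / naive.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_1head(Q, K, V), naive, atol=1e-6)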
