Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition
- PMID: 35898005
- PMCID: PMC9370936
- DOI: 10.3390/s22155501
Abstract
Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR systems, whether hybrid or end-to-end (E2E), still do not appear to make optimal use of this secondary information stream, as performance remains clearly diminished in noisy conditions for large-vocabulary systems. We therefore propose a new fusion architecture: the decision fusion net (DFN). A broad range of time-variant reliability measures is used as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream weighting, which is considered the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline, achieves a relative word-error-rate reduction of 43%; both results show the efficacy of the proposed fusion architecture.
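The abstract does not specify the internals of the DFN, but the conventional dynamic stream-weighting baseline that it is compared against can be illustrated with a minimal sketch: each stream's frame-wise log-posteriors are combined with per-frame weights derived from scalar reliability measures. The function name `fuse_decisions` and the use of a softmax over the two reliability scores are illustrative assumptions, not the paper's method.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_decisions(audio_logp, video_logp, audio_rel, video_rel):
    """Dynamic stream weighting (illustrative, NOT the paper's DFN).

    audio_logp, video_logp: (T, C) frame-wise log-posteriors per stream.
    audio_rel, video_rel:   (T,) scalar reliability scores per frame.
    Returns (T, C) fused posteriors.
    """
    # Turn the two reliability scores into per-frame stream weights.
    w = softmax(np.stack([audio_rel, video_rel]), axis=0)  # shape (2, T)
    # Weighted combination of log-posteriors, renormalized per frame.
    fused = w[0][:, None] * audio_logp + w[1][:, None] * video_logp
    return softmax(fused, axis=-1)
```

In such schemes the weights are estimated from reliability indicators (e.g. estimated SNR, face-detection confidence); the DFN instead feeds the reliability measures directly into a learned fusion network, which the paper reports outperforms even oracle weights of this form.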
Keywords: audio-visual speech recognition; decision fusion net; end-to-end recognition; hybrid models; reliability measures.
Conflict of interest statement
The authors declare no conflict of interest.
