Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition

Wentao Yu et al. Sensors (Basel). 2022 Jul 23;22(15):5501. doi: 10.3390/s22155501.

Abstract

Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR systems, whether hybrid or end-to-end (E2E), still do not appear to make optimal use of this secondary information stream, as performance remains clearly diminished in noisy conditions for large-vocabulary tasks. We therefore propose a new fusion architecture, the decision fusion net (DFN). A broad range of time-variant reliability measures serves as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements in performance over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream-weighting, which is considered to be the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline system, achieves a relative word-error-rate reduction of 43%, both showing the efficacy of our proposed fusion architecture.
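To make the fusion idea above concrete, the following minimal PyTorch sketch illustrates reliability-based decision fusion: per-stream log-posteriors (one audio and two video streams, as in Figure 1) are concatenated with time-variant reliability features and mapped to fused log-posteriors by a small feed-forward network. The stream count, layer sizes, reliability-feature dimension, and plain feed-forward topology are illustrative assumptions; this does not reproduce the authors' exact DFN.

import torch
import torch.nn as nn

class DecisionFusionSketch(nn.Module):
    """Illustrative stand-in for a decision fusion net (not the paper's exact DFN)."""
    def __init__(self, num_states, num_streams=3, rel_dim=8, hidden=512, depth=3):
        super().__init__()
        in_dim = num_streams * num_states + rel_dim
        layers = []
        for i in range(depth):
            layers += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, num_states))
        self.net = nn.Sequential(*layers)

    def forward(self, stream_logposts, reliability):
        # stream_logposts: (batch, time, num_streams, num_states) per-stream log-posteriors
        # reliability:     (batch, time, rel_dim) auxiliary reliability measures
        b, t, s, n = stream_logposts.shape
        x = torch.cat([stream_logposts.reshape(b, t, s * n), reliability], dim=-1)
        return torch.log_softmax(self.net(x), dim=-1)  # fused log-posteriors

# Toy usage: one audio and two video streams over 100 frames.
dfn = DecisionFusionSketch(num_states=500)
logposts = torch.randn(2, 100, 3, 500).log_softmax(dim=-1)
rel = torch.randn(2, 100, 8)
fused = dfn(logposts, rel)  # shape: (2, 100, 500)

For reference, the relative word-error-rate reductions quoted above follow the usual definition (WER_baseline - WER_fused) / WER_baseline; for example, a drop from 20.0% to 9.8% WER is a 51% relative reduction.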

Keywords: audio-visual speech recognition; decision fusion net; end-to-end recognition; hybrid models; reliability measures.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Audio-visual fusion based on the DFN, applied to one stream of audio and two streams of video features.
Figure 2
Audio encoder (left), video encoder (middle), and reliability measure encoder (right) for both modalities i ∈ {A, V}. The blue blocks are used to align video features with audio features; the turquoise block shows the transformer encoder.
Figure 3
Transformer encoder for both modalities i ∈ {A, V}. The blue block shows the sub-sampling, whereas the turquoise blocks comprise the transformer encoder.
Figure 4
Transformer decoder (left) and CTC decoder (right) for both modalities i ∈ {A, V}.
Figure 5
DFN fusion topology for the E2E model, type ∈ {s2s, ctc}.
Figure 6
Decision fusion net structure for the hybrid model. The turquoise block indicates the successively repeated layers.
Figure 7
DFN_ctc (left) and DFN_s2s (right). The turquoise blocks indicate the successively repeated layers.
Figure 8
Estimated log-posteriors of sentence S2 for the target state s_t*, with additive noise at 9 dB. All abbreviations are the same as in Table 3. The whiskers show the maximum and minimum values; the upper and lower bounds of the green boxes represent the 25th and 75th percentiles; the yellow line in the center of each box indicates the median.
Figure 9
WER (%) on the test set of the LRS2 corpus in different noise conditions.
