Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition

Wentao Yu et al. Sensors (Basel). 2022 Jul 23;22(15):5501. doi: 10.3390/s22155501.

Abstract

Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR systems, whether hybrid or end-to-end (E2E), still do not appear to make optimal use of this secondary information stream, as performance remains clearly diminished in noisy conditions for large-vocabulary tasks. We therefore propose a new fusion architecture, the decision fusion net (DFN). A broad range of time-variant reliability measures serves as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements in performance over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream-weighting, which is considered to be the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline system, achieves a relative word-error-rate reduction of 43%, both showing the efficacy of our proposed fusion architecture.
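To make the fusion idea above concrete, the following minimal PyTorch sketch illustrates reliability-based decision fusion: per-stream log-posteriors (one audio and two video streams, as in Figure 1) are concatenated with time-variant reliability features and mapped to fused log-posteriors by a small feed-forward network. The stream count, layer sizes, reliability-feature dimension, and plain feed-forward topology are illustrative assumptions; this does not reproduce the authors' exact DFN.

import torch
import torch.nn as nn

class DecisionFusionSketch(nn.Module):
    """Illustrative stand-in for a decision fusion net (not the paper's exact DFN)."""
    def __init__(self, num_states, num_streams=3, rel_dim=8, hidden=512, depth=3):
        super().__init__()
        in_dim = num_streams * num_states + rel_dim
        layers = []
        for i in range(depth):
            layers += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, num_states))
        self.net = nn.Sequential(*layers)

    def forward(self, stream_logposts, reliability):
        # stream_logposts: (batch, time, num_streams, num_states) per-stream log-posteriors
        # reliability:     (batch, time, rel_dim) auxiliary reliability measures
        b, t, s, n = stream_logposts.shape
        x = torch.cat([stream_logposts.reshape(b, t, s * n), reliability], dim=-1)
        return torch.log_softmax(self.net(x), dim=-1)  # fused log-posteriors

# Toy usage: one audio and two video streams over 100 frames.
dfn = DecisionFusionSketch(num_states=500)
logposts = torch.randn(2, 100, 3, 500).log_softmax(dim=-1)
rel = torch.randn(2, 100, 8)
fused = dfn(logposts, rel)  # shape: (2, 100, 500)

For reference, the relative word-error-rate reductions quoted above follow the usual definition (WER_baseline - WER_fused) / WER_baseline; for example, a drop from 20.0% to 9.8% WER is a 51% relative reduction.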

Keywords: audio-visual speech recognition; decision fusion net; end-to-end recognition; hybrid models; reliability measures.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Audio-visual fusion based on the DFN, applied to one stream of audio and two streams of video features.
Figure 2
Audio encoder (left), video encoder (middle), and reliability measure encoder (right) for both modalities i ∈ {A, V}. The blue blocks are used to align video features with audio features; the turquoise block shows the transformer encoder.
Figure 3
Transformer encoder for both modalities i ∈ {A, V}. The blue block shows the sub-sampling, whereas the turquoise blocks comprise the transformer encoder.
Figure 4
Transformer decoder (left) and CTC decoder (right) for both modalities i ∈ {A, V}.
Figure 5
DFN fusion topology for the E2E model, type ∈ {s2s, ctc}.
Figure 6
Decision fusion net structure for the hybrid model. The turquoise block indicates the successively repeated layers.
Figure 7
DFN_ctc (left) and DFN_s2s (right). The turquoise blocks indicate the successively repeated layers.
Figure 8
Estimated log-posteriors of sentence S2 for the target state s_t*, with additive noise at 9 dB. All abbreviations are the same as in Table 3. The whiskers show the maximum and minimum values; the upper and lower bounds of the green boxes represent the 25th and 75th percentiles; the yellow line in the center of each box indicates the median.
Figure 9
WER (%) on the test set of the LRS2 corpus in different noise conditions.
