A multimodal dynamical variational autoencoder for audiovisual speech representation learning

Samir Sadok¹, Simon Leglaive², Laurent Girin³, Xavier Alameda-Pineda⁴, Renaud Séguier²

Affiliations

¹ CentraleSupélec IETR UMR CNRS 6164, France. Electronic address: samir.sadok@centralesupelec.fr.
² CentraleSupélec IETR UMR CNRS 6164, France.
³ Univ. Grenoble Alpes CNRS, Grenoble-INP, GIPSA-lab, France.
⁴ Inria, Univ. Grenoble Alpes CNRS, LJK, France.

PMID: 38266474
DOI: 10.1016/j.neunet.2024.106120

Free article

Neural Netw. 2024 Apr;172:106120. doi: 10.1016/j.neunet.2024.106120. Epub 2024 Jan 11.

Abstract

High-dimensional data such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently. This suggests that there exists a lower-dimensional latent representation from which the high-dimensional observed data were generated. Uncovering the hidden explanatory features of complex data is the goal of representation learning, and deep latent variable generative models have emerged as promising unsupervised approaches. In particular, the variational autoencoder (VAE), which is equipped with both a generative and an inference model, allows for the analysis, transformation, and generation of various types of data. Over the past few years, the VAE has been extended to deal with data that are either multimodal or dynamical (i.e., sequential). In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists of learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with few labeled data, and with better accuracy than unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
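The latent-space structure described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the variable names (`w`, `z_av`, `z_a`, `z_v`), the dimensions, and the choice of which latent variables feed each modality's decoder are assumptions made for illustration only. The abstract states only that one static variable, shared dynamical factors, and modality-specific dynamical factors exist. Independent Gaussian samples stand in for the dynamical latents here; a real dynamical model would make each frame depend on previous frames.

```python
import random
from dataclasses import dataclass


@dataclass
class LatentSizes:
    """Hypothetical latent dimensions (not taken from the paper)."""
    static: int = 4   # w: one vector per sequence, constant over time
    shared: int = 3   # z_av[t]: dynamical, shared by audio and visual
    audio: int = 2    # z_a[t]: dynamical, audio-specific
    visual: int = 2   # z_v[t]: dynamical, visual-specific


def gaussian_vector(size, rng):
    """Draw a vector of independent standard normal values."""
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def sample_latents(num_frames, sizes, rng):
    """Placeholder sampling: i.i.d. draws, with no temporal dependency."""
    return {
        "w": gaussian_vector(sizes.static, rng),
        "z_av": [gaussian_vector(sizes.shared, rng) for _ in range(num_frames)],
        "z_a": [gaussian_vector(sizes.audio, rng) for _ in range(num_frames)],
        "z_v": [gaussian_vector(sizes.visual, rng) for _ in range(num_frames)],
    }


def decoder_inputs(latents):
    """Assumed wiring: each modality sees the static, shared, and own-specific latents."""
    audio_in, visual_in = [], []
    for z_av, z_a, z_v in zip(latents["z_av"], latents["z_a"], latents["z_v"]):
        audio_in.append(latents["w"] + z_av + z_a)    # list concatenation
        visual_in.append(latents["w"] + z_av + z_v)
    return audio_in, visual_in


sizes = LatentSizes()
latents = sample_latents(num_frames=5, sizes=sizes, rng=random.Random(0))
audio_in, visual_in = decoder_inputs(latents)
```

In this sketch the static vector `w` is repeated at every frame, while the shared block `z_av` is identical in the audio and visual inputs at a given frame. Under these assumptions, swapping `w` between two sequences would change only the time-invariant content, which is the kind of manipulation the experiments in the abstract refer to.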

Keywords: Audiovisual speech processing; Deep generative modeling; Disentangled representation learning; Multimodal and dynamical data; Variational autoencoder.

Conflict of interest statement

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
