Regularizing transformers with deep probabilistic layers
- PMID: 36812832
- DOI: 10.1016/j.neunet.2023.01.032
Abstract
Language models (LMs) have grown non-stop over the last decade, from sequence-to-sequence architectures to attention-based Transformers. However, regularization has not been studied in depth for these architectures. In this work, we use a Gaussian Mixture Variational Autoencoder (GMVAE) as a regularizing layer. We study its advantages with respect to the depth at which it is placed and demonstrate its effectiveness in several scenarios. Experimental results demonstrate that including deep generative models within Transformer-based architectures such as BERT, RoBERTa, or XLM-R yields more versatile models, able to generalize better and achieve improved scores on tasks such as SST-2 and TREC, or even impute missing/noisy words with richer text.
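To make the idea concrete, the following is a minimal PyTorch sketch of the general technique described in the abstract: inserting a variational bottleneck between Transformer encoder layers at a configurable depth and adding its KL term to the task loss as a regularizer. It is an illustrative assumption only; the layer below is a plain Gaussian VAE bottleneck rather than the authors' GMVAE, and all module names, depths, and hyperparameters are hypothetical.

```python
# Hypothetical sketch of a variational regularizing layer placed at a chosen
# depth inside a Transformer encoder. Not the authors' GMVAE implementation.
import torch
import torch.nn as nn

class VariationalBottleneck(nn.Module):
    """Maps each token state to a Gaussian posterior, samples with the
    reparameterization trick, and exposes a KL term for the training loss."""
    def __init__(self, d_model: int, d_latent: int = 64):
        super().__init__()
        self.to_mu = nn.Linear(d_model, d_latent)
        self.to_logvar = nn.Linear(d_model, d_latent)
        self.to_out = nn.Linear(d_latent, d_model)
        self.kl = torch.tensor(0.0)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Sample only during training; use the posterior mean at eval time.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar) if self.training else mu
        # KL divergence to a standard normal prior, averaged over tokens.
        self.kl = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).sum(-1).mean()
        return self.to_out(z)

class RegularizedEncoder(nn.Module):
    """Stack of Transformer encoder layers with the bottleneck inserted
    after the layer at index `insert_at` (the depth studied in the paper)."""
    def __init__(self, d_model: int = 256, n_layers: int = 4, insert_at: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.bottleneck = VariationalBottleneck(d_model)
        self.insert_at = insert_at

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.layers):
            x = block(x)
            if i == self.insert_at:
                x = self.bottleneck(x)
        return x

# Usage: the KL term acts as an extra regularization loss with weight beta.
model = RegularizedEncoder()
tokens = torch.randn(8, 16, 256)               # (batch, seq_len, d_model)
out = model(tokens)
task_loss = out.pow(2).mean()                  # placeholder for the real task loss
loss = task_loss + 0.1 * model.bottleneck.kl   # beta-weighted KL regularizer
loss.backward()
```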
Keywords: Deep learning; Missing data; Natural language processing; Regularization; Transformers; Variational auto-encoder.
Copyright © 2023 Elsevier Ltd. All rights reserved.
Conflict of interest statement
Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.