PLoS One. 2021 Jun 28;16(6):e0253241. doi: 10.1371/journal.pone.0253241. eCollection 2021.

AUBER: Automated BERT regularization


Hyun Dong Lee et al. PLoS One. 2021.

Abstract

How can we effectively regularize BERT? Although BERT has proven effective in various NLP tasks, it often overfits when only a small number of training instances are available. A promising direction for regularizing BERT is to prune its attention heads using a proxy score for head importance. However, these methods are usually suboptimal because they prune an arbitrarily determined number of attention heads and do not directly aim at improving performance. To overcome this limitation, we propose AUBER, an automated BERT regularization method that leverages reinforcement learning to automatically prune the appropriate attention heads from BERT. We also minimize the model complexity and the action search space by proposing a low-dimensional state representation and a dually-greedy approach for training. Experimental results show that AUBER outperforms existing pruning methods, achieving up to 9.58% better performance. In addition, an ablation study demonstrates the effectiveness of AUBER's design choices.
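The regularization mechanism described here is attention head pruning. As a minimal, hedged sketch (not the authors' implementation), the Hugging Face transformers library exposes a prune_heads method that removes selected attention heads from a BERT model; the layer and head indices below are placeholders standing in for the heads AUBER would select with its reinforcement learning policy.

    # Minimal sketch, not the authors' code: removing chosen attention heads
    # from a fine-tunable BERT model with the Hugging Face `transformers` API.
    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

    # Map layer index -> list of head indices to remove. The values here are
    # placeholders; AUBER would choose them via its DQN-based policy.
    heads_to_prune = {2: [0, 5], 7: [3]}
    model.prune_heads(heads_to_prune)

    # The pruned model is then fine-tuned or evaluated on the downstream task.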


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Performance of AUBER and its competitors on 4 GLUE datasets.
AUBER successfully regularizes the BERT model, improving performance by up to 9.58%. AUBER provides the best performance among state-of-the-art BERT attention head pruning methods.
Fig 2. Overview of transitioning in AUBER.
The figure shows the transition from Layer 2 to Layer 3 in AUBER with BERT-base.
Fig 3. Overall flow of training AUBER on a layer.
AUBER trains a DQN to identify the attention heads that should be pruned for better regularization, following the illustrated steps.
Fig 4. Performance after pruning the attention heads from each layer.
AUBER consistently improves the model's performance and achieves an outstanding final result, while all the other methods fail to improve the model.
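Figs 2-4 describe a layer-by-layer procedure: AUBER processes one BERT layer at a time, training a DQN to decide which heads of that layer to prune before transitioning to the next layer. The sketch below illustrates only that outer per-layer loop under stated assumptions; the greedy validation check and the evaluate helper are hypothetical stand-ins for the paper's DQN-based selection.

    # Hypothetical sketch of the layer-by-layer outer loop suggested by
    # Figs 2-4. The real AUBER selects heads with a DQN; here a greedy check
    # against a held-out score stands in for that decision, and `evaluate`
    # is a placeholder for task evaluation on a development set.
    import copy

    def prune_layer_by_layer(model, dev_set, num_layers=12, heads_per_layer=12):
        best_score = evaluate(model, dev_set)  # hypothetical helper
        for layer in range(num_layers):
            for head in range(heads_per_layer):
                candidate = copy.deepcopy(model)
                candidate.prune_heads({layer: [head]})
                score = evaluate(candidate, dev_set)
                if score > best_score:  # keep prunes that help on the dev set
                    model, best_score = candidate, score
        return model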

