J Am Med Inform Assoc. 2023 Jan 18;30(2):340-347. doi: 10.1093/jamia/ocac225.

A comparative study of pretrained language models for long clinical text

Yikuan Li et al. J Am Med Inform Assoc.

Abstract

Objective: Clinical knowledge-enriched transformer models (eg, ClinicalBERT) have achieved state-of-the-art results on clinical natural language processing (NLP) tasks. A core limitation of these transformer models is the substantial memory consumption of their full self-attention mechanism, which leads to performance degradation on long clinical texts. To overcome this, we propose to leverage long-sequence transformer models (eg, Longformer and BigBird), which extend the maximum input sequence length from 512 to 4096 tokens, to better model long-term dependencies in long clinical texts.
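
As a minimal sketch (not taken from the paper), the effect of this longer input limit can be seen with the Hugging Face transformers tokenizers; the Bio_ClinicalBERT repository name and the synthetic note below are illustrative assumptions.

    from transformers import AutoTokenizer

    # Stand-in for a long clinical note; many real notes exceed 512 tokens.
    long_note = " ".join(["dyspnea on exertion"] * 1000)

    # A BERT-style clinical tokenizer must truncate at 512 tokens ...
    bert_tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    short_ids = bert_tok(long_note, truncation=True, max_length=512)["input_ids"]

    # ... while a Longformer-style tokenizer keeps up to 4096 tokens.
    long_tok = AutoTokenizer.from_pretrained("yikuan8/Clinical-Longformer")
    full_ids = long_tok(long_note, truncation=True, max_length=4096)["input_ids"]

    print(len(short_ids), len(full_ids))  # 512 vs up to 4096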

Materials and methods: Inspired by the success of long-sequence transformer models and the fact that clinical notes are mostly long, we introduce 2 domain-enriched language models, Clinical-Longformer and Clinical-BigBird, which are pretrained on a large-scale clinical corpus. We evaluate both language models on 10 baseline tasks, including named entity recognition, question answering, natural language inference, and document classification.
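
Fine-tuning then follows the standard transformers pattern for each task type; the snippet below sketches a single supervised step for document classification with a placeholder note and label, under assumed defaults rather than the paper's experimental configuration.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("yikuan8/Clinical-Longformer")
    model = AutoModelForSequenceClassification.from_pretrained(
        "yikuan8/Clinical-Longformer", num_labels=2)  # assumed binary task

    note = "Discharge summary: patient stable on room air; follow up in 2 weeks."
    inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=4096)
    labels = torch.tensor([1])  # placeholder document-level label

    outputs = model(**inputs, labels=labels)  # returns loss and logits
    outputs.loss.backward()                   # gradients for one training step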

Results: The results demonstrate that Clinical-Longformer and Clinical-BigBird consistently and significantly outperform ClinicalBERT and other short-sequence transformers in all 10 downstream tasks and achieve new state-of-the-art results.

Discussion: Our pretrained language models provide the bedrock for clinical NLP on long texts. Our source code is available at https://github.com/luoyuanlab/Clinical-Longformer, and the pretrained models are available for public download at https://huggingface.co/yikuan8/Clinical-Longformer.
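
The released checkpoint can be loaded directly from the Hugging Face hub; the snippet below is a hedged usage sketch for extracting contextual embeddings (the example note is an assumption, not the authors' code).

    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("yikuan8/Clinical-Longformer")
    model = AutoModel.from_pretrained("yikuan8/Clinical-Longformer")

    note = "The patient was admitted with progressive shortness of breath."
    inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=4096)
    outputs = model(**inputs)

    # Contextual token embeddings with shape (1, sequence_length, hidden_size)
    print(outputs.last_hidden_state.shape)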

Conclusion: This study demonstrates that clinical knowledge-enriched long-sequence transformers are able to learn long-term dependencies in long clinical text. Our methods can also inspire the development of other domain-enriched long-sequence transformers.

Keywords: clinical natural language processing; named entity recognition; natural language inference; question answering; text classification.

Figures

Figure 1. The pipeline for pretraining and fine-tuning transformer-based language models.
