A large language model for electronic health records

Xi Yang et al. NPJ Digit Med. 2022 Dec 26;5(1):194. doi: 10.1038/s41746-022-00742-2.

Abstract

There is increasing interest in developing artificial intelligence (AI) systems to process and interpret electronic health records (EHRs). Natural language processing (NLP) powered by pretrained language models is the key technology for medical AI systems that use clinical narratives. However, few clinical language models exist; the largest trained in the clinical domain has only 110 million parameters, compared with billions of parameters for general-domain models. It is not clear how large clinical language models with billions of parameters can help medical AI systems use unstructured EHRs. In this study, we develop from scratch a large clinical language model, GatorTron, using >90 billion words of text (including >82 billion words of de-identified clinical text) and systematically evaluate it on five clinical NLP tasks: clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference (NLI), and medical question answering (MQA). We examine how (1) scaling up the number of parameters and (2) scaling up the size of the training data benefit these NLP tasks. GatorTron scales the clinical language model from 110 million to 8.9 billion parameters and improves all five clinical NLP tasks (e.g., 9.6% and 9.5% accuracy improvements on NLI and MQA, respectively), and can be applied to medical AI systems to improve healthcare delivery. The GatorTron models are publicly available at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_og.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Distribution of patients by age, gender, race, and ethnicity, and of clinical notes by note type and clinical department.
Ages were calculated as of September 2022.
Fig. 2
Fig. 2. Training loss and validation loss for GatorTron-base (345 million), medium (3.9 billion), and large (8.9 billion) models.
a Training loss. b Validation loss. MLM masked language modeling.
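The MLM loss tracked in Fig. 2 comes from the standard BERT-style pretraining objective: a fraction of input tokens is selected as prediction targets, and of those, most are replaced with a [MASK] token, some with a random token, and some left unchanged. A minimal sketch of that input corruption step, assuming the standard 15%/80/10/10 recipe (the tiny vocabulary and the helper name here are invented for illustration, not taken from the paper):

```python
import random

MASK = "[MASK]"
VOCAB = ["pain", "fever", "cough", "aspirin", "dose"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: choose ~mask_prob of positions as
    prediction targets; of those, 80% become [MASK], 10% a random
    vocabulary token, 10% are left unchanged. Returns the corrupted
    sequence and a parallel label list (None = not a target)."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # the model must recover this token
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK     # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.choice(VOCAB)  # 10%: random token
            # remaining 10%: keep the original token
    return masked, labels
```

The MLM loss in Fig. 2 is then the cross-entropy of the model's predictions at exactly the positions where `labels` is set.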
Fig. 3
Fig. 3. An overview of pretraining and fine-tuning of GatorTron models.
We loaded the base model and the medium model into one GPU for distributed training. We sliced the GatorTron-large model into 4 pieces and loaded the pieces onto 4 GPUs for distributed training (i.e., model parallelism). Trm transformer unit.
Fig. 4
Fig. 4. Pretraining GatorTron-large model with 9 billion parameters using model parallelism.
Emb embedding, Tok Token from input sentence, Trm Transformer unit. [SEP]: a token defined in BERT to indicate sentence boundaries. [CLS]: a token defined in BERT for sentence-level representation.
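Figs. 3 and 4 describe slicing GatorTron-large into 4 pieces hosted on 4 GPUs. As a rough illustration of the general idea of sharding a model's layers across devices, here is a toy pipeline-style sketch; the class names, the even per-device split, and the scalar "layers" are all invented for the example, and this is a simplification, not the actual Megatron-style parallelism used to train GatorTron:

```python
class Layer:
    """Toy stand-in for one transformer piece: a simple scale on a vector."""
    def __init__(self, scale):
        self.scale = scale

    def forward(self, x, device):
        # In a real setup, x would first be moved to `device` and the
        # computation would run there; here we only record the placement.
        return [v * self.scale for v in x], device

def pipeline_forward(x, layers, n_devices=4):
    """Shard layers evenly across n_devices and run them in order,
    handing activations across each device boundary."""
    per_dev = (len(layers) + n_devices - 1) // n_devices  # layers per GPU
    placements = []
    for i, layer in enumerate(layers):
        device = i // per_dev            # which GPU hosts this layer
        x, d = layer.forward(x, device)
        placements.append(d)
    return x, placements
```

The key property is that no single device ever holds all the weights; each GPU stores only its contiguous slice of layers, which is what makes a model larger than one GPU's memory trainable at all.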
