2023 Dec 5;5(1):100887.
doi: 10.1016/j.patter.2023.100887. eCollection 2024 Jan 12.

Enhancing phenotype recognition in clinical notes using large language models: PhenoBCBERT and PhenoGPT

Jingye Yang et al. Patterns (N Y).

Abstract

To enhance phenotype recognition in clinical notes of genetic diseases, we developed two models, PhenoBCBERT and PhenoGPT, for expanding the vocabularies of Human Phenotype Ontology (HPO) terms. While HPO offers a standardized vocabulary for phenotypes, existing tools often fail to capture the full scope of phenotypes because of the limitations of traditional heuristic or rule-based approaches. Our models leverage large language models to automate the detection of phenotype terms, including those not in the current HPO. We compared these models with PhenoTagger, another HPO recognition tool, and found that our models identify a wider range of phenotype concepts, including previously uncharacterized ones. Our models also show strong performance in case studies on biomedical literature. We evaluate the strengths and weaknesses of BERT- and GPT-based models in aspects such as architecture and accuracy. Overall, our models enhance automated phenotype detection from clinical texts, improving downstream analyses on human diseases.

Keywords: BERT; GPT; Human Phenotype Ontology; clinical notes; electronic health records; named entity recognition; transformer.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Illustration of the workflow of the project.
Figure 2
Illustration of the BERT-based and GPT-based models used in this study for a sentence with a phenotype mention. (A) Conversion of the input sequence into a combination of three embeddings (the sentence embedding is treated as zero in this study). (B) The pre-training and fine-tuning strategy for PhenoBCBERT. (C) The pre-training and fine-tuning strategy for PhenoGPT.
Figure 3
Examples of phenotype terms from eight clinical notes recognized by PhenoTagger and PhenoBCBERT. The font size is relative to the frequency of appearance.
Figure 4
Examples of phenotype terms from clinical notes recognized by different series of GPT models. GPT comparison 1: prediction results after prompt-based learning. GPT comparison 2–6: prediction results after fine-tuning. GPT-3 (N): GPT-3 fine-tuned based on N instances.
Figure 5
Case studies of the predicted phenotype entities with PhenoTagger, PhenoBCBERT, and PhenoGPT (GPT-3). The negation terms and misspelled terms from the original published manuscript are highlighted.
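
Panel A of Figure 2 describes the input as a combination of three embeddings, with the sentence embedding treated as zero. The sketch below illustrates that idea in plain Python, assuming the standard BERT convention of summing token, position, and sentence (segment) embeddings element-wise. The vocabulary, dimension, random tables, and the names `make_table` and `embed` are hypothetical and are not taken from the paper.

```python
import random

def make_table(rows, dim, seed):
    """Build a toy embedding table of random values (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]

def embed(token_ids, token_table, position_table, dim):
    """Sum token, position, and sentence embeddings for each input position."""
    # Assumption: the sentence (segment) embedding is a zero vector,
    # as stated in the Figure 2A caption.
    sentence_vec = [0.0] * dim
    vectors = []
    for pos, tok in enumerate(token_ids):
        vectors.append([
            t + p + s
            for t, p, s in zip(token_table[tok], position_table[pos], sentence_vec)
        ])
    return vectors

DIM = 4  # toy embedding size, far smaller than a real model
VOCAB = {"[CLS]": 0, "patient": 1, "has": 2, "seizures": 3, "[SEP]": 4}

token_table = make_table(len(VOCAB), DIM, seed=0)
position_table = make_table(8, DIM, seed=1)  # supports sequences up to 8 tokens

sentence = ["[CLS]", "patient", "has", "seizures", "[SEP]"]
ids = [VOCAB[w] for w in sentence]
vectors = embed(ids, token_table, position_table, DIM)
```

Because the sentence embedding is zero, each resulting vector depends only on the token identity and its position in the sequence.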
