[Preprint]. 2023 Nov 9:arXiv:2308.06294v2.

Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT

Jingye Yang et al. ArXiv.

Abstract

To enhance phenotype recognition in clinical notes of genetic diseases, we developed two models - PhenoBCBERT and PhenoGPT - for expanding the vocabularies of Human Phenotype Ontology (HPO) terms. While HPO offers a standardized vocabulary for phenotypes, existing tools often fail to capture the full scope of phenotypes, due to limitations from traditional heuristic or rule-based approaches. Our models leverage large language models (LLMs) to automate the detection of phenotype terms, including those not in the current HPO. We compared these models to PhenoTagger, another HPO recognition tool, and found that our models identify a wider range of phenotype concepts, including previously uncharacterized ones. Our models also showed strong performance in case studies on biomedical literature. We evaluated the strengths and weaknesses of BERT-based and GPT-based models in aspects such as architecture and accuracy. Overall, our models enhance automated phenotype detection from clinical texts, improving downstream analyses on human diseases.

Keywords: Human Phenotype Ontology; clinical notes; electronic health records; named entity recognition; transformer.

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

Figure 1. Illustration of the workflow of the project.
Figure 2. Illustration of the BERT-based and GPT-based models used in the current study for a sentence with a phenotype mention. (A) Conversion of the input sequence into a combination of three embeddings (the sentence embedding is treated as zero in the current study). (B) The pre-training and fine-tuning strategy for PhenoBCBERT. (C) The pre-training and fine-tuning strategy for PhenoGPT.
Figure 3. Examples of phenotype terms from 8 clinical notes recognized by PhenoTagger and PhenoBCBERT. The font size is relative to the frequency of appearance.
Figure 4. Examples of phenotype terms from clinical notes recognized by different series of GPT models. GPT comparison 1: prediction results after prompt-based learning. GPT comparison 26: prediction results after fine-tuning. GPT-3 (N): GPT-3 fine-tuned based on N instances.
Figure 5. Case studies of the predicted phenotype entities with PhenoTagger, PhenoBCBERT and PhenoGPT (GPT-3). The negation terms and misspelled terms from the original published manuscript are highlighted.
