Bioinformatics. 2024 Jul 1;40(7):btae406.
doi: 10.1093/bioinformatics/btae406.

FastHPOCR: pragmatic, fast, and accurate concept recognition using the human phenotype ontology

Tudor Groza et al. Bioinformatics.

Abstract

Motivation: Human Phenotype Ontology (HPO)-based phenotype concept recognition (CR) underpins a faster and more effective mechanism to create patient phenotype profiles or to document novel phenotype-centred knowledge statements. While the increasing adoption of large language models (LLMs) for natural language understanding has led to several LLM-based solutions, we argue that their intrinsic resource-intensive nature is not suitable for realistic management of the phenotype CR lifecycle. Consequently, we propose to go back to the basics and adopt a dictionary-based approach that enables both an immediate refresh of the ontological concepts as well as efficient re-analysis of past data.

Results: We developed a dictionary-based approach that uses a large, pre-built collection of clusters of morphologically equivalent tokens to address lexical variability, together with a more effective CR step that restricts entity boundary detection strictly to candidates consisting of tokens that belong to ontology concepts. Our method achieves state-of-the-art results (0.76 F1 on the GSC+ corpus) and a processing efficiency of 10 000 publication abstracts in 5 s.
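The indexing idea described above can be illustrated with a minimal sketch. It is not the FastHPOCR implementation: the cluster table, the stop-word list, and the function names (`tokenize`, `signature`, `build_index`) are assumptions made for illustration, and the real pre-built cluster collection is far larger. Each concept label is reduced to a set of cluster IDs, so morphological variants and word order no longer affect the lookup.

```python
import re

# Hypothetical clusters of morphologically equivalent tokens (toy data,
# standing in for the large pre-built collection mentioned in the abstract).
CLUSTERS = {
    "c1": {"seizure", "seizures"},
    "c2": {"delay", "delayed", "delays"},
    "c3": {"development", "developmental"},
    "c4": {"global"},
    "c5": {"microcephaly", "microcephalic"},
}
TOKEN_TO_CLUSTER = {tok: cid for cid, toks in CLUSTERS.items() for tok in toks}

# Assumed stop-word list used in the cleaning step.
STOPWORDS = {"of", "the", "and", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def signature(text):
    """Map a label to an order-independent set of cluster IDs.

    Tokens without a cluster fall back to the token itself.
    """
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return frozenset(TOKEN_TO_CLUSTER.get(t, t) for t in tokens)


def build_index(concepts):
    """Index every label and synonym of every concept by its signature."""
    index = {}
    for hpo_id, names in concepts.items():
        for name in names:
            index[signature(name)] = hpo_id
    return index


# A few HPO concepts with their primary labels only (no synonyms shown).
CONCEPTS = {
    "HP:0001250": ["Seizure"],
    "HP:0001263": ["Global developmental delay"],
    "HP:0000252": ["Microcephaly"],
}
INDEX = build_index(CONCEPTS)

# A reordered, morphologically different phrase maps to the same entry.
print(INDEX.get(signature("delayed global development")))
```

Because the index is a plain dictionary built directly from labels and synonyms, rebuilding it after an ontology release is just a re-run of `build_index`, which is the kind of immediate refresh the Motivation argues for.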

Availability and implementation: FastHPOCR is available as a Python package installable via pip. The source code is available at https://github.com/tudorgroza/fast_hpo_cr. A Java implementation of FastHPOCR will be made available as part of the Fenominal Java library available at https://github.com/monarch-initiative/fenominal. The up-to-date GSC-2024 corpus is available at https://github.com/tudorgroza/code-for-papers/tree/main/gsc-2024.

Conflict of interest statement

None declared.

Figures

Figure 1.
The evolution of the size of HPO in terms of number of concepts between December 2022 and February 2024. The line in the chart represents the growth in new terms added to the ontology, which amounted to 1303 terms over this period, i.e. a ∼7% increase.
Figure 2.
High-level overview of the indexing process. Starting from the left, the labels and synonyms are tokenized, cleaned, and consolidated using the clusters of morphologically equivalent tokens. Sets of such cluster IDs are then serialized as representations of the ontology concepts in the index.
Figure 3.
High-level overview of the concept recognition process. The text is tokenized and the tokens are looked up in the ontology index. The gaps left by the tokens absent from the index are used to identify candidates for entity linking.
Figure 4.
High-level comparison between the GSC+ corpus and the new GSC 2024 corpus, using the top-level HPO abnormalities as reference.
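
The concept recognition process outlined in the Figure 3 caption can be sketched as follows. This is an illustrative approximation under stated assumptions, not the published algorithm: the token-to-cluster table and the index are toy data, the function names are hypothetical, and the greedy longest-first matching inside each candidate is an assumed linking strategy. Tokens that are absent from the index act as gaps, and the runs of known tokens between gaps become the candidates for entity linking.

```python
import re

# Toy token-to-cluster table and ontology index (assumed for illustration).
TOKEN_TO_CLUSTER = {
    "seizure": "c1", "seizures": "c1",
    "delay": "c2", "delayed": "c2",
    "development": "c3", "developmental": "c3",
    "global": "c4",
}
INDEX = {
    frozenset({"c1"}): "HP:0001250",              # Seizure
    frozenset({"c2", "c3", "c4"}): "HP:0001263",  # Global developmental delay
}


def candidate_runs(tokens):
    """Split the token stream at tokens unknown to the index (the gaps)."""
    runs, current = [], []
    for tok in tokens:
        if tok in TOKEN_TO_CLUSTER:
            current.append(tok)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


def link(run):
    """Greedy longest-first lookup of sub-spans of a candidate run."""
    hits, i = [], 0
    while i < len(run):
        for j in range(len(run), i, -1):
            key = frozenset(TOKEN_TO_CLUSTER[t] for t in run[i:j])
            if key in INDEX:
                hits.append((" ".join(run[i:j]), INDEX[key]))
                i = j
                break
        else:
            # No sub-span starting at position i matched; move on.
            i += 1
    return hits


def recognize(text):
    """Return (matched text, HPO ID) pairs found in the input text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [hit for run in candidate_runs(tokens) for hit in link(run)]


print(recognize("The patient had seizures and delayed global development."))
```

Restricting boundary detection to runs of index tokens keeps the search space small, since only short candidate spans are ever compared against the index.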

