Database (Oxford). 2015 Feb 27:2015:bav005.
doi: 10.1093/database/bav005. Print 2015.

Automatic concept recognition using the human phenotype ontology reference and test suite corpora

Tudor Groza et al. Database (Oxford).

Abstract

Concept recognition tools rely on the availability of textual corpora to assess their performance and enable the identification of areas for improvement. Typically, corpora are developed for specific purposes, such as gene name recognition. Gene and protein name identification are longstanding goals of biomedical text mining, and therefore a number of different corpora exist. However, phenotypes only recently became an entity of interest for specialized concept recognition systems, and hardly any annotated text is available for performance testing and training. Here, we present a unique corpus, capturing text spans from 228 abstracts manually annotated with Human Phenotype Ontology (HPO) concepts and harmonized by three curators, which can be used as a reference standard for free text annotation of human phenotypes. Furthermore, we developed a test suite for standardized concept recognition error analysis, incorporating 32 different types of test cases corresponding to 2164 HPO concepts. Finally, three established phenotype concept recognizers (NCBO Annotator, OBO Annotator and Bio-LarK CR) were comprehensively evaluated, and results are reported against both the text corpus and the test suites. The gold standard and test suites corpora are available from http://bio-lark.org/hpo_res.html. Database URL: http://bio-lark.org/hpo_res.html.

Figures

Figure 1.
Distribution of HPO test cases according to their types, mapped to the top-level HPO categories. The larger the symbol, the more test case entries the corresponding mapping has. For example, the largest number of Length-1 test case entries is found in Abnormality of the integument. In addition to providing an overview of the test suite content, this figure gives a bird's-eye view of how the lexical characteristics of concept labels vary across the top-level HPO categories. For example, only very few top-level categories contain concept labels with a length greater than 10. Similarly, metaphoric constructs seem to be present only in skeletal abnormalities, which, together with abnormalities of the integument and of the metabolism, also dominate the range of labels containing punctuation.
Figure 2.
Distribution of HPO annotations according to the top-level HPO categories. Two distributions are shown: an overall distribution that accounts for duplicate concept annotations (i.e. every instance of an annotation is counted), and a unique distribution that shows the counts of the unique concept annotations (i.e. every concept is counted once, regardless of how many annotations of it exist in the corpus).
Figure 3.
F-Score results achieved by the three systems on the HPO gold standard, distributed according to the HPO top-level category.
Figure 4.
F-Score results achieved by the three systems on the HPO test suites, distributed according to the type of the test case.
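
The F-Score reported in Figures 3 and 4 combines precision and recall of a recognizer's output against reference annotations. The Python sketch below shows one common way to compute it, assuming exact matching on document, text span and HPO concept identifier. The matching criterion, the `Annotation` tuple layout, the function name and the sample data are illustrative assumptions, not the evaluation code used in the article.

```python
from typing import Set, Tuple

# Hypothetical annotation layout: (document id, start offset, end offset, HPO id)
Annotation = Tuple[str, int, int, str]


def precision_recall_f1(gold: Set[Annotation],
                        predicted: Set[Annotation]) -> Tuple[float, float, float]:
    """Score predicted annotations against a gold standard using exact matching."""
    # A prediction counts as correct only if the same tuple is in the gold set.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up sample data, for illustration only.
gold = {
    ("doc1", 0, 13, "HP:0001250"),
    ("doc1", 20, 32, "HP:0000252"),
    ("doc2", 5, 18, "HP:0001249"),
}
predicted = {
    ("doc1", 0, 13, "HP:0001250"),
    ("doc2", 5, 18, "HP:0001249"),
    ("doc2", 30, 40, "HP:0004322"),
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Grouping the annotations by top-level HPO category or by test case type before calling the function would give per-group scores of the kind plotted in the figures. A looser criterion, such as overlapping spans, would need a different matching step.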
