BMC Bioinformatics. 2012 Jul 9;13:161.
doi: 10.1186/1471-2105-13-161

Concept annotation in the CRAFT corpus

Michael Bada et al. BMC Bioinformatics.

Abstract

Background: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.

Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.

Conclusions: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that make up the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

Figures

Figure 1
IAA statistics for ChEBI, GO BP/MF, and GO CC markup. Plot of interannotator agreement (IAA) versus number of training sessions/meetings (approximately weekly) for annotation of the corpus with the ChEBI ontology, GO BP and MF, and GO CC. IAA has been calculated as F-score, which is the harmonic mean of precision and recall.
Figure 2
IAA statistics for CL, NCBITaxon, and SO markup. Plot of interannotator agreement (IAA) versus number of training sessions/meetings (approximately weekly) for annotation of the corpus with the SO, CL, and NCBI Taxonomy. IAA has been calculated as F-score, which is the harmonic mean of precision and recall.
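
The figure captions describe IAA as an F-score, the harmonic mean of precision and recall. A minimal sketch of that calculation follows, treating one annotator's markup as the reference and the other's as the response. The `Annotation` type, the `f_score_iaa` function, the exact-match criterion (same span and same concept identifier), and the sample annotations are all assumptions made for illustration; the matching criteria actually used for CRAFT are not stated in this record.

```python
from typing import NamedTuple, Set


class Annotation(NamedTuple):
    """Hypothetical representation of one concept annotation."""
    start: int        # character offset where the mention begins
    end: int          # character offset where the mention ends
    concept_id: str   # ontology identifier assigned to the mention


def f_score_iaa(reference: Set[Annotation], response: Set[Annotation]) -> float:
    """Return the F-score between two annotators' annotation sets.

    Assumption: two annotations match only if span and concept are identical.
    Convention chosen here: the score is 0.0 if either set is empty.
    """
    if not reference or not response:
        return 0.0
    matched = len(reference & response)
    if matched == 0:
        return 0.0
    precision = matched / len(response)   # share of response found in reference
    recall = matched / len(reference)     # share of reference found in response
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Illustrative data only; these are not annotations from the corpus.
annotator_a = {
    Annotation(0, 4, "CL:0000540"),
    Annotation(10, 17, "CHEBI:15377"),
    Annotation(25, 30, "SO:0000704"),
}
annotator_b = {
    Annotation(0, 4, "CL:0000540"),
    Annotation(10, 17, "CHEBI:15377"),
    Annotation(40, 45, "GO:0005634"),
    Annotation(50, 55, "GO:0005634"),
}

iaa = f_score_iaa(annotator_a, annotator_b)
print(f"IAA (F-score): {iaa:.3f}")
```

With two shared annotations out of three and four respectively, precision is 2/4 and recall is 2/3. The F-score is symmetric, so swapping which annotator serves as the reference leaves the result unchanged.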
