BMC Bioinformatics. 2010 Feb 11;11:85.
doi: 10.1186/1471-2105-11-85.

LINNAEUS: a species name identification system for biomedical literature

Martin Gerner et al. BMC Bioinformatics.

Abstract

Background: The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles.

Results: In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers.
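The dictionary-based matching described above can be illustrated with a minimal sketch. This is not the LINNAEUS implementation: it swaps in a simple token-level trie (one form of deterministic automaton) with longest-match lookup, and the names `TrieNode`, `build_trie`, `tag_species`, and `TOY_DICTIONARY` are hypothetical. The toy dictionary holds only a few unambiguous names mapped to NCBI taxonomy identifiers, and the heuristics for resolving ambiguous mentions are not reproduced.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: species name -> NCBI taxonomy identifiers.
# A real dictionary would be far larger and contain ambiguous names.
TOY_DICTIONARY: Dict[str, List[int]] = {
    "homo sapiens": [9606],
    "human": [9606],
    "mus musculus": [10090],
    "mouse": [10090],
}

PUNCTUATION = ".,;:()"


class TrieNode:
    """One automaton state; transitions are keyed by lowercase token."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.taxon_ids: List[int] = []  # non-empty only in accepting states


def build_trie(dictionary: Dict[str, List[int]]) -> TrieNode:
    """Compile the name dictionary into a token-level trie."""
    root = TrieNode()
    for name, ids in dictionary.items():
        node = root
        for token in name.lower().split():
            node = node.children.setdefault(token, TrieNode())
        node.taxon_ids = list(ids)
    return root


def tag_species(text: str, root: TrieNode) -> List[Tuple[str, List[int]]]:
    """Return (mention, taxonomy ids) pairs, preferring the longest match."""
    tokens = text.split()
    normalized = [t.lower().strip(PUNCTUATION) for t in tokens]
    mentions: List[Tuple[str, List[int]]] = []
    i = 0
    while i < len(tokens):
        node = root
        best_end, best_ids = None, []
        j = i
        # Follow transitions as far as possible, remembering the last
        # accepting state so that the longest dictionary name wins.
        while j < len(tokens) and normalized[j] in node.children:
            node = node.children[normalized[j]]
            j += 1
            if node.taxon_ids:
                best_end, best_ids = j, node.taxon_ids
        if best_end is not None:
            mention = " ".join(tokens[i:best_end]).strip(PUNCTUATION)
            mentions.append((mention, best_ids))
            i = best_end
        else:
            i += 1
    return mentions


trie = build_trie(TOY_DICTIONARY)
result = tag_species("Homo sapiens and Mus musculus (mouse) were compared.", trie)
```

In this sketch a name that maps to several identifiers would simply come back with all of them, which is the point where a disambiguation step like the one the abstract mentions would have to take over.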

Conclusions: LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.

Figures

Figure 1
Overview of the LINNAEUS species name identification system. (A) Schematic diagram of the species name dictionary and automaton construction. (B) Schematic of species names tagging and post-processing.
Figure 2
Number of articles per year in MEDLINE mentioning human, rat, mouse, cow, rabbit and HIV since 1975. Note that the rapid rise in mentions of the term HIV occurs just after its discovery in 1983 [59].

References

    1. MEDLINE. http://www.nlm.nih.gov/databases/databases_medline.html
    2. PubMed Central. http://www.ncbi.nlm.nih.gov/pmc/
    3. Jensen LJ, Saric J, Bork P. Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics. 2006;7(2):119–129. doi: 10.1038/nrg1768.
    4. Krallinger M, Hirschman L, Valencia A. Current use of text mining and literature search systems for genome sciences. Genome Biology. 2008;9(Suppl 2):S8. doi: 10.1186/gb-2008-9-s2-s8.
    5. Hanisch D, Fundel K, Mevissen HT, Zimmer R, Fluck J. ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics. 2005;6(Suppl 1):S14. doi: 10.1186/1471-2105-6-S1-S14.
