PLoS One. 2013 Jun 18;8(6):e65390.
doi: 10.1371/journal.pone.0065390. Print 2013.

The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text

Evangelos Pafilis et al. PLoS One.

Abstract

The exponential growth of the biomedical literature is making the need for efficient, accurate text-mining tools increasingly clear. The identification of named biological entities in text is a central and difficult task. We have developed an efficient algorithm and implementation of a dictionary-based approach to named entity recognition, which we here use to identify names of species and other taxa in text. The tool, SPECIES, is more than an order of magnitude faster than existing tools and equally accurate. Precision and recall were assessed both on an existing gold-standard corpus and on a new corpus of 800 abstracts, which were manually annotated after the development of the tool. The corpus comprises abstracts from journals selected to represent many taxonomic groups, which gives insights into which types of organism names are hard to detect and which are easy. Finally, we have tagged organism names in the entire Medline database and developed a web resource, ORGANISMS, that makes the results accessible to the broad community of biologists. The SPECIES software is open source and can be downloaded from http://species.jensenlab.org along with dictionary files and the manually annotated gold-standard corpus. The ORGANISMS web resource can be found at http://organisms.jensenlab.org.
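To make the idea of dictionary-based named entity recognition concrete, the following is a minimal sketch of longest-match dictionary lookup. It is not the SPECIES algorithm or its implementation; the dictionary entries, the labels, and the function name `tag_names` are all invented for illustration, and a real tagger would need a far larger dictionary and more careful tokenization.

```python
import re

# Toy dictionary (an assumption for illustration): lowercase name -> made-up label.
TOY_DICTIONARY = {
    "escherichia coli": "taxon:E_coli",
    "e. coli": "taxon:E_coli",
    "tammar wallaby": "taxon:M_eugenii",
    "macropus eugenii": "taxon:M_eugenii",
}

WORD = re.compile(r"[A-Za-z]+")


def tag_names(text, dictionary=TOY_DICTIONARY):
    """Return (start, end, matched_text, label) tuples using longest-match lookup."""
    # Character offsets of every alphabetic word in the text.
    words = [(m.start(), m.end()) for m in WORD.finditer(text)]
    # The longest dictionary name, measured in words, bounds the search window.
    max_words = max(len(WORD.findall(name)) for name in dictionary)

    matches = []
    i = 0
    while i < len(words):
        hit = None
        # Try the longest candidate span first, then shrink it word by word.
        for n in range(min(max_words, len(words) - i), 0, -1):
            start, end = words[i][0], words[i + n - 1][1]
            # Lowercase and collapse whitespace before the dictionary lookup.
            candidate = " ".join(text[start:end].lower().split())
            if candidate in dictionary:
                hit = (start, end, text[start:end], dictionary[candidate])
                i += n  # skip past the matched words
                break
        if hit:
            matches.append(hit)
        else:
            i += 1
    return matches


if __name__ == "__main__":
    sample = "The tammar wallaby (Macropus eugenii) and E. coli were studied."
    for match in tag_names(sample):
        print(match)
```

Because synonyms map to the same label, the common name and the Linnaean name in the sample sentence resolve to one taxon, which is the property a dictionary-based approach relies on.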

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Speed and memory efficiency of the LINNAEUS and SPECIES taggers.
The major advantage of the SPECIES tagger over existing methods is its efficiency. Compared to the methodologically similar LINNAEUS tagger, it starts up and loads its dictionary 55× faster (6 seconds vs. 6 minutes 35 seconds), tags Medline abstracts 15× faster (0.26 vs. 4.05 seconds per 1000 documents), and uses 5× less memory in the process (0.5 GB vs. 3.0 GB).
Figure 2. Precision and recall for separate S800 categories.
Because the S800 corpus consists of seven different taxonomic categories (the eighth category is not taxonomic), it can provide insights into which types of species are hard to identify in text and which are easy. Plotting the precision and recall on each of the seven categories separately for both the LINNAEUS and the SPECIES tagger shows little difference between the taggers, but large differences between categories. Both methods are considerably worse at tagging names of viruses than at tagging cellular organisms, and bacterial and fungal species, for which Linnaean nomenclature is primarily used, are the easiest to identify in text.
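As a reminder of how per-category precision and recall can be computed, here is a minimal sketch. It assumes annotations are represented as (document, start, end, taxon) tuples and that a prediction counts as correct only on an exact match; the paper's actual matching criterion may differ. The category names and annotation values in the usage part are invented and are not results from S800.

```python
def precision_recall(gold, predicted):
    """Precision and recall for two collections of annotation tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def per_category(gold_by_category, predicted_by_category):
    """Evaluate each category separately; missing predictions count as empty."""
    return {
        category: precision_recall(gold, predicted_by_category.get(category, ()))
        for category, gold in gold_by_category.items()
    }


if __name__ == "__main__":
    # Invented toy annotations: (document id, start, end, taxon label).
    gold = {
        "category_a": {("d1", 0, 5, "t1"), ("d1", 10, 15, "t2")},
        "category_b": {("d2", 3, 9, "t4"), ("d2", 20, 28, "t5")},
    }
    predicted = {
        "category_a": {("d1", 0, 5, "t1"), ("d1", 20, 25, "t3")},
        "category_b": {("d2", 3, 9, "t4"), ("d2", 20, 28, "t5")},
    }
    for category, (p, r) in per_category(gold, predicted).items():
        print(f"{category}: precision={p:.2f} recall={r:.2f}")
```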
Figure 3. The ORGANISMS web resource.
The ORGANISMS web resource (http://organisms.jensenlab.org) aims to make the results of mining the biomedical literature for taxonomic names easily accessible to biologists. It currently covers 164,084 different taxa that can be queried by name. The screenshot shows an example of what is retrieved when searching for Metatheria; because the system is aware of synonyms as well as taxonomy, it correctly retrieved and tagged an abstract about the tammar wallaby.
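The retrieval behaviour described in this caption can be illustrated with a small sketch of synonym- and taxonomy-aware search. This is not how ORGANISMS is implemented; the parent map is a simplified lineage that omits intermediate ranks, and the document ids, synonym table, and function names are assumptions made for illustration.

```python
# Simplified child -> parent lineage (intermediate ranks omitted; illustrative only).
PARENT = {
    "Macropus eugenii": "Macropus",
    "Macropus": "Macropodidae",
    "Macropodidae": "Diprotodontia",
    "Diprotodontia": "Metatheria",
}

# Synonyms resolve to one canonical taxon name (toy table).
SYNONYMS = {
    "tammar wallaby": "Macropus eugenii",
    "macropus eugenii": "Macropus eugenii",
    "metatheria": "Metatheria",
}

# Hypothetical tagging output: document id -> canonical taxa found in it.
TAGGED_DOCS = {
    "doc1": {"Macropus eugenii"},
    "doc2": {"Escherichia coli"},
}


def ancestors(taxon):
    """Walk the parent map upward and return every ancestor of a taxon."""
    chain = []
    while taxon in PARENT:
        taxon = PARENT[taxon]
        chain.append(taxon)
    return chain


def search(query):
    """Return ids of documents tagged with the queried taxon or any descendant."""
    # Resolve a synonym to its canonical name; fall back to the query itself.
    target = SYNONYMS.get(query.lower(), query)
    return sorted(
        doc
        for doc, taxa in TAGGED_DOCS.items()
        if any(t == target or target in ancestors(t) for t in taxa)
    )


if __name__ == "__main__":
    print(search("Metatheria"))
    print(search("tammar wallaby"))
```

In this toy setup a query for Metatheria returns the document tagged with Macropus eugenii because Metatheria appears among that species' ancestors, mirroring the example in the screenshot.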
