The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text

Evangelos Pafilis¹, Sune P Frankild, Lucia Fanini, Sarah Faulwetter, Christina Pavloudi, Aikaterini Vasileiadou, Christos Arvanitidis, Lars Juhl Jensen

PMID: 23823062
PMCID: PMC3688812
DOI: 10.1371/journal.pone.0065390

PLoS One. 2013 Jun 18;8(6):e65390. doi: 10.1371/journal.pone.0065390. Print 2013.

Affiliation

¹ Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion, Greece.

Abstract

The exponential growth of the biomedical literature is making the need for efficient, accurate text-mining tools increasingly clear. The identification of named biological entities in text is a central and difficult task. We have developed an efficient algorithm and implementation of a dictionary-based approach to named entity recognition, which we here use to identify names of species and other taxa in text. The tool, SPECIES, is more than an order of magnitude faster than existing tools and as accurate. Precision and recall were assessed both on an existing gold-standard corpus and on a new corpus of 800 abstracts, which were manually annotated after the development of the tool. The corpus comprises abstracts from journals selected to represent many taxonomic groups, which gives insights into which types of organism names are hard to detect and which are easy. Finally, we have tagged organism names in the entire Medline database and developed a web resource, ORGANISMS, that makes the results accessible to the broad community of biologists. The SPECIES software is open source and can be downloaded from http://species.jensenlab.org along with dictionary files and the manually annotated gold-standard corpus. The ORGANISMS web resource can be found at http://organisms.jensenlab.org.
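
To make the idea of dictionary-based named entity recognition concrete, the following is a minimal Python sketch of longest-match lookup against a name dictionary. It is not the SPECIES algorithm or its implementation, which this abstract does not describe in detail. The tiny `DICTIONARY`, the `tag` function, and the punctuation handling are all assumptions made for illustration; the identifiers are NCBI Taxonomy IDs for the listed names.

```python
import re

# Toy dictionary (an assumption for illustration, not the SPECIES dictionary):
# lowercase name or synonym -> NCBI Taxonomy identifier.
DICTIONARY = {
    "homo sapiens": 9606,
    "human": 9606,
    "escherichia coli": 562,
    "e. coli": 562,
    "mus musculus": 10090,
}

# Longest dictionary entry, measured in whitespace-separated words.
MAX_WORDS = max(len(name.split()) for name in DICTIONARY)

# Punctuation stripped from both ends of a candidate before lookup.
PUNCT = "()[].,;:"


def tag(text):
    """Return (start, end, matched_text, taxon_id) tuples, preferring the
    longest dictionary match that begins at each token."""
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    matches = []
    i = 0
    while i < len(spans):
        step = 1
        # Try the longest window first, then shrink it one word at a time.
        for n in range(min(MAX_WORDS, len(spans) - i), 0, -1):
            start, end = spans[i][0], spans[i + n - 1][1]
            candidate = text[start:end]
            core = candidate.strip(PUNCT)
            taxon_id = DICTIONARY.get(core.lower())
            if taxon_id is not None:
                # Shift the start past any stripped leading punctuation.
                start += len(candidate) - len(candidate.lstrip(PUNCT))
                matches.append((start, start + len(core), core, taxon_id))
                step = n  # skip the tokens consumed by this match
                break
        i += step
    return matches


if __name__ == "__main__":
    sentence = "Human cells were infected with E. coli (Escherichia coli)."
    for hit in tag(sentence):
        print(hit)
```

Because synonyms map to the same identifier, "E. coli" and "Escherichia coli" resolve to one taxon in this sketch. A real tagger additionally has to deal with ambiguous names, abbreviations introduced within a document, and dictionaries far too large for this naive window scan to be efficient.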

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Speed and memory efficiency of the LINNAEUS and SPECIES taggers.**
The major advantage of the SPECIES tagger over existing methods is its efficiency. Compared to the methodologically similar LINNAEUS tagger, it starts up and loads its dictionary 55× faster (6 seconds vs. 6 minutes 35 seconds), tags Medline abstracts 15× faster (0.26 vs. 4.05 seconds per 1000 documents), and uses 5× less memory in the process (0.5 GB vs. 3.0 GB).

**Figure 2. Precision and recall for separate S800 categories.**
Because the S800 corpus consists of seven different taxonomic categories (the eighth category is not taxonomic), it can provide insights into which types of species are hard to identify in text and which are easy. Plotting the precision and recall on each of the seven categories separately for both the LINNAEUS and the SPECIES tagger shows little difference between the taggers, but big differences between categories. It is clear that both methods are considerably worse at tagging names of viruses than at tagging cellular organisms, and that bacterial and fungal species—for which Linnaean nomenclature is primarily used—are the easiest to identify in text.
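
As a reminder of how per-category precision and recall of this kind can be computed, here is a minimal Python sketch that compares predicted annotations against gold-standard annotations. The annotation tuple layout `(document_id, start, end, taxon_id)`, the function names, and the example data are assumptions for illustration; the numbers are invented and are not results from the S800 corpus, and the exact matching criteria used in the paper are not stated here.

```python
def precision_recall(gold, predicted):
    """Precision and recall for two sets of annotations, counting only
    exact matches as true positives."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def per_category(gold_by_category, predicted_by_category):
    """Apply precision_recall to each category separately."""
    return {
        category: precision_recall(
            gold, predicted_by_category.get(category, set())
        )
        for category, gold in gold_by_category.items()
    }


if __name__ == "__main__":
    # Invented annotations: (document_id, start, end, taxon_id).
    gold = {
        "bacteria": {("d1", 0, 16, 562), ("d2", 10, 17, 562)},
        "viruses": {("d3", 5, 8, 11676), ("d4", 0, 3, 11676)},
    }
    predicted = {
        "bacteria": {("d1", 0, 16, 562), ("d2", 10, 17, 562)},
        "viruses": {("d3", 5, 8, 11676), ("d4", 20, 23, 11676)},
    }
    for category, (p, r) in per_category(gold, predicted).items():
        print(f"{category}: precision={p:.2f} recall={r:.2f}")
```

Scoring each category on its own, rather than pooling all annotations, is what allows a comparison such as viruses versus bacteria to surface.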

**Figure 3. The ORGANISMS web resource.**
The ORGANISMS web resource (http://organisms.jensenlab.org) aims to make the results of mining the biomedical literature for taxonomic names easily accessible to biologists. It currently covers 164,084 different taxa that can be queried by name. The screenshot shows an example of what is retrieved when searching for Metatheria; because the system is aware of synonyms as well as taxonomy, it correctly retrieved and tagged an abstract about the tammar wallaby.
