Comparative Study

PLoS One. 2014 Aug 22;9(8):e105776.

doi: 10.1371/journal.pone.0105776. eCollection 2014.

Comparative analysis of functional metagenomic annotation and the mappability of short reads

Rogan Carr¹, Elhanan Borenstein²

Affiliations

¹ Department of Genome Sciences, University of Washington, Seattle, WA, United States of America.
² Department of Genome Sciences, University of Washington, Seattle, WA, United States of America; Department of Computer Science and Engineering, University of Washington, Seattle, WA, United States of America; Santa Fe Institute, Santa Fe, NM, United States of America.

PMID: 25148512
PMCID: PMC4141809
DOI: 10.1371/journal.pone.0105776

Rogan Carr et al. PLoS One. 2014.

Abstract

To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation practices critically rely on the assumption that short reads can map to orthologous genes of similar function. This assumption, however, and the various factors that impact short read annotation, have not been systematically evaluated. To address this challenge, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effect of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We additionally rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation, providing crucial goal-specific best-practice guidelines to inform future metagenomic research.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Overview of the analysis scheme.**
(a) Simulated sets of reads of length L are generated from curated and annotated reference genomes using a sliding window approach. The origin of each read is recorded, and reads are labeled to note whether they originated from genes associated with a KO (purple), from genes not associated with a KO (green), or from an intergenic region (gray). Each read is then annotated through a translated BLAST search against the KEGG database and the obtained annotation is compared to the annotation of the genome region from which the read was derived, to evaluate whether the correct gene and/or correct KO were recovered. (b) Evaluating the annotation of the *S. pneumoniae* genome. The inner ring represents the proportion of the genome annotated with KO genes, non-KO genes, and intergenic regions. The outer ring illustrates how reads originating from each such category were annotated, illustrating the accuracy of the annotations obtained by a BLAST-based search.
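The read-generation step in panel (a) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Feature` tuple layout, the label strings, and the rule that a read takes the label of the feature it overlaps most are all assumptions made here for clarity, since the caption does not say how reads spanning a boundary were labeled.

```python
from typing import Iterator, List, Tuple

# Hypothetical annotation record: (start, end, label), half-open coordinates.
# Labels such as "KO" and "non-KO" are illustrative names, not from the paper.
Feature = Tuple[int, int, str]


def label_for(start: int, end: int, features: List[Feature]) -> str:
    """Return the label of the feature that overlaps the read most.

    Assumption: a read that overlaps no feature is called "intergenic".
    """
    best_label, best_overlap = "intergenic", 0
    for f_start, f_end, f_label in features:
        overlap = min(end, f_end) - max(start, f_start)
        if overlap > best_overlap:
            best_label, best_overlap = f_label, overlap
    return best_label


def sliding_reads(
    genome: str, length: int, features: List[Feature], step: int = 1
) -> Iterator[Tuple[int, str, str]]:
    """Yield (origin, sequence, label) for every window of the given length."""
    for start in range(0, len(genome) - length + 1, step):
        end = start + length
        yield start, genome[start:end], label_for(start, end, features)


if __name__ == "__main__":
    toy_genome = "ACGTACGTAC"
    toy_features = [(0, 3, "KO")]
    for origin, seq, label in sliding_reads(toy_genome, 3, toy_features):
        print(origin, seq, label)
```

With `step=1` the generator emits every possible read of the chosen length; a larger step such as `step=10` would correspond to sampling every 10th possible read.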

**Figure 2. The impact of phylogenetic coverage of the database on the performance of a translated BLAST-based annotation.**
(a) The inner ring represents the proportion of the *S. pneumoniae* genome annotated with KO genes, non-KO genes, and intergenic regions, as in Figure 1. Outer rings illustrate the annotations for reads from each such category obtained with varying levels of phylogenetic coverage, ranging from having the strain from which the reads originated present in the database, to having only other genomes from the same species, genus, or higher taxonomic levels present. (b) The impact of phylogenetic database coverage on the annotation of *B. fragilis*, *E. coli*, *S. elongatus*, and *M. maripaludis* genomes.

**Figure 3. The performance of BLAST-based annotation of short reads across the bacterial and archaeal tree of life.**
The phylogenetic tree was obtained from Ref. . Colored rings represent the recall for identifying reads originating from a KO gene using the *top gene* protocol. The 4 rings correspond to varying levels of database coverage. Specifically, the innermost ring illustrates the recall obtained when the strain from which the reads originated is included in the database, while the other 3 rings, respectively, correspond to cases where only genomes from the same species, genus, or more remote taxonomic relationships are present in the database. Entries where no data were available (for example, when the strain from which the reads originated was the only member of its species) are shaded gray. For one genome in each phylum, denoted by a black dot at the branch tip, every possible 101-bp read was generated for this analysis. For the remaining genomes, every 10th possible read was used. Blue bars represent the fraction of the genome's peptide genes associated with a KO; for reference, the values are shown for *E. coli*, *B. thetaiotaomicron*, and *S. pneumoniae*.

**Figure 4. Comparison of annotation protocols for analyzing BLAST results.**
The (a) precision and (b) recall are illustrated for several protocols for identifying reads originating from KO genes when the strain from which the reads originated is absent from the database. Genomes are ordered by their precision and recall, respectively, using the *top gene* protocol.
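For readers unfamiliar with the two metrics, the sketch below computes precision and recall for the task of identifying reads that originate from KO genes. It uses the standard definitions and assumes, for illustration only, that a read counts as a positive prediction whenever it receives any KO annotation; the paper's exact scoring rules (for example, whether the specific KO must match) are not given in this caption.

```python
from typing import Iterable, Tuple


def precision_recall(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute (precision, recall) from per-read (truth, prediction) pairs.

    truth:      the read was generated from a gene associated with a KO
    prediction: the annotation protocol assigned the read a KO
    Both fields are hypothetical names chosen for this sketch.
    """
    tp = fp = fn = 0
    for truth, predicted in pairs:
        if truth and predicted:
            tp += 1          # KO read correctly given a KO annotation
        elif predicted:
            fp += 1          # non-KO or intergenic read given a KO annotation
        elif truth:
            fn += 1          # KO read left without a KO annotation
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    toy = [(True, True)] * 3 + [(False, True)] + [(True, False)] * 2
    print(precision_recall(toy))
```

Precision penalizes spurious KO calls on reads from non-KO or intergenic regions, whereas recall penalizes KO-derived reads that go unannotated, which is why the two panels can rank genomes differently.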

**Figure 5. Performance of BLAST-based annotation in recovering the functional profile of a complete set of reads.**
(a) The probability function of predicting the copy number of a given KO in a given dataset across all simulated 101-bp datasets using the *top gene* protocol and when the strain from which the reads originated is absent from the database. Only KOs with copy numbers 1 to 4 are illustrated. The curve corresponding to copy number 0 represents false positive KO predictions. The smaller peaks showing in some curves (e.g., the two extra peaks in the blue “1 copy” curve) were found to be due to stretches of intergenic reads that mismapped to KO genes in the database and likely reflect genomic misannotations or pseudogenes. (b) The average recall across all simulated 101-bp datasets for identifying reads originating from each KO, ranked from highest to lowest average recall. 95% confidence intervals are shown in green. Recall is calculated for the case where the strain from which the read originated is absent from the database.

**Figure 6. A principal component analysis of the pathway abundance profiles obtained for the 15 analyzed HMP samples and by the four different annotation protocols.**
HMP samples are numbered from 1 to 15 according to the list that appears in *Methods*. The different protocols are represented by color and shape. Note that two outlier protocols for sample 14 are not shown but were included in the PCA calculation.
