Comparative Study
PLoS One. 2014 Aug 22;9(8):e105776.
doi: 10.1371/journal.pone.0105776. eCollection 2014.

Comparative analysis of functional metagenomic annotation and the mappability of short reads

Rogan Carr et al. PLoS One. 2014.

Abstract

To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation practices critically rely on the assumption that short reads can map to orthologous genes of similar function. This assumption, however, and the various factors that impact short read annotation, have not been systematically evaluated. To address this challenge, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effect of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We additionally rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation, providing crucial goal-specific best-practice guidelines to inform future metagenomic research.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Overview of the analysis scheme.
(a) Simulated sets of reads of length L are generated from curated and annotated reference genomes using a sliding window approach. The origin of each read is recorded, and reads are labeled to note whether they originated from genes associated with a KO (purple), from genes not associated with a KO (green), or from an intergenic region (gray). Each read is then annotated through a translated BLAST search against the KEGG database and the obtained annotation is compared to the annotation of the genome region from which the read was derived, to evaluate whether the correct gene and/or correct KO were recovered. (b) Evaluating the annotation of the S. pneumoniae genome. The inner ring represents the proportion of the genome annotated with KO genes, non-KO genes, and intergenic regions. The outer ring illustrates how reads originating from each such category were annotated, illustrating the accuracy of the annotations obtained by a BLAST-based search.
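The sliding-window read generation and origin labeling described in panel (a) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the `(start, end, label)` feature format, and the rule of labeling a read by the feature it overlaps most are all assumptions made for the example. A `step` of 1 yields every possible read of length L; a `step` of 10 yields every 10th possible read.

```python
from typing import List, Tuple

# Hypothetical feature format: (start, end, label) with half-open coordinates.
Feature = Tuple[int, int, str]


def sliding_window_reads(genome: str, read_length: int, step: int = 1) -> List[Tuple[int, str]]:
    """Return (start, read) for each window of read_length, advancing by step."""
    if read_length <= 0 or step <= 0:
        raise ValueError("read_length and step must be positive")
    last_start = len(genome) - read_length
    return [
        (start, genome[start:start + read_length])
        for start in range(0, last_start + 1, step)
    ]


def label_read(start: int, read_length: int, features: List[Feature]) -> str:
    """Label a read by the annotated feature it overlaps most.

    Assumption for this sketch: a read with no overlap is 'intergenic',
    and ties keep the first feature encountered.
    """
    best_label = "intergenic"
    best_overlap = 0
    read_end = start + read_length
    for f_start, f_end, label in features:
        overlap = min(read_end, f_end) - max(start, f_start)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label


# Toy genome with one KO gene and one non-KO gene (illustrative values only).
toy_genome = "ACGTACGTAC"
toy_features = [(0, 4, "KO gene"), (6, 10, "non-KO gene")]
toy_reads = sliding_window_reads(toy_genome, read_length=4)
toy_labels = [label_read(s, 4, toy_features) for s, _ in toy_reads]
```

Recording the start coordinate with each read is what allows the annotation obtained later to be compared against the region the read came from.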
Figure 2. The impact of phylogenetic coverage of the database on the performance of a translated BLAST-based annotation.
(a) The inner ring represents the proportion of the S. pneumoniae genome annotated with KO genes, non-KO genes, and intergenic regions, as in Figure 1. Outer rings illustrate the annotations for reads from each such category obtained with varying levels of phylogenetic coverage, ranging from having the strain from which the reads originated present in the database, to having only other genomes from the same species, genus, or higher taxonomic levels present. (b) The impact of phylogenetic database coverage on the annotation of B. fragilis, E. coli, S. elongatus, and M. maripaludis genomes.
Figure 3. The performance of BLAST-based annotation of short reads across the bacterial and archaeal tree of life.
The phylogenetic tree was obtained from Ref. . Colored rings represent the recall for identifying reads originating from a KO gene using the top gene protocol. The 4 rings correspond to varying levels of database coverage. Specifically, the innermost ring illustrates the recall obtained when the strain from which the reads originated is included in the database, while the other 3 rings, respectively, correspond to cases where only genomes from the same species, genus, or more remote taxonomic relationships are present in the database. Entries where no data were available (for example, when the strain from which the reads originated was the only member of its species) are shaded gray. For one genome in each phylum, denoted by a black dot at the branch tip, every possible 101-bp read was generated for this analysis. For the remaining genomes, every 10th possible read was used. Blue bars represent the fraction of the genome's peptide genes associated with a KO; for reference, the values are shown for E. coli, B. thetaiotaomicron, and S. pneumoniae.
Figure 4. Comparison of annotation protocols for analyzing BLAST results.
The (a) precision and (b) recall are illustrated for several protocols for identifying reads originating from KO genes when the strain from which the reads originated is absent from the database. Genomes are ordered by their precision and recall, respectively, using the top gene protocol.
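As a rough illustration of the two measures, the sketch below computes precision and recall for the task of identifying reads that originate from KO genes. The exact definitions used in the paper are not given in this record, so the sketch assumes the standard ones: a true positive is a read from a KO gene that received any KO assignment, a false positive is a read from a non-KO gene or intergenic region that received a KO assignment, and a false negative is a read from a KO gene that received none. The function name and the input format are hypothetical.

```python
from typing import List, Optional, Tuple

# Each pair is (true KO of the read's origin, KO assigned by the protocol).
# None means "no KO": the read came from a non-KO gene or intergenic region,
# or the protocol assigned no KO. This encoding is an assumption of the sketch.
ReadCall = Tuple[Optional[str], Optional[str]]


def precision_recall(calls: List[ReadCall]) -> Tuple[float, float]:
    """Return (precision, recall) for detecting reads that come from KO genes."""
    tp = sum(1 for true, pred in calls if true is not None and pred is not None)
    fp = sum(1 for true, pred in calls if true is None and pred is not None)
    fn = sum(1 for true, pred in calls if true is not None and pred is None)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative calls only, not data from the study.
example_calls = [
    ("K00001", "K00001"),  # KO read, assigned a KO      -> true positive
    ("K00002", None),      # KO read, left unassigned    -> false negative
    (None, "K00003"),      # non-KO read, assigned a KO  -> false positive
    (None, None),          # non-KO read, unassigned     -> true negative
    ("K00004", "K00004"),  # KO read, assigned a KO      -> true positive
]
example_precision, example_recall = precision_recall(example_calls)
```

A stricter variant would count a KO read as a true positive only when the assigned KO matches the true one; which variant is appropriate depends on whether the goal is detecting KO reads or recovering the correct function.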
Figure 5. Performance of BLAST-based annotation in recovering the functional profile of a complete set of reads.
(a) The probability function of predicting the copy number of a given KO in a given dataset across all simulated 101-bp datasets using the top gene protocol and when the strain from which the reads originated is absent from the database. Only KOs with copy numbers 1 to 4 are illustrated. The curve corresponding to copy number 0 represents false positive KO predictions. The smaller peaks showing in some curves (e.g., the two extra peaks in the blue “1 copy” curve) were found to be due to stretches of intergenic reads that mismapped to KO genes in the database and likely reflect genomic misannotations or pseudogenes. (b) The average recall across all simulated 101-bp datasets for identifying reads originating from each KO, ranked from highest to lowest average recall. 95% confidence intervals are shown in green. Recall is calculated for the case where the strain from which the read originated is absent from the database.
Figure 6. A principal component analysis of the pathway abundance profiles obtained for the 15 analyzed HMP samples and by the four different annotation protocols.
HMP samples are numbered from 1 to 15 according to the list that appears in Methods. The different protocols are represented by color and shape. Note that two outlier protocols for sample 14 are not shown but were included in the PCA calculation.
