Review
Cell. 2016 Aug 25;166(5):1103-1116.
doi: 10.1016/j.cell.2016.08.007.

Toward Accurate and Quantitative Comparative Metagenomics

Stephen Nayfach et al. Cell. 2016.

Abstract

Shotgun metagenomics and computational analysis are used to compare the taxonomic and functional profiles of microbial communities. Leveraging this approach to understand roles of microbes in human biology and other environments requires quantitative data summaries whose values are comparable across samples and studies. Comparability is currently hampered by the use of abundance statistics that do not estimate a meaningful parameter of the microbial community and biases introduced by experimental protocols and data-cleaning approaches. Addressing these challenges, along with improving study design, data access, metadata standardization, and analysis tools, will enable accurate comparative metagenomics. We envision a future in which microbiome studies are replicable and new metagenomes are easily and rapidly integrated with existing data. Only then can the potential of metagenomics for predictive ecological modeling, well-powered association studies, and effective microbiome medicine be fully realized.

Figures

Figure 1. Challenges Associated with Estimating the Composition of a Microbial Community from Shotgun DNA Sequencing
(A) A sample from a microbial community composed of four different microbial species. Colored cells (blue, red, green) indicate “known” species that have at least one genome sequence in reference databases. The green cell indicates a species that is rare within the microbial community. DNA contamination includes DNA from the host, laboratory environment, or experimental reagents.
(B) DNA is extracted from the microbial cells in the sample. Extraction efficiency varies for different taxa, depending on the experimental protocol. The amount of DNA extracted per cell depends on growth rate: actively dividing cells yield more genomic DNA, which accumulates at the origin of replication.
(C) Extracted DNA is broken into fragments by mechanical or enzymatic methods. Certain sequences are more likely to be breakpoints.
(D) A library is prepared from DNA fragments and sequenced. DNA fragments with high or low GC% are under-represented in the sequencing reads. Typically millions of short (e.g., 150 bp) reads are generated per sample.
(E) Bioinformatics quality-control steps may be performed to eliminate duplicate reads, trim low-quality bases from read ends, and remove reads from contamination sources or with low-quality scores.
(F) To infer the composition of the microbial community, high-quality reads are either compared to reference sequences or assembled de novo. Reference-based classification cannot account for unknown species and overestimates the abundances of known species. Metagenomic assembly may not detect rare species and overestimates the abundance of abundant species.
Figure 2. Parameters Used for Taxonomic and Functional Profiling
When computing the abundance of taxa and genes, it is important to think about what parameter of the underlying community one wishes to quantify. (A) A community with ten cells composed of three taxa with different subsets of four different gene families (colored arrows). Two cellular abundance parameters and four gene abundance parameters are defined by examples. (B) A comparison of gene relative abundance, average genomic copy number, and absolute abundance across three communities (top, middle, and bottom). The red gene is present at one copy per cell and has constant absolute abundance in all communities, but its relative abundance decreases with increasing genome size. The copy number of the blue gene increases with genome size, but its relative abundance is constant.
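The contrast in panel (B) can be reproduced with a small numeric sketch. The definitions below are working assumptions made for illustration (the figure itself defines the parameters by example, and that figure is not reproduced here): absolute abundance is the total number of gene copies in the community, average genomic copy number is copies per cell, and relative abundance is the gene's share of all gene copies. The function name, the toy communities, and the rule that the blue gene makes up one tenth of each genome are all hypothetical.

```python
def gene_parameters(n_cells, genes_per_genome, copies_per_cell):
    """Return three abundance parameters for one gene in a toy community.

    Assumes every cell carries the same genome (a simplification for illustration).
    """
    total_gene_copies = n_cells * genes_per_genome  # all genes in the community
    absolute = n_cells * copies_per_cell            # copies of this gene
    return {
        "absolute_abundance": absolute,
        "average_copy_number": absolute / n_cells,
        "relative_abundance": absolute / total_gene_copies,
    }


# Three hypothetical communities of 10 cells with increasing genome size.
communities = [(10, 10), (10, 20), (10, 40)]  # (cells, genes per genome)

# "Red" gene: one copy per cell regardless of genome size.
red = [gene_parameters(n, g, 1) for n, g in communities]

# "Blue" gene: copy number grows in proportion to genome size.
blue = [gene_parameters(n, g, g // 10) for n, g in communities]

for label, rows in (("red", red), ("blue", blue)):
    for row in rows:
        print(label, row)
```

In this toy setting the red gene keeps the same absolute abundance and copy number while its relative abundance halves with each doubling of genome size, and the blue gene keeps a constant relative abundance while its copy number rises.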
Figure 3. Differences in Functional Profiles due to Read Length, Library Size, and Quality Control Are Small Compared to Biological Variation
Publicly available metagenomes often differ in their library sizes, read lengths, and quality-control measures, which leads one to ask, how comparable are metagenomes from different studies? Twenty-six human gut metagenomes of varying quality were processed using different quality-control methods, and the resulting reads were used to estimate the relative abundance of KEGG Orthology Groups (KOs). We compared the variation introduced by these factors (top) with the variation observed between a large set of technical (N = 1,474), biological (N = 144), and non-replicate gut metagenomes (N = 179) from the Human Microbiome Project (Consortium, 2012) that contained at least one million reads (bottom). Trimming reads from their 5′ ends was done to simulate libraries of different read length; downsampling metagenomes by 95% was done to simulate libraries of different size; fastq-mcf (Aronesty, 2011) was used for de-duplication and quality filtering. To estimate the average genomic copy number of functional groups, reads were mapped to the integrated catalog of reference genes in the human gut microbiome (Li et al., 2014a, 2014b) using bowtie2 (Langmead and Salzberg, 2012) and normalized by the median coverage of 30 universal single-copy genes (Wu et al., 2013). The percent variation between two metagenomes was measured by the following: (1) taking the sum of absolute deviations across KOs, (2) dividing this by the total abundance of KOs in both metagenomes, and (3) multiplying this by 100.
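The two calculations described in this caption, normalizing by the median coverage of single-copy marker genes and the three-step percent variation, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the dictionary-based profile format, and the toy numbers are assumptions, and a KO missing from one profile is treated as having abundance zero.

```python
from statistics import median


def average_copy_number(gene_coverage, marker_coverages):
    """Estimate average genomic copy number of a gene or functional group.

    Divides the gene's coverage by the median coverage of universal
    single-copy marker genes (30 of them in the caption).
    """
    return gene_coverage / median(marker_coverages)


def percent_variation(profile_a, profile_b):
    """Percent variation between two KO abundance profiles (dict: KO -> abundance)."""
    kos = set(profile_a) | set(profile_b)
    # (1) Sum of absolute deviations across KOs.
    abs_dev = sum(abs(profile_a.get(k, 0.0) - profile_b.get(k, 0.0)) for k in kos)
    # (2) Divide by the total abundance of KOs in both metagenomes.
    total = sum(profile_a.values()) + sum(profile_b.values())
    if total == 0:
        raise ValueError("both profiles are empty or sum to zero")
    # (3) Multiply by 100.
    return 100.0 * abs_dev / total


# Toy profiles with made-up values.
sample_1 = {"K00001": 0.5, "K00002": 0.5}
sample_2 = {"K00001": 0.25, "K00002": 0.75}

variation = percent_variation(sample_1, sample_2)
copy_number = average_copy_number(20.0, [8.0, 10.0, 12.0])
print(variation, copy_number)
```

Under this definition, identical profiles give 0% and profiles that share no KOs give 100%, which makes the statistic easy to read as a bounded dissimilarity.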
Figure 4. The Presence of Duplicated Reads Is Largely a Function of Library Size and Microbial Diversity
FastQC was used to estimate the percentage of duplicated reads across 181 human gut metagenomes from the Human Microbiome Project, and this percentage was compared to (A) library size and (B) species-level alpha diversity, measured with the Shannon diversity index (Keylock, 2005). Species abundance of bacteria and archaea was estimated with mOTU (Sunagawa et al., 2013). Together, library size and Shannon diversity explain 63% of the variation in sequence duplication rates.
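For readers unfamiliar with the diversity measure in panel (B), the Shannon index can be computed from species abundances as H = -Σ p_i ln p_i. The sketch below assumes the natural logarithm (the caption does not state the base) and uses a hypothetical function name and made-up abundance values.

```python
import math


def shannon_diversity(abundances):
    """Shannon diversity index (natural log) from non-negative species abundances."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    # Species with zero abundance contribute nothing to the sum.
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


even_community = shannon_diversity([25, 25, 25, 25])   # four equally abundant species
skewed_community = shannon_diversity([97, 1, 1, 1])    # one dominant species
print(even_community, skewed_community)
```

An even community has higher diversity than a skewed one with the same number of species, and four equally abundant species give H = ln 4.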
Figure 5. Growth of Shotgun Metagenome Data in the NCBI Sequence Read Archive
Cumulative size in terabases of publicly available shotgun metagenomic data in the NCBI Sequence Read Archive (SRA). Sequencing runs were identified using the SRAdb database (Zhu et al., 2013) with the following filters: library_source = “METAGENOMIC”, study_type = “Metagenomics”, and library_strategy = “WGS”.
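The selection and accumulation described in this caption can be sketched on toy records. This is not a query against SRAdb itself: the three filter field names follow the caption, while the `date` and `bases` fields, the function name, and all record values are hypothetical.

```python
from itertools import accumulate

# Filter values follow the caption; library_source is spelled as one word
# (METAGENOMIC), which is assumed to be how SRA records the value.
FILTERS = {
    "library_source": "METAGENOMIC",
    "study_type": "Metagenomics",
    "library_strategy": "WGS",
}


def cumulative_terabases(runs):
    """Return (date, cumulative terabases) for runs that match every filter."""
    selected = sorted(
        (r for r in runs if all(r.get(k) == v for k, v in FILTERS.items())),
        key=lambda r: r["date"],  # ISO dates sort correctly as strings
    )
    running_total = accumulate(r["bases"] for r in selected)
    return [(r["date"], total / 1e12) for r, total in zip(selected, running_total)]


# Made-up run records for illustration only.
runs = [
    {"date": "2013-06-01", "bases": 1_000_000_000_000,
     "library_source": "METAGENOMIC", "study_type": "Metagenomics", "library_strategy": "WGS"},
    {"date": "2012-01-01", "bases": 2_000_000_000_000,
     "library_source": "METAGENOMIC", "study_type": "Metagenomics", "library_strategy": "WGS"},
    {"date": "2012-09-01", "bases": 5_000_000_000_000,
     "library_source": "METAGENOMIC", "study_type": "Metagenomics", "library_strategy": "AMPLICON"},
]

growth = cumulative_terabases(runs)
print(growth)
```

The amplicon run is excluded because it fails the library_strategy filter, so only the two WGS runs contribute to the running total.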

References

Abubucker S, Segata N, Goll J, Schubert AM, Izard J, Cantarel BL, Rodriguez-Mueller B, Zucker J, Thiagarajan M, Henrissat B, et al. Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput Biol. 2012;8:e1002358.
Aitchison J. The Statistical Analysis of Compositional Data. Caldwell, NJ: Blackburn Press; 2003.
Alivisatos AP, Blaser MJ, Brodie EL, Chun M, Dangl JL, Donohue TJ, Dorrestein PC, Gilbert JA, Green JL, Jansson JK, et al. MICROBIOME. A unified initiative to harness Earth’s microbiomes. Science. 2015;350:507–508.
Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti L, Loman NJ, Andersson AF, Quince C. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11:1144–1146.
Ames SK, Gardner SN, Marti JM, Slezak TR, Gokhale MB, Allen JE. Using populations of human and microbial genomes for organism detection in metagenomes. Genome Res. 2015;25:1056–1067.