Toward Accurate and Quantitative Comparative Metagenomics

doi:10.1016/j.cell.2016.08.007

Review

Cell. 2016 Aug 25;166(5):1103-1116.

Stephen Nayfach¹, Katherine S Pollard²

Affiliations

¹ Integrative Program in Quantitative Biology, University of California, San Francisco, CA 94158, USA; Gladstone Institutes, San Francisco, CA 94158, USA.
² Gladstone Institutes, San Francisco, CA 94158, USA; Division of Biostatistics, Institute for Human Genetics, and Institute for Computational Health Sciences, University of California, San Francisco, CA 94158, USA. Electronic address: kpollard@gladstone.ucsf.edu.

PMID: 27565341
PMCID: PMC5080976
DOI: 10.1016/j.cell.2016.08.007

Stephen Nayfach et al. Cell. 2016.

Abstract

Shotgun metagenomics and computational analysis are used to compare the taxonomic and functional profiles of microbial communities. Leveraging this approach to understand roles of microbes in human biology and other environments requires quantitative data summaries whose values are comparable across samples and studies. Comparability is currently hampered by the use of abundance statistics that do not estimate a meaningful parameter of the microbial community and biases introduced by experimental protocols and data-cleaning approaches. Addressing these challenges, along with improving study design, data access, metadata standardization, and analysis tools, will enable accurate comparative metagenomics. We envision a future in which microbiome studies are replicable and new metagenomes are easily and rapidly integrated with existing data. Only then can the potential of metagenomics for predictive ecological modeling, well-powered association studies, and effective microbiome medicine be fully realized.

Figures

**Figure 1. Challenges Associated with Estimating the Composition of a Microbial Community from Shotgun DNA Sequencing**
(A) A sample from a microbial community composed of four different microbial species. Colored cells (blue, red, green) indicate “known” species that have at least one genome sequence in reference databases. The green cell indicates a species that is rare within the microbial community. DNA contamination includes DNA from the host, laboratory environment, or experimental reagents. (B) DNA is extracted from the microbial cells in the sample. Extraction efficiency varies for different taxa, depending on the experimental protocol. The amount of DNA extracted per cell depends on growth rate—actively dividing cells yield more genomic DNA, which accumulates at the origin of replication. (C) Extracted DNA is broken into fragments by mechanical or enzymatic methods. Certain sequences are more likely to be breakpoints. (D) A library is prepared from DNA fragments and sequenced. DNA fragments with high or low GC% are under-represented in the sequencing reads. Typically millions of short (e.g., 150 bp) reads are generated per sample. (E) Bioinformatics quality-control steps may be performed to eliminate duplicate reads, trim low-quality bases from read ends, and remove reads from contamination sources or with low-quality scores. (F) To infer the composition of the microbial community, high-quality reads are either compared to reference sequences or assembled de novo. Reference-based classification cannot account for unknown species and overestimates the abundances of known species. Metagenomic assembly may not detect rare species and overestimates abundance of abundant species.

**Figure 2. Parameters Used for Taxonomic and Functional Profiling**
When computing the abundance of taxa and genes, it is important to think about what parameter of the underlying community one wishes to quantify. (A) A community with ten cells composed of three taxa with different subsets of four different gene families (colored arrows). Two cellular abundance parameters and four gene abundance parameters are defined by examples. (B) A comparison of gene relative abundance, average genomic copy number, and absolute abundance across three communities (top, middle, and bottom). The red gene is present at one copy per cell and has constant absolute abundance in all communities, but its relative abundance decreases with increasing genome size. The copy number of the blue gene increases with genome size, but its relative abundance is constant.
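
The distinction drawn in panel (B) can be made concrete with a small sketch. It assumes, for illustration only, that relative abundance is a gene family's share of all gene copies in the community and that average genomic copy number is copies per cell; the function name, the toy communities, and the `other` gene family are hypothetical and not taken from the figure.

```python
from collections import Counter

def gene_parameters(cells):
    """Summarise gene families for a community.

    cells: list of dicts, one per cell, mapping gene family -> copies
    in that cell's genome (a toy stand-in for real genomes).
    """
    total_copies = Counter()
    for genome in cells:
        total_copies.update(genome)          # add this cell's copies
    n_cells = len(cells)
    all_copies = sum(total_copies.values())  # every gene copy in the sample
    return {
        gene: {
            "absolute_abundance": copies,                   # copies in the sample
            "relative_abundance": copies / all_copies,      # share of all gene copies
            "avg_genomic_copy_number": copies / n_cells,    # copies per cell
        }
        for gene, copies in total_copies.items()
    }

# Three toy communities of ten identical cells with growing genome size.
communities = {
    "small genomes": [{"red": 1, "blue": 1, "other": 8}] * 10,
    "medium genomes": [{"red": 1, "blue": 2, "other": 17}] * 10,
    "large genomes": [{"red": 1, "blue": 4, "other": 35}] * 10,
}

if __name__ == "__main__":
    for name, cells in communities.items():
        params = gene_parameters(cells)
        print(name, params["red"], params["blue"])
```

In this toy setup the red gene keeps the same absolute abundance and copy number while its relative abundance shrinks as genomes grow, whereas the blue gene's copy number rises but its relative abundance stays flat.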

**Figure 3. Differences in Functional Profiles due to Read Length, Library Size, and Quality Control Are Small Compared to Biological Variation**
Publicly available metagenomes often differ in their library sizes, read lengths, and quality-control measures, which leads one to ask, how comparable are metagenomes from different studies? Twenty-six human gut metagenomes of varying quality were processed using different quality-control methods, and the resulting reads were used to estimate the relative abundance of KEGG Orthology Groups (KOs). We compared the variation introduced by these factors (top) with the variation observed between a large set of technical (N = 1,474), biological (N = 144), and non-replicate gut metagenomes (N = 179) from the Human Microbiome Project (Consortium, 2012) that contained at least one million reads (bottom). Trimming reads from their 5′ ends was done to simulate libraries of different read length; downsampling metagenomes by 95% was done to simulate libraries of different size; fastq-mcf (Aronesty, 2011) was used for de-duplication and quality filtering. To estimate the average genomic copy number of functional groups, reads were mapped to the integrated catalog of reference genes in the human gut microbiome (Li et al., 2014a, 2014b) using bowtie2 (Langmead and Salzberg, 2012) and normalized by the median coverage of 30 universal single-copy genes (Wu et al., 2013). The percent variation between two metagenomes was measured by the following: (1) taking the sum of absolute deviations across KOs, (2) dividing this by the total abundance of KOs in both metagenomes, and (3) multiplying this by 100.
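
The percent-variation measure and the copy-number normalization described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the dictionary-based profile format are assumptions, and a KO missing from one profile is treated as having abundance zero.

```python
from statistics import median

def average_copy_number(ko_coverage, marker_coverage):
    """Normalise KO coverage by the median coverage of single-copy marker genes.

    ko_coverage: dict mapping KO id -> read coverage.
    marker_coverage: coverage values of the universal single-copy genes.
    """
    m = median(marker_coverage)
    if m <= 0:
        raise ValueError("median marker coverage must be positive")
    return {ko: cov / m for ko, cov in ko_coverage.items()}

def percent_variation(profile_a, profile_b):
    """Percent variation between two KO abundance profiles.

    (1) sum absolute deviations across KOs,
    (2) divide by the total KO abundance in both profiles,
    (3) multiply by 100.
    """
    kos = set(profile_a) | set(profile_b)
    abs_dev = sum(abs(profile_a.get(k, 0.0) - profile_b.get(k, 0.0)) for k in kos)
    total = sum(profile_a.values()) + sum(profile_b.values())
    if total == 0:
        raise ValueError("both profiles are empty")
    return 100.0 * abs_dev / total

if __name__ == "__main__":
    a = {"K00001": 1.0, "K00002": 2.0}
    b = {"K00001": 1.0, "K00002": 1.0, "K00003": 1.0}
    print(percent_variation(a, b))
```

Under this definition identical profiles score 0 and profiles with no shared KOs score 100, which makes the values easy to compare across pairs of metagenomes.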

**Figure 4. The Presence of Duplicated Reads Is Largely a Function of Library Size and Microbial Diversity**
FASTQC was used to estimate the percent of duplicated reads across 181 human gut metagenomes from the Human Microbiome Project and compared to (A) library size and (B) species-level alpha diversity using the Shannon diversity index (Keylock, 2005). Species abundance of bacteria and archaea was estimated with mOTU (Sunagawa et al., 2013). Together, library size and Shannon diversity explain 63% of the variation in sequence duplication rates.
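
For readers unfamiliar with the Shannon diversity index used in panel (B), a minimal sketch follows. It assumes the natural logarithm and a plain list of species abundances as input; the caption does not state the log base, and the function name is hypothetical.

```python
import math

def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over species with p_i > 0.

    abundances: non-negative species abundances (counts or proportions).
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return -sum(
        (x / total) * math.log(x / total)
        for x in abundances
        if x > 0  # species with zero abundance contribute nothing
    )

if __name__ == "__main__":
    print(shannon_index([25, 25, 25, 25]))  # perfectly even community
    print(shannon_index([97, 1, 1, 1]))     # community dominated by one species
```

An even community of four species gives ln(4), the maximum for four species, while a community dominated by one species gives a value much closer to zero.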

**Figure 5. Growth of Shotgun Metagenome Data in the NCBI Sequence Read Archive**
Cumulative size in terabases of publicly available shotgun metagenomic data in the NCBI Sequence Read Archive (SRA). Sequencing runs were identified using the SRAdb database (Zhu et al., 2013) with the following filters: library_source = “METAGENOMIC”, study_type = “Metagenomics”, and library_strategy = “WGS”.
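
A query of this kind can be sketched against a toy SQLite table. Everything about the schema here is an assumption made for illustration: the table name `sra` and the `run_date` and `bases` columns are hypothetical, the three filter columns simply mirror the field names in the caption, and the real SRAdb metadata file may be organized differently.

```python
import sqlite3

# Hypothetical schema: one row per sequencing run.
QUERY = """
SELECT substr(run_date, 1, 4) AS year, SUM(bases) AS bases
FROM sra
WHERE library_source = ? AND study_type = ? AND library_strategy = ?
GROUP BY year
ORDER BY year
"""

def cumulative_terabases(conn, source="METAGENOMIC",
                         study="Metagenomics", strategy="WGS"):
    """Return [(year, cumulative terabases)] for runs matching the filters."""
    running = 0.0
    out = []
    for year, bases in conn.execute(QUERY, (source, study, strategy)):
        running += bases / 1e12  # bases -> terabases
        out.append((year, running))
    return out

def make_toy_db():
    """Build an in-memory table with made-up rows for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sra (run_date TEXT, bases INTEGER, library_source TEXT, "
        "study_type TEXT, library_strategy TEXT)"
    )
    rows = [
        ("2012-03-01", 2_000_000_000_000, "METAGENOMIC", "Metagenomics", "WGS"),
        ("2013-06-15", 3_000_000_000_000, "METAGENOMIC", "Metagenomics", "WGS"),
        ("2013-07-01", 9_000_000_000_000, "METAGENOMIC", "Metagenomics", "AMPLICON"),
    ]
    conn.executemany("INSERT INTO sra VALUES (?, ?, ?, ?, ?)", rows)
    return conn

if __name__ == "__main__":
    print(cumulative_terabases(make_toy_db()))
```

The amplicon run in the toy data is excluded by the `library_strategy` filter, so only the two shotgun runs contribute to the running total.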

Cited by

AsgeneDB: a curated orthology arsenic metabolism gene database and computational tool for metagenome annotation.
Song X, Li Y, Stirling E, Zhao K, Wang B, Zhu Y, Luo Y, Xu J, Ma B. NAR Genom Bioinform. 2022 Nov 1;4(4):lqac080. doi: 10.1093/nargab/lqac080. eCollection 2022 Dec. PMID: 36330044 Free PMC article.
Emergent Functional Organization of Gut Microbiomes in Health and Diseases.
Seppi M, Pasqualini J, Facchin S, Savarino EV, Suweis S. Biomolecules. 2023 Dec 20;14(1):5. doi: 10.3390/biom14010005. PMID: 38275746 Free PMC article.
Synthetic microbe communities provide internal reference standards for metagenome sequencing and analysis.
Hardwick SA, Chen WY, Wong T, Kanakamedala BS, Deveson IW, Ongley SE, Santini NS, Marcellin E, Smith MA, Nielsen LK, Lovelock CE, Neilan BA, Mercer TR. Nat Commun. 2018 Aug 6;9(1):3096. doi: 10.1038/s41467-018-05555-0. PMID: 30082706 Free PMC article.
Crewmember microbiome may influence microbial composition of ISS habitable surfaces.
Avila-Herrera A, Thissen J, Urbaniak C, Be NA, Smith DJ, Karouia F, Mehta S, Venkateswaran K, Jaing C. PLoS One. 2020 Apr 29;15(4):e0231838. doi: 10.1371/journal.pone.0231838. eCollection 2020. PMID: 32348348 Free PMC article.
An Expanded Gene Catalog of Mouse Gut Metagenomes.
Zhu J, Ren H, Zhong H, Li X, Zou Y, Han M, Li M, Madsen L, Kristiansen K, Xiao L. mSphere. 2021 Feb 24;6(1):e01119-20. doi: 10.1128/mSphere.01119-20. PMID: 33627510 Free PMC article.
