An introduction to the analysis of shotgun metagenomic data

Thomas J Sharpton¹

PMID: 24982662
PMCID: PMC4059276
DOI: 10.3389/fpls.2014.00209

Review

Thomas J Sharpton. Front Plant Sci. 2014 Jun 16;5:209. doi: 10.3389/fpls.2014.00209. eCollection 2014.

Affiliation

¹ Department of Microbiology and Department of Statistics, Oregon State University Corvallis, OR, USA.

Abstract

Environmental DNA sequencing has revealed the expansive biodiversity of microorganisms and clarified the relationship between host-associated microbial communities and host phenotype. Shotgun metagenomic DNA sequencing is a relatively new and powerful environmental sequencing approach that provides insight into community biodiversity and function. However, the analysis of metagenomic sequences is complicated by the complex structure of the data. Fortunately, new tools and data resources have been developed to circumvent these complexities and allow researchers to determine which microbes are present in the community and what they might be doing. This review describes the analytical strategies and specific tools that can be applied to metagenomic data and the considerations and caveats associated with their use. Specifically, it documents how metagenomes can be analyzed to quantify community structure and diversity, assemble novel genomes, identify new taxa and genes, and determine which metabolic pathways are encoded in the community. It also discusses several methods that can be used to compare metagenomes to identify taxa and functions that differentiate communities.

Keywords: bioinformatics; host–microbe interactions; metagenome; microbial diversity; microbiome; microbiota; review.

Figures

**FIGURE 1**
**Common metagenomic analytical strategies**. This methodological workflow illustrates a typical metagenomic analysis. First, shotgun metagenomic data is generated from a microbial community of interest. After conducting quality control procedures, metagenomic sequences can be subject to various analyses centered on the taxonomic and functional characterization of the community (gray box). These procedures are the focus of this review. Briefly, marker gene, binning, and assembly analyses provide insight into the taxonomic or phylogenetic diversity of the community and can identify novel taxa or genomes. Metagenomes can also be subject to gene prediction and functional annotation, which can be used to characterize the biological functions associated with the community and identify novel genes. The results of these various analyses can be compared to those obtained through analysis of other metagenomes to quantify the similarity between communities, determine how community diversity scales with environmental covariates (i.e., community metadata), and identify taxa and functions that stratify communities of various types (i.e., biomarker detection).
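The comparison step described in this caption can be illustrated with a minimal sketch, assuming each metagenome has already been reduced to a table of read counts per taxon. The function names and the choice of the Shannon index and Bray-Curtis dissimilarity are illustrative assumptions, not methods prescribed by the review; they are simply two widely used measures of within-community diversity and between-community dissimilarity.

```python
import math


def shannon_diversity(counts):
    """Shannon index (natural log) from a dict mapping taxon -> read count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts.values():
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity computed on relative abundances.

    Converting counts to proportions first is one simple (assumed) way to
    keep differences in sequencing depth from dominating the comparison.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("each profile needs at least one classified read")
    taxa = set(counts_a) | set(counts_b)
    shared = sum(
        min(counts_a.get(t, 0) / total_a, counts_b.get(t, 0) / total_b)
        for t in taxa
    )
    return 1.0 - shared


if __name__ == "__main__":
    # Hypothetical taxon counts for two communities.
    gut = {"Bacteroides": 600, "Faecalibacterium": 300, "Escherichia": 100}
    soil = {"Streptomyces": 500, "Bacillus": 400, "Escherichia": 100}
    print(shannon_diversity(gut), shannon_diversity(soil))
    print(bray_curtis(gut, soil))
```

A value of 0 from `bray_curtis` means the two profiles have identical relative abundances, and 1 means they share no taxa.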

**FIGURE 2**
**Analytical strategies to determine which taxa are present in a metagenome**. A metagenome (colored lines, left) can be subject to three general analytical strategies that ultimately produce a profile of the taxa, phylogenetic lineages, or genomes present in the community. *Marker gene analyses* involve comparing each read to a reference database of taxonomically or phylogenetically informative sequences (i.e., marker genes), using a classification algorithm to determine if the read is a homolog of a marker gene, and annotating classified reads based on their similarity across marker gene sequences. There are several methods for *binning* metagenomes, including (1) compositional binning, which uses sequence composition to classify or cluster metagenomic reads into taxonomic groups, (2) similarity binning, which classifies a read into a taxonomic or phylogenetic group based on its similarity to previously identified genes or proteins, and (3) fragment recruitment, wherein reads are aligned to nearly identical genome sequences to produce metagenomic coverage estimates of the genome. Finally, sequences can be subject to *assembly*, wherein reads that share nearly identical sequence at their ends are merged to create contigs, which can subsequently be assembled into supercontigs or complete genomes.
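Compositional binning, as described in this caption, relies on a numerical summary of sequence composition. The sketch below shows one common choice, a normalized tetranucleotide (k = 4) frequency vector, together with a distance between two such vectors. It is an illustration under stated assumptions only: the function names are hypothetical, and real binning tools typically also merge reverse-complement k-mers and apply a trained classifier or clustering method rather than a raw distance.

```python
import math
from itertools import product


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies in a fixed (lexicographic) order.

    Windows containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(p, q):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


if __name__ == "__main__":
    # Hypothetical sequences: two AT-rich fragments and one GC-rich fragment.
    at_rich_1 = "ATATTAAATTATATAAATTTATAT"
    at_rich_2 = "TTATATAAATATTTAATATATAAT"
    gc_rich = "GCGGCCGCGCCGGGCGCGCCGCGG"
    print(euclidean(kmer_profile(at_rich_1), kmer_profile(at_rich_2)))
    print(euclidean(kmer_profile(at_rich_1), kmer_profile(gc_rich)))
```

Fragments with similar composition yield nearby vectors, which is the signal a compositional method would use to group them.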

**FIGURE 3**
**A metagenomic functional annotation workflow**. A metagenome (colored lines, left) can be annotated by subjecting each read to gene prediction and functional annotation. In *gene prediction*, various algorithms can be used to identify subsequences in a metagenomic read (blue line) that may encode proteins (gray bars). In some situations, coding sequences may start (arrow) or stop (asterisk) upstream or downstream of the read, resulting in partial gene predictions. Each predicted protein can then be subject to *functional annotation*, wherein it is compared to a database of protein families. Predicted peptides that are classified as homologs of the family are annotated with the family’s function. Conducting this analysis across all reads results in a community functional diversity profile. As discussed in the main text, there are alternative annotation strategies and variations on this general procedure.
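The idea of partial gene predictions in this caption can be made concrete with a naive open reading frame (ORF) scan. This is a minimal sketch only, assuming the standard stop codons and ATG as the sole start codon; it is not one of the gene prediction algorithms the review covers, which use statistical models of coding sequence. All names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def _segment_to_orf(seq, seg_start, seg_end, open_upstream, has_stop):
    """Turn a stop-free stretch of codons into (start, end, has_start, has_stop).

    seg_end is exclusive and includes the stop codon when has_stop is True.
    """
    if open_upstream:
        # No stop seen yet in this frame: the gene may begin before the read.
        return (seg_start, seg_end, False, has_stop)
    for i in range(seg_start, seg_end, 3):
        if seq[i:i + 3] == "ATG":
            return (i, seg_end, True, has_stop)
    return None


def find_orfs(read, min_codons=10):
    """Return (strand, frame, start, end, has_start, has_stop) tuples.

    Coordinates on the '-' strand refer to the reverse-complemented read.
    has_start=False or has_stop=False marks a partial prediction.
    """
    read = read.upper()
    results = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for frame in range(3):
            last = frame + ((len(seq) - frame) // 3) * 3  # end of last full codon
            candidates = []
            seg_start, open_upstream = frame, True
            for i in range(frame, last, 3):
                if seq[i:i + 3] in STOPS:
                    candidates.append(
                        _segment_to_orf(seq, seg_start, i + 3, open_upstream, True))
                    seg_start, open_upstream = i + 3, False
            if seg_start < last:
                # Trailing stretch with no stop: the gene may run past the read.
                candidates.append(
                    _segment_to_orf(seq, seg_start, last, open_upstream, False))
            for orf in candidates:
                if orf is None:
                    continue
                start, end, has_start, has_stop = orf
                coding = (end - start) // 3 - (1 if has_stop else 0)
                if coding >= min_codons:
                    results.append((strand, frame, start, end, has_start, has_stop))
    return results


if __name__ == "__main__":
    # Hypothetical read: one complete ORF followed by one truncated at the 3' end.
    read = "TAGATGAAACCCGGGTTTTGAATGCCCAAA"
    for orf in find_orfs(read, min_codons=3):
        print(orf)
```

The `has_start` and `has_stop` flags correspond to the arrow and asterisk in the figure: when either is `False`, the prediction is partial because the coding sequence extends beyond the read.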
