Nucleic Acids Res. 2018 Jun 1;46(10):e59.
doi: 10.1093/nar/gky174.

GraftM: a tool for scalable, phylogenetically informed classification of genes within metagenomes

Joel A Boyd et al. Nucleic Acids Res.

Abstract

Large-scale metagenomic datasets enable the recovery of hundreds of population genomes from environmental samples. However, these genomes do not typically represent the full diversity of complex microbial communities. Gene-centric approaches can be used to gain a comprehensive view of diversity by examining each read independently, but traditional pairwise comparison approaches typically over-classify taxonomy and scale poorly with increasing metagenome and database sizes. Here we introduce GraftM, a tool that uses gene-specific packages (gpkgs) to rapidly identify gene families in metagenomic data using hidden Markov models (HMMs) or DIAMOND databases, and classifies these sequences by placing them into pre-constructed gene trees. The speed and accuracy of GraftM were benchmarked with in silico and in vitro mock communities using taxonomic markers, and GraftM was found to have higher accuracy at the family level with a processing time 2.0-3.7× faster than currently available software. Exploration of a wetland metagenome using 16S rRNA- and methyl-coenzyme M reductase (McrA)-specific gpkgs revealed taxonomic and functional shifts across a depth gradient. Analysis of the NCBI nr database using the McrA gpkg allowed the detection of novel sequences belonging to phylum-level lineages. A growing collection of gpkgs is available online (https://github.com/geronimp/graftM_gpkgs), where curated packages can be uploaded and exchanged.

Figures

Figure 1.
Schematic of the GraftM pipeline, outlining the create, search and classify stages. Within the search step, the red arrow indicates the amino acid pipeline and the blue arrow indicates the nucleic acid pipeline.
Figure 2.
(A) Recovery of true positives by the HMMsearch and DIAMOND search methods on the in silico and in vitro mock datasets. (B) Distribution of true positive reads along the gene locus, using ribosomal protein S7 as an example. (C) False positive rate, measured as false positives per true positive, for each ribosomal protein and the 16S rRNA gene.
Figure 3.
Classification accuracy of 150 bp reads from the in silico and in vitro mock communities. (A) Accuracy of phylogenetic classification using the 16S rRNA gpkg. (B) Accuracy of phylogenetic classification using 15 ribosomal protein gpkgs for GraftM and metAnnotate.
Figure 4.
(A) Krona plots showing the order-level classification of partial and full-length McrA sequences identified by the HMMSEARCH+pplacer and DIAMOND pipelines from the NCBI nr database. (B) Percentage of reads left classified at each taxonomic rank by each pipeline. (C) Maximum-likelihood tree of full-length McrA sequences from the McrA gpkg and partial/full-length McrA sequences identified in the NCBI nr database that were unclassified by GraftM. Support values from 100 bootstrap replicates of ≥50% are indicated as white circles, ≥75% as gray circles and ≥90% as black circles. Clades are labeled according to the lowest common ancestor of the lineages within. Red clades indicate lineages originating from the NCBI nr database that were not classified to a phylum by GraftM.
Figure 5.
Shifts in microbial community structure across a permafrost active layer metagenome. (A) A heatmap of the log10-transformed relative abundance of the microbial community, clustered at the order level using the 16S rRNA gene gpkg. (B) Taxonomic composition as determined by the McrA gpkg. (C) Composition of the community as defined by a manually refined version of the McrA tree, annotated with the main methanogenic substrates used by each lineage.
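
The Figure 4C caption notes that clades are labeled by the lowest common ancestor of the lineages they contain. The sketch below is one minimal way to compute such a label, assuming each lineage is a semicolon-delimited taxonomy string ordered from domain downward. The function names and the example strings are hypothetical illustrations, not taken from GraftM or from the paper's data.

```python
from typing import List, Sequence


def parse_lineage(taxonomy: str) -> List[str]:
    """Split a semicolon-delimited taxonomy string into a list of rank labels."""
    return [rank.strip() for rank in taxonomy.split(";") if rank.strip()]


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest prefix of rank labels shared by every lineage."""
    if not lineages:
        return []
    shared: List[str] = []
    # zip stops at the shortest lineage, so the result never goes deeper
    # than the least resolved member of the clade.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical clade members (illustrative strings only).
clade = [
    "Archaea; Euryarchaeota; Methanomicrobia; Methanosarcinales; Methanosarcinaceae",
    "Archaea; Euryarchaeota; Methanomicrobia; Methanosarcinales; Methanosaetaceae",
    "Archaea; Euryarchaeota; Methanomicrobia; Methanomicrobiales",
]
label = lowest_common_ancestor([parse_lineage(t) for t in clade])
print("; ".join(label))  # Archaea; Euryarchaeota; Methanomicrobia
```

A clade whose members disagree at the first rank gets an empty label under this scheme, which loosely corresponds to a lineage left unclassified at the phylum level.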
