Genome Biol. 2011;12(5):R44.
doi: 10.1186/gb-2011-12-5-r44. Epub 2011 May 19.

EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data

Christopher S Miller et al. Genome Biol. 2011.

Abstract

Recovery of ribosomal small subunit genes by assembly of short read community DNA sequence data generally fails, making taxonomic characterization difficult. Here, we solve this problem with a novel iterative method, based on the expectation maximization algorithm, that reconstructs full-length small subunit gene sequences and provides estimates of relative taxon abundances. We apply the method to natural and simulated microbial communities, and correctly recover community structure from known and previously unreported rRNA gene sequences. An implementation of the method is freely available at https://github.com/csmiller/EMIRGE.
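The abundance-estimation side of an expectation maximization approach can be illustrated with a minimal sketch. It is not the EMIRGE implementation: the function name `em_abundances`, the toy likelihood matrix, and the stopping tolerance are all assumptions made for illustration, and the sketch leaves out the step in which the candidate sequences themselves are revised between iterations. It only shows how prior probabilities (abundance estimates) for a fixed set of candidate sequences can be updated from per-read likelihoods.

```python
def em_abundances(likelihoods, n_iter=200, tol=1e-9):
    """Estimate relative abundances of candidate sequences with EM.

    likelihoods[i][j] is an assumed, precomputed value of
    P(read i | candidate sequence j). Returns one prior per candidate.
    """
    n_seqs = len(likelihoods[0])
    priors = [1.0 / n_seqs] * n_seqs  # start from a uniform prior

    for _ in range(n_iter):
        # E-step: split each read across candidates in proportion to
        # prior * likelihood (the posterior responsibility).
        counts = [0.0] * n_seqs
        for row in likelihoods:
            weights = [p * l for p, l in zip(priors, row)]
            total = sum(weights)
            if total == 0.0:
                continue  # read is incompatible with every candidate
            for j, w in enumerate(weights):
                counts[j] += w / total

        assigned = sum(counts)
        if assigned == 0.0:
            raise ValueError("no read is compatible with any candidate")

        # M-step: new priors are the normalized expected read counts.
        new_priors = [c / assigned for c in counts]
        delta = max(abs(a - b) for a, b in zip(priors, new_priors))
        priors = new_priors
        if delta < tol:
            break
    return priors


# Toy input (hypothetical): three reads match only candidate 0, one read
# matches only candidate 1, and one read matches both equally well.
toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
estimated = em_abundances(toy)
```

In this toy case the ambiguous read is shared between the two candidates according to the current priors, so the estimates settle where the unambiguous reads point, at roughly three quarters and one quarter.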

Figures

Figure 1
De novo assembly of SSU rRNA genes versus reconstruction of full-length gene sequences. (a) A section of the de Bruijn graph created by the short read assembler Velvet [29] for the natural microbial community. Each contig in the graph is represented by a rectangle whose width is proportional to contig length and whose height is proportional to contig k-mer coverage depth. Edge width reflects the multiplicity of overlapping k-mers shared by contigs. All contigs with BLAST matches to SSU genes recovered by EMIRGE were selected, and those contigs and additional contigs within three edges are shown. Contigs with BLAST matches to the SSU sequence from Leptospirillum ferrodiazotrophum [54] are shown in color. (b) The correct tiling of highlighted contigs from (a) is shown schematically with the EMIRGE-reconstructed SSU rRNA gene. (c) A selected region of the L. ferrodiazotrophum SSU gene shows the individual base probabilities at algorithm termination for each position in the reconstructed SSU gene. While most bases are highly confident, some positions show evidence for strain variants present in the population.
Figure 2
Convergence of reconstructed SSU sequences and abundance estimates. (a-d) Algorithm convergence for both the simulated simple microbial community (a, b) and natural community (c, d) is shown. In (a, c), the number of nucleotide (nt) changes made in all reconstructed SSU sequences is plotted for each iteration. In (b, d), each line represents a different reconstructed SSU sequence: the prior probability (abundance estimate) of each SSU sequence is plotted for each iteration. Only SSU sequences with ≥ 1% prior probability at convergence are shown.
Figure 3
Community composition is captured by correctly reconstructed full-length SSU sequences. (a, b) Phylogenetic trees showing algorithm-reconstructed sequences (black diamonds) and their best BLAST hits, for both the simulated simple (a) and natural (b) microbial communities. Reconstructed sequences are presented with their (arbitrary) algorithm-assigned identifier and their prior probability, which serves as an abundance estimate, after the final round. All reconstructed sequences match the expected organism in the simulated community (a), and all but two sequences match metagenomic contigs assembled from traditional Sanger sequencing in the natural community (b). The two novel sulfobacilli in the natural community are presented with their closest BLAST hit in GenBank. Units are base substitutions per site, and bootstrap values ≥ 50 are shown at the branches.
Figure 4
SSU abundance estimates are accurate. For the nine most abundant reconstructed sequences in the simulated simple community, the final prior probability estimated by EMIRGE is plotted against the expected SSU abundances from the associated community members. The algorithm recovers the expected abundances excellently (Pearson ρ = 0.998, P-value = 8.5e-10).
Figure 5
Accurate SSU sequences and abundance estimates are recovered by EMIRGE for a complex microbial community. Using reads from the complex simulated community, full-length SSU genes reconstructed by EMIRGE with estimated abundances of > 0.5% were aligned and placed in a phylogenetic tree with the expected community members. Estimated EMIRGE sequences and relative abundances (blue) correspond in most cases to expected sequences and expected abundances (red). Grey circles on branches indicate bootstrap values > 80.
Figure 6
Effect of sequencing library characteristics on EMIRGE performance. The effects of sequencing effort (x axis), read length, and insert size were evaluated by running EMIRGE on the complex community with varying input. Reconstructed communities were compared to the expected community with the weighted UniFrac distance metric [30]. For the varying insert size experiment, a single sequencing effort was chosen (76-bp read length; 80,000 genomic reads; see Materials and methods).
Figure 7
Validation of the presence of Sulfobacillus in the natural community. Fluorescent in situ hybridization with a Sulfobacillus-specific probe (red) shows that Sulfobacillus is present in the natural community, as predicted by EMIRGE. The generic DNA stain DAPI is shown in blue, and Sulfobacillus cells with both the specific probe and DAPI staining appear purple. Scale bar: 5 μm.

References

    1. Pace NR. A molecular view of microbial diversity and the biosphere. Science. 1997;276:734–740. doi: 10.1126/science.276.5313.734.
    2. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ. Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci USA. 2006;103:12115–12120. doi: 10.1073/pnas.0605127103.
    3. Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML. Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet. 2008;4:e1000255. doi: 10.1371/journal.pgen.1000255.
    4. Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci USA. 2010;108:4516–4522.
    5. Lazarevic V, Whiteson K, Huse S, Hernandez D, Farinelli L, Osteras M, Schrenzel J, Francois P. Metagenomic study of the oral microbiota by Illumina high-throughput sequencing. J Microbiol Methods. 2009;79:266–271. doi: 10.1016/j.mimet.2009.09.012.
