EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data

Christopher S Miller¹, Brett J Baker, Brian C Thomas, Steven W Singer, Jillian F Banfield

PMID: 21595876
PMCID: PMC3219967
DOI: 10.1186/gb-2011-12-5-r44

Genome Biol. 2011;12(5):R44. doi: 10.1186/gb-2011-12-5-r44. Epub 2011 May 19.

Affiliation

¹ Department of Earth and Planetary Science, University of California, Berkeley, 307 McCone Hall #4767, Berkeley, CA 94720, USA. csmiller@berkeley.edu

Abstract

Recovery of ribosomal small subunit genes by assembly of short read community DNA sequence data generally fails, making taxonomic characterization difficult. Here, we solve this problem with a novel iterative method, based on the expectation maximization algorithm, that reconstructs full-length small subunit gene sequences and provides estimates of relative taxon abundances. We apply the method to natural and simulated microbial communities, and correctly recover community structure from known and previously unreported rRNA gene sequences. An implementation of the method is freely available at https://github.com/csmiller/EMIRGE.
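The abundance-estimation half of an expectation maximization scheme can be illustrated with a minimal sketch. It is not the EMIRGE implementation: the function name `em_abundances`, the input layout, and the stopping tolerance are assumptions made for illustration, and the candidate sequences are held fixed rather than being corrected at each iteration as the method described above does.

```python
def em_abundances(likelihoods, n_iter=100, tol=1e-9):
    """Estimate relative abundances (prior probabilities) of candidate sequences.

    likelihoods[i][j] is an assumed, precomputed P(read i | candidate sequence j).
    Returns one prior per candidate sequence; the priors sum to 1.
    """
    n_seqs = len(likelihoods[0])
    priors = [1.0 / n_seqs] * n_seqs  # start from a uniform prior

    for _ in range(n_iter):
        # E-step: split each read across candidates in proportion to
        # prior * likelihood (the posterior that the read came from each one).
        totals = [0.0] * n_seqs
        for row in likelihoods:
            weighted = [p * l for p, l in zip(priors, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # read is incompatible with every candidate; skip it
            for j, w in enumerate(weighted):
                totals[j] += w / norm

        # M-step: the new prior of a candidate is its share of assigned reads.
        total = sum(totals)
        if total == 0.0:
            raise ValueError("no read is compatible with any candidate sequence")
        new_priors = [t / total for t in totals]

        # Stop once the abundance estimates no longer change.
        delta = max(abs(a - b) for a, b in zip(new_priors, priors))
        priors = new_priors
        if delta < tol:
            break

    return priors
```

With three reads that match only the first candidate and one read that matches only the second, `em_abundances([[1, 0], [1, 0], [1, 0], [0, 1]])` settles on priors of 0.75 and 0.25. Reads that are compatible with several candidates are shared between them according to the current priors, which is what lets the estimates shift over successive iterations.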

Figures

**Figure 1**
*De novo* assembly of SSU rRNA genes versus reconstruction of full-length gene sequences. **(a)** A section of the de Bruijn graph created by the short read assembler Velvet [29] for the natural microbial community. Each contig in the graph is represented by a rectangle whose width is proportional to contig length and whose height is proportional to contig k-mer coverage depth. Edge width reflects the multiplicity of overlapping k-mers shared by contigs. All contigs with BLAST matches to SSU genes recovered by EMIRGE were selected, and those contigs and additional contigs within three edges are shown. Contigs with BLAST matches to the SSU sequence from *Leptospirillum ferrodiazotrophum* [54] are shown in color. **(b)** The correct tiling of highlighted contigs from (a) is shown schematically with the EMIRGE-reconstructed SSU rRNA gene. **(c)** A selected region of the *L. ferrodiazotrophum* SSU gene shows the individual base probabilities at algorithm termination for each position in the reconstructed SSU gene. While most bases are highly confident, some positions show evidence for strain variants present in the population.
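The per-position base probabilities described in panel (c) suggest a simple way to read off a consensus sequence and flag candidate strain-variant positions. The sketch below is an assumption-laden illustration, not EMIRGE code: the name `call_consensus`, the dictionary-per-position input, and the 0.9 confidence threshold are all hypothetical.

```python
def call_consensus(base_probs, min_conf=0.9):
    """Pick the most probable base at each position and flag uncertain ones.

    base_probs is a list with one dict per position, mapping each base
    ("A", "C", "G", "T") to its probability at that position.
    min_conf is an illustrative threshold, not a value from the paper.
    Returns (consensus_string, list_of_low_confidence_positions).
    """
    consensus = []
    low_confidence = []
    for pos, probs in enumerate(base_probs):
        best = max(probs, key=probs.get)  # most probable base at this position
        consensus.append(best)
        if probs[best] < min_conf:
            low_confidence.append(pos)    # possible strain variant
    return "".join(consensus), low_confidence
```

For example, a position with probabilities A = 0.55 and G = 0.45 would be called as A but reported as low confidence, which is the kind of site the caption describes as showing evidence for strain variants.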

**Figure 2**
**Convergence of reconstructed SSU sequences and abundance estimates**. **(a-d)** Algorithm convergence for both the simulated simple microbial community (a, b) and natural community (c, d) is shown. In (a, c), the number of nucleotide (nt) changes made in all reconstructed SSU sequences is plotted for each iteration. In (b, d), each line represents a different reconstructed SSU sequence: the prior probability (abundance estimate) of each SSU sequence is plotted for each iteration. Only SSU sequences with ≥ 1% prior probability at convergence are shown.

**Figure 3**
**Community composition is captured by correctly reconstructed full-length SSU sequences**. **(a, b)** Phylogenetic trees showing algorithm-reconstructed sequences (black diamonds) and their best BLAST hits, for both the simulated simple **(a)** and natural **(b)** microbial communities. Reconstructed sequences are presented with their (arbitrary) algorithm-assigned identifier and their prior probability, which serves as an abundance estimate, after the final round. All reconstructed sequences match the expected organism in the simulated community (a), and all but two sequences match metagenomic contigs assembled from traditional Sanger sequencing in the natural community (b). The two novel sulfobacilli in the natural community are presented with their closest BLAST hit in GenBank. Units are base substitutions per site, and bootstrap values ≥ 50 are shown at the branches.

**Figure 4**
**SSU abundance estimates are accurate**. For the nine most abundant reconstructed sequences in the simulated simple community, the final prior probability estimated by EMIRGE is plotted against the expected SSU abundances from the associated community members. The estimated abundances closely match the expected abundances (Pearson ρ = 0.998, P-value = 8.5e-10).
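The Pearson correlation quoted in this caption compares estimated with expected abundances. A minimal sketch of that calculation follows; the function name `pearson_r` is hypothetical, and the code takes no data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson_r(estimated, expected)` with one value per reconstructed sequence gives a coefficient near 1 when the estimates track the expected abundances linearly.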

**Figure 5**
**Accurate SSU sequences and abundance estimates are recovered by EMIRGE for a complex microbial community**. Using reads from the complex simulated community, full-length SSU genes reconstructed by EMIRGE with estimated abundances of > 0.5% were aligned and placed in a phylogenetic tree with the expected community members. Estimated EMIRGE sequences and relative abundances (blue) correspond in most cases to expected sequences and expected abundances (red). Grey circles on branches indicate bootstrap values > 80.

**Figure 6**
**Effect of sequencing library characteristics on EMIRGE performance**. The effects of sequencing effort (x axis), read length, and insert size were evaluated by running EMIRGE on the complex community with varying input. Reconstructed communities were compared to the expected community with the weighted UniFrac distance metric [30]. For the varying insert size experiment, a single sequencing effort was chosen (76-bp read length; 80,000 genomic reads; see Materials and methods).

**Figure 7**
**Validation of the presence of *Sulfobacillus* in the natural community**. Fluorescent *in situ* hybridization with a *Sulfobacillus*-specific probe (red) shows that *Sulfobacillus* is present in the natural community, as predicted by EMIRGE. The generic DNA stain DAPI is shown in blue, and *Sulfobacillus* cells with both the specific probe and DAPI staining appear purple. Scale bar: 5 μm.

References

1. Pace NR. A molecular view of microbial diversity and the biosphere. Science. 1997;276:734–740. doi: 10.1126/science.276.5313.734.
2. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ. Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci USA. 2006;103:12115–12120. doi: 10.1073/pnas.0605127103.
3. Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML. Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet. 2008;4:e1000255. doi: 10.1371/journal.pgen.1000255.
4. Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci USA. 2010;108:4516–4522.
5. Lazarevic V, Whiteson K, Huse S, Hernandez D, Farinelli L, Osteras M, Schrenzel J, Francois P. Metagenomic study of the oral microbiota by Illumina high-throughput sequencing. J Microbiol Methods. 2009;79:266–271. doi: 10.1016/j.mimet.2009.09.012.
