Nat Biotechnol. 2018 Oct 15:10.1038/nbt.4266.
doi: 10.1038/nbt.4266. Online ahead of print.

High-quality genome sequences of uncultured microbes by assembly of read clouds

Alex Bishara et al. Nat Biotechnol.

Abstract

Although shotgun metagenomic sequencing of microbiome samples enables partial reconstruction of strain-level community structure, obtaining high-quality microbial genome drafts without isolation and culture remains difficult. Here, we present an application of read clouds, short-read sequences tagged with long-range information, to microbiome samples. We present Athena, a de novo assembler that uses read clouds to improve metagenomic assemblies. We applied this approach to sequence stool samples from two healthy individuals and compared it with existing short-read and synthetic long-read metagenomic sequencing techniques. Read-cloud metagenomic sequencing and Athena assembly produced the most comprehensive individual genome drafts with high contiguity (>200-kb N50, fewer than ten contigs), even for bacteria with relatively low (20×) raw short-read-sequence coverage. We also sequenced a complex marine-sediment sample and generated 24 intermediate-quality genome drafts (>70% complete, <10% contaminated), nine of which were complete (>90% complete, <5% contaminated). Our approach allows for culture-free generation of high-quality microbial genome drafts by using a single shotgun experiment.
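The contiguity figure quoted above (>200-kb N50) uses the standard N50 statistic: the contig length at which contigs of that length or longer account for at least half of the assembled bases. A minimal sketch of that calculation follows; the function name and the toy contig lengths are illustrative assumptions, not values from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty).

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that pushes the running sum to >= 50%.
        if 2 * running >= total:
            return length
    return 0


# Hypothetical draft with four contigs (lengths in bases).
toy_draft = [250_000, 220_000, 90_000, 40_000]
print(n50(toy_draft))
```

For the hypothetical draft above, the two longest contigs already cover more than half of the 600 kb total, so the N50 is the length of the second contig.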

Conflict of interest statement

Competing financial interests

S.B. is an employee of and owns stock in Illumina. Shotgun sequencing products developed, marketed and/or sold by Illumina were used in this manuscript.

Figures

Figure 1. Overview of the read cloud shotgun sequencing and assembly approach
a) DNA is first extracted from microbiome samples and size selected to enrich for long DNA fragments. The long fragments are then diluted and sparsely partitioned across more than a million droplet partitions (using, for example, the 10X Genomics Chromium library preparation platform). Degenerate amplification of these long fragments is then performed within the partitions to obtain barcoded traditional libraries, each with a barcode unique to its partition. These libraries are then pooled and sequenced on an Illumina instrument. b) The Athena assembler uses read clouds to yield more complete drafts in which genomic repeats are also accurately placed. An example repeat that is resolved and placed by Athena is shown in orange. 1) Read clouds are first assembled with standard short-read techniques to obtain seed contigs, input reads are mapped back to these seed contigs, and read pairs that span two seed contigs are used to build a scaffold graph containing unresolvable branches. 2) At each edge, Athena proposes a much simpler subassembly problem on a pooled subset of barcoded reads informed by the scaffold graph mappings. Example short reads with red and blue barcodes are passed to a short-read assembler to perform subassembly, which yields a longer subassembled contig that disambiguates branches in the scaffold graph. 3) The resulting subassembled contigs, together with the initial seed contigs, are then passed as reads to the long-read, De Bruijn graph-based assembler Flye for final assembly. The resulting metagenome assembly yields more complete and more contiguous drafts in which repeats are also assembled and correctly placed.
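Steps 1 and 2 of panel b can be illustrated with a toy sketch: read pairs whose mates map to different seed contigs define scaffold-graph edges, and each edge is given a pool of barcodes from which reads would be drawn for subassembly. This is not Athena's implementation. The input layout, the `min_support` threshold, and the choice to pool barcodes seen on both contigs of an edge are assumptions made only for illustration.

```python
from collections import defaultdict


def scaffold_edges_with_barcode_pools(read_pairs, min_support=2):
    """Toy scaffold-graph construction from mapped, barcoded read pairs.

    read_pairs: iterable of (barcode, contig_of_mate1, contig_of_mate2).
    Returns {(contig_a, contig_b): set_of_barcodes} for every contig pair
    linked by at least `min_support` spanning read pairs.
    The barcode pool for an edge is (by assumption) the set of barcodes
    observed on both of its seed contigs.
    """
    support = defaultdict(int)      # spanning-pair count per edge
    barcodes_on = defaultdict(set)  # barcodes observed on each seed contig
    for barcode, contig_1, contig_2 in read_pairs:
        barcodes_on[contig_1].add(barcode)
        barcodes_on[contig_2].add(barcode)
        if contig_1 != contig_2:
            # Undirected edge: store contig names in sorted order.
            support[tuple(sorted((contig_1, contig_2)))] += 1
    return {
        edge: barcodes_on[edge[0]] & barcodes_on[edge[1]]
        for edge, count in support.items()
        if count >= min_support
    }


# Hypothetical mappings: two pairs span seedA/seedB, one spans seedB/seedC.
toy_pairs = [
    ("BC1", "seedA", "seedA"),
    ("BC1", "seedA", "seedB"),
    ("BC2", "seedA", "seedB"),
    ("BC2", "seedB", "seedB"),
    ("BC3", "seedB", "seedC"),
    ("BC4", "seedC", "seedC"),
]
for edge, pool in scaffold_edges_with_barcode_pools(toy_pairs).items():
    print(edge, sorted(pool))
```

In a real pipeline the reads carrying the pooled barcodes would then be handed to a short-read assembler for the subassembly described in step 2; that step is omitted here.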
Figure 2. Composition of stool microbiome communities from two healthy human participants.
a, b) Relative abundances of genera as determined by short-read classification for each of the three libraries from samples P1 and P2. The relative representation of genera appears fairly concordant among the three library preparation methods (read cloud, SLR, short read) for each sample. Sample P1 is more diverse than sample P2 at the genus level. c, d) Comparisons of genome draft contiguity, as measured by N50, for taxa that were present in samples P1 and P2. The read cloud approach results in a larger number of more contiguous genome drafts than the short read or SLR approaches. Results are only displayed for the largest bin of each taxon determined to be present. The completeness and contamination of genome drafts for these taxa were determined by assessing the presence of lineage-specific single copy core genes as predicted by checkM. Genome drafts were designated as incomplete (‘x’, <90% completeness), complete (circle, >90% completeness and <5% contamination), or high quality (triangle, complete and with at least 18 tRNAs, as well as at least one of each of the 5S, 16S, and 23S rRNA genes). Read cloud sequencing and assembly produces many high-quality and complete drafts. The read cloud drafts are much more contiguous than those obtained from SLR and short read sequencing.
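The three draft designations in this caption amount to a small decision rule, sketched below. The function and parameter names are hypothetical. The caption does not say how a draft with >90% completeness but 5% or more contamination is labeled, so this sketch assumes it falls back to "incomplete".

```python
def classify_draft(completeness, contamination, n_trnas,
                   has_5s, has_16s, has_23s):
    """Label a genome draft using the thresholds stated in the caption.

    completeness and contamination are percentages (0-100), for example
    as estimated from lineage-specific single copy core genes.
    """
    is_complete = completeness > 90 and contamination < 5
    if not is_complete:
        # Assumption: anything failing the 'complete' test is 'incomplete',
        # including well-covered but contaminated drafts.
        return "incomplete"
    if n_trnas >= 18 and has_5s and has_16s and has_23s:
        return "high quality"
    return "complete"


# Hypothetical drafts, not taken from the paper.
print(classify_draft(95.2, 1.1, 20, True, True, True))
print(classify_draft(95.2, 1.1, 12, True, True, True))
print(classify_draft(72.0, 3.0, 20, True, True, True))
```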
Figure 3. Combined genome draft results of read cloud, SLR, and short read approaches applied to healthy human stool samples.
Under various performance metrics, read clouds (gold) consistently produce more complete and high-quality genome drafts than either the SLR (blue) or short-read (green) approach. Performance was superior even in low short read coverage regimes (defined as <50x coverage). Counts include all complete/high-quality genome bins for all taxa in each approach. a) Number of complete genome bins (>90% completeness, <5% contamination) with a minimum N50. b) Number of complete genome bins with a minimum short read coverage depth. Genome bins with lower short read coverage correspond to less abundant organisms. c) Number of complete genome bins with an N50 of >200kb and a minimum short read coverage depth. d) Number of high-quality genome bins (complete and with at least 18 tRNAs, as well as at least one instance each of the 5S, 16S, and 23S rRNA genes) with a minimum N50. e) Number of high-quality genome bins with a minimum short read coverage depth. f) Number of high-quality genome bins with an N50 of >200kb and a minimum short read coverage depth.
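Each panel counts the bins that clear a minimum N50, a minimum short read coverage depth, or both. A minimal sketch of that tally is given below, assuming each bin has already passed the complete or high-quality filter and is represented as a dict with hypothetical `n50` and `coverage` keys. Thresholds are treated as inclusive here, which is an assumption.

```python
def count_bins_meeting(bins, min_n50=0, min_coverage=0.0):
    """Count bins whose N50 and coverage both reach the given minimums."""
    return sum(
        1 for b in bins
        if b["n50"] >= min_n50 and b["coverage"] >= min_coverage
    )


# Hypothetical complete bins (N50 in bases, coverage as fold depth).
toy_bins = [
    {"n50": 450_000, "coverage": 120.0},
    {"n50": 210_000, "coverage": 22.0},
    {"n50": 80_000, "coverage": 35.0},
    {"n50": 30_000, "coverage": 300.0},
]

# One curve of the kind shown in panel a: count versus minimum N50.
for threshold in (0, 50_000, 100_000, 500_000):
    print(threshold, count_bins_meeting(toy_bins, min_n50=threshold))
```

Sweeping `min_coverage` instead, or fixing `min_n50` at 200 kb while sweeping coverage, gives the analogous tallies for the other panels.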
Figure 4. Completeness of genome bins produced by read cloud, SLR, and short read sequencing for various taxa present in healthy human stool samples.
Read clouds (gold) consistently yield more complete and high-quality genome drafts for taxa within singleton bins, as compared to SLR (blue) and short read sequencing (green), both of which split sequence contigs from single genomes into two or more genome bins. Taxa are only shown if represented in at least two approaches and at least one approach produced a complete bin. a) Counts of the number of bins containing sequence for each taxon for each of the three approaches. Read clouds produced the most singleton bins for the taxa considered. b) Counts of complete and high-quality drafts for each approach. Read clouds produced the most complete genome drafts in singleton bins (14). Ten of the 14 singleton bin complete genome drafts were designated as high quality. c) For each approach, the total number of genome bins annotated as belonging to a particular taxon. The largest bin produced by an approach for a particular taxon is designated as an incomplete (‘x’), complete (circle), or high-quality (triangle) genome draft. For nearly all taxa that received a complete or high-quality genome draft from a particular approach, only a single genome bin was annotated as belonging to these taxa. However, for some taxa, such as Escherichia coli and Clostridiales bacterium, these complete or high-quality genome drafts were accompanied by a few much smaller incomplete bins that were also annotated as belonging to these taxa. d) Counts of the number of genes present in the largest bin for a particular taxon and approach. The read cloud approach yields the bins containing the largest number of genes for the majority of taxa. The SLR bin annotated as Bacteroides uniformis in sample P1 contains more genes, but was determined to be 15% contaminated. This suggests that some of the genes assigned to the SLR bin for Bacteroides uniformis are likely from other organisms.
Figure 5. Comparisons of representative read cloud genome drafts to reference genomes, and corresponding short read and SLR drafts.
Dot-plot alignments between read cloud drafts (y-axis) and the closest available reference genome (x-axis) are shown. For each dot-plot, a given color corresponds to the alignment of a single contig in the read cloud draft against the available reference. Large-scale structural concordance, as well as differences including inversions, is visually apparent. Alignments of SLR and short read drafts to the read cloud drafts for each taxon are also shown. In all cases, read cloud drafts were the most contiguous. For each approach, contigs belonging to the largest genome bin for a particular taxon are given a darker color, and the rest of the contigs in other bins are represented with a lighter color.
Figure 6. Comparison of marine sediment genome drafts generated by read cloud sequencing with standard short-read vs. Athena assembly.
Athena read cloud assembly (gold) consistently produced more genome drafts than standard short-read assembly (blue), with genome bins assessed as genome drafts under various quality criteria. Athena read cloud assembly allowed significantly more 16S rRNA (16S) taxonomic sequences to be assigned to genome drafts than short-read assembly. The numbers of a) intermediate-quality (>70% completeness and <10% contamination) genome drafts, b) intermediate-quality genome drafts with assembled 16S rRNA sequences, and c) high-quality genome drafts with assembled 16S rRNA sequences with a minimum short read coverage depth are shown.
