Nat Biotechnol. 2012 Jul 1;30(7):701-707.
doi: 10.1038/nbt.2288.

A hybrid approach for the automated finishing of bacterial genomes

Ali Bashir et al. Nat Biotechnol.

Abstract

Advances in DNA sequencing technology have improved our ability to characterize most genomic diversity. However, accurate resolution of large structural events is challenging because of the short read lengths of second-generation technologies. Third-generation sequencing technologies, which can yield longer multikilobase reads, have the potential to address limitations associated with genome assembly. Here we combine sequencing data from second- and third-generation DNA sequencing technologies to assemble the two-chromosome genome of a recent Haitian cholera outbreak strain into two nearly finished contigs at >99.9% accuracy. Complex regions with clinically relevant structure were completely resolved. In separate control assemblies on experimental and simulated data for the canonical N16961 cholera reference strain, we obtained 14 scaffolds of greater than 1 kb for the experimental data and 8 scaffolds of greater than 1 kb for the simulated data, which allowed us to correct several errors in contigs assembled from the short-read data alone. This work provides a blueprint for the next generation of rapid microbial identification and full-genome assembly.
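The scaffold counts quoted in the abstract (scaffolds of greater than 1 kb) are a simple length-threshold summary of an assembly. As a minimal sketch of how such a count could be computed from FASTA-formatted scaffolds, and not the authors' pipeline, the helper names `fasta_lengths` and `count_longer_than` and the toy records below are hypothetical:

```python
from typing import Dict, Iterable


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {record header: sequence length in bp} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header text after '>'
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)  # sequence may span several lines
    return lengths


def count_longer_than(lengths: Dict[str, int], min_bp: int = 1000) -> int:
    """Count records strictly longer than min_bp (1 kb by default)."""
    return sum(1 for n in lengths.values() if n > min_bp)


# Toy input (made up for illustration, not data from the paper).
toy = [
    ">scaffold_a", "A" * 1500,
    ">scaffold_b", "ACGT" * 100,
    ">scaffold_c", "G" * 600, "C" * 600,
]
lengths = fasta_lengths(toy)
n_long = count_longer_than(lengths)
print(n_long)
```

With a real assembly, the lines would come from an open FASTA file rather than an in-memory list.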

Figures

Figure 1. H1 Assembly
The completely circularized chromosomes for H1. The outermost track (salmon) represents the circularized assembly with PacBio reads. The next track indicates points of Sanger validation between CDC contigs. The middle track (blue) indicates the position of CDC contigs, and the innermost track (green) shows Illumina contigs greater than 100 bp. The highlighted regions correspond to the genomic positions of the rRNA operons (Figure 2), CTX (Figure 3), superintegron (Figure 4), and ICE (Supplementary Figure 12). The origin of replication is located at 2.76 Mb in Chr1 and 632 kb in Chr2.
Figure 2. Resolution of rRNA genes
A) The locations of rRNA operons within H1 chromosome I. B) CDC contigs that flank the rRNA regions. C) Strobe reads that link two flanking regions, scaffolding over the 5–6 kb repeat. D) Long reads overhanging into the region, allowing the rRNA repeat substructure to be re-called. Here we show only the subset of reads 5 kb in length that have at least 2 kb of anchor sequence. E) CDC contigs internal to repeats. Note that not all repeat regions contain the same constituent contigs.
Figure 3. CTX/TLC Assembly and Validation
A) Alignment of C2 data (> 5 kb) on the CTX H1 assembly. B) Strobe and C) continuous reads were used to create an initial scaffold of the contigs within the CTX/TLC region. Concordant strobe reads (with spans between 5.5–7 kb) are shown over the region. C) Long reads were used to fill in gaps and resolve tandem repeat structures; selected long reads (> 1.5 kb) are shown in the region. D) Ordering and directionality of CDC contigs (colored directed blocks) and genes (small black arrows). Each CDC contig is given a different color to highlight repeated elements. E) PCR primers were designed to validate the region upstream of CTX as well as the TLC structure. F) PCR products were sequenced and mapped back to confirm the structure; a sampling of subreads (> 5 kb) that aligned to the products is shown.
Figure 4. Superintegron Assembly
A) C2 reads, B) strobes, and C) continuous reads were used to scaffold and fill in gaps across the superintegron region. The complexity of the region is highlighted by the number of CDC contigs in the region (D, E). D) The contigs scaffolded together: blue indicates positive-strand mappings, purple indicates negative-strand mappings. E) Contigs that are repeated, with linkages between the repeated positions. F) Repeats as identified by nucmer. Note that not all repeated contigs are necessarily repeats, as contigs may be present only in truncated forms.

