Genome Res. 2008 May;18(5):802-9.
doi: 10.1101/gr.072033.107. Epub 2008 Mar 10.

De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer

David Hernandez et al. Genome Res. 2008 May.

Abstract

Novel high-throughput DNA sequencing technologies allow researchers to characterize a bacterial genome in a single experiment and at moderate cost. However, the increase in sequencing throughput offered by such platforms comes at the expense of individual read length, and the reads must be assembled into longer contigs to be exploitable. This study focuses on the Illumina sequencing platform, which produces millions of very short sequences that are 35 bases in length. We propose de novo assembly software dedicated to processing such data. Based on a classical overlap graph representation and on the detection of potentially spurious reads, our software generates a set of accurate contigs of several kilobases that cover most of the bacterial genome. The assembly results were validated by comparing data sets obtained experimentally for Staphylococcus aureus strain MW2 and Helicobacter acinonychis strain Sheeba with their published genomes, which were acquired by conventional sequencing of 1.5- to 3.0-kb fragments. We also provide indications that the broad coverage achieved by high-throughput sequencing might allow for the detection of clonal polymorphisms in the set of DNA molecules being sequenced.
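
To make the "classical overlap graph representation" mentioned in the abstract concrete, the following is a minimal sketch of how exact suffix-prefix overlaps between short reads could be collected into graph edges. It is not the authors' implementation: the function name `exact_overlaps`, the `min_overlap` parameter, and the toy reads are assumptions made for illustration, and reverse-complement strands are ignored for brevity.

```python
from collections import defaultdict


def exact_overlaps(reads, min_overlap):
    """Return {(i, j): k} where the last k bases of reads[i] equal the
    first k bases of reads[j], keeping only the longest k per pair.

    Illustrative sketch only: forward strand, exact matches, no error handling.
    """
    # Index every proper prefix of length >= min_overlap by its sequence.
    prefix_index = defaultdict(list)
    for j, read in enumerate(reads):
        for k in range(min_overlap, len(read)):
            prefix_index[read[:k]].append(j)

    edges = {}
    for i, read in enumerate(reads):
        # Try the longest suffix first so the first hit per pair is the longest overlap.
        for k in range(len(read) - 1, min_overlap - 1, -1):
            for j in prefix_index.get(read[-k:], ()):
                if i != j and (i, j) not in edges:
                    edges[(i, j)] = k
    return edges


# Hypothetical 6-base reads; real data in the study are 35-base reads.
toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]
toy_edges = exact_overlaps(toy_reads, min_overlap=3)
```

Each key `(i, j)` is a directed edge from read `i` to read `j`, and the value is the overlap length. A hash-based prefix index is only one of several ways to find such overlaps.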

Figures

Figure 1.
Mapping of the contigs on the reference Staphylococcus aureus MW2 genome. (A) From the outermost to the innermost, the circles correspond to the contigs produced by (1) Edena strict, (2) Velvet, (3) Edena nonstrict, (4) SSAKE, and (5) SHARCGS. The contigs are drawn in two alternating colors so that contig boundaries can be distinguished. The innermost circle shows the coding sequences. The gaps in the Edena nonstrict assembly correspond to large misassembled contigs that did not map properly to the reference genome. (B) Magnification of the region around the origin of replication, which gives a better view for comparing contig length and layout between the different assembly methods. The contigs assembled by Edena and Velvet are long enough to reveal entire genes. More importantly, significant overlaps exist between the contigs assembled by the two programs, which also means that even larger contigs could be assembled by merging both approaches. The position of the SCCmec cassette of type IV.1 (Chongtrakool et al. 2006) is indicated by the red line.
Figure 2.
Removing transitive edges. A read r1 and the 13 other reads (r2 . . . r14) that overlap its right end are shown in the form of a multiple alignment. The overlaps that do not correspond to transitive edges are indicated with a black dot. The transitive edge removal procedure consists of discarding the overlaps with reads that are already overlapped by another read involved in a larger overlap with r1. For example, the reads r4, r6, r7, r10, r11, r13, and r14 are overlapped by r2; they are therefore removed from the set of reads overlapping r1. The same principle is applied to the reads r3, r5, and r8. This example is taken from a real data set of 26-base reads.
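
The rule described in the Figure 2 caption can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the published code: the graph is assumed to be a dictionary mapping each read to its right-end neighbors and overlap lengths, and the name `remove_transitive_edges` is hypothetical. An edge from `a` to `c` is treated as transitive when some other neighbor `b` has a larger overlap with `a` and itself overlaps `c`.

```python
def remove_transitive_edges(successors):
    """successors: {read: {neighbor: overlap_length}} for right-end overlaps.

    Returns a reduced copy in which an edge a -> c is dropped whenever another
    neighbor b of a has a larger overlap with a and b also overlaps c.
    Sketch only: overlap offsets are not checked for consistency.
    """
    reduced = {}
    for a, outs in successors.items():
        transitive = set()
        for b, overlap_ab in outs.items():
            for c in successors.get(b, {}):
                # c is already reachable through b, which overlaps a more strongly.
                if c in outs and outs[c] < overlap_ab:
                    transitive.add(c)
        reduced[a] = {b: o for b, o in outs.items() if b not in transitive}
    return reduced


# Hypothetical three-read example: r1 -> r3 is implied by r1 -> r2 -> r3.
toy_graph = {
    "r1": {"r2": 30, "r3": 25},
    "r2": {"r3": 30},
    "r3": {},
}
toy_reduced = remove_transitive_edges(toy_graph)
```

In the toy graph only the edges r1 -> r2 and r2 -> r3 survive, mirroring how r2 absorbs several of the reads overlapping r1 in the figure.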
Figure 3.
Removing short dead-end paths. (A) Possible path elongations from the right end of the read r1 are represented by a tree. Nodes that are removed are dashed. Each path leaving a branching node (shown in gray) is tested to see whether it can reach a minimum depth md. If the required depth md cannot be reached, the nodes forming the dead-end path are removed. (B) Multiple sequence alignment of the reads belonging to the possible right-end elongations of the read r1. The residues that do not agree with the consensus sequence are shaded. The depth value that can be reached by continuing the elongation from the corresponding read is indicated on the right. The reads containing one or more mismatched residues have a low or null depth value, indicating that no exact overlap exists for their right end in the entire read data set. These reads are likely to contain sequencing errors.
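
The depth test in the Figure 3 caption can be sketched as a bounded depth-first search. The code below is an illustration under assumptions rather than the authors' procedure: the graph is a dictionary of successor lists, depth is counted in nodes, and the names `max_depth` and `remove_dead_ends` are hypothetical. How the original software treats a branching node whose paths all fall short of md is not stated in the caption; this sketch simply removes every path that falls short.

```python
def max_depth(graph, node, limit):
    """Number of nodes on the longest path starting at node, capped at limit."""
    if limit == 0:
        return 0
    best = 0
    for nxt in graph.get(node, ()):
        best = max(best, max_depth(graph, nxt, limit - 1))
        if best == limit - 1:
            break  # the cap is reached, no need to explore further
    return 1 + best


def remove_dead_ends(graph, md):
    """Drop every path that leaves a branching node and cannot reach depth md."""
    doomed = set()
    for node, outs in graph.items():
        if len(outs) < 2:
            continue  # only branching nodes are tested
        for start in outs:
            if max_depth(graph, start, md) < md:
                # Collect the whole short path hanging off this branch.
                stack = [start]
                while stack:
                    n = stack.pop()
                    if n not in doomed:
                        doomed.add(n)
                        stack.extend(graph.get(n, ()))
    return {
        n: [s for s in outs if s not in doomed]
        for n, outs in graph.items()
        if n not in doomed
    }


# Hypothetical graph: the branch r1 -> x -> y stops after two nodes.
toy_graph = {
    "r1": ["a", "x"],
    "a": ["b"], "b": ["c"], "c": ["d"], "d": [],
    "x": ["y"], "y": [],
}
toy_pruned = remove_dead_ends(toy_graph, md=3)
```

With `md=3` the two-node branch through x and y is removed, while the longer path through a, b, c, and d is kept.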
Figure 4.
Fixing bubbles. This illustration shows a bubble caused by a polymorphism. This example is one of the many that can be found in the overlap graph constructed from the Staphylococcus aureus strain MW2 read data set. (A) The 24 reads implicated in the bubble are shown. r1 and r24 are the ends of the bubble, which is 35 × 2 + 1 bp in length. Reads showing the polymorphism are r11 to r15. None of these reads has an exact occurrence in the published genome sequence of S. aureus strain MW2. (B) The corresponding transitively reduced overlap graph is shown. Taking read redundancy into account, the total number of reads on the weakly and highly covered sides is five and 27, respectively. Fixing a bubble consists of removing the nodes that form the less covered side of the bubble.
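
The coverage comparison in the Figure 4 caption can be sketched as follows. This is an illustration under assumptions, not the published method: the two sides of the bubble are assumed to be already identified and passed in as node lists, read redundancy is modeled as a per-node count in a hypothetical `multiplicity` dictionary, and the function name `fix_bubble` is invented for this example. Detecting the bubble itself is outside the sketch.

```python
def fix_bubble(graph, side_a, side_b, multiplicity):
    """Remove the less covered of two alternative paths between the same ends.

    graph: {node: [successors]}; side_a, side_b: interior nodes of each path;
    multiplicity: {node: number of identical reads collapsed into the node}.
    On a tie, side_b is removed (an arbitrary choice made for this sketch).
    """
    cover_a = sum(multiplicity.get(n, 1) for n in side_a)
    cover_b = sum(multiplicity.get(n, 1) for n in side_b)
    weak = set(side_a if cover_a < cover_b else side_b)
    return {
        n: [s for s in outs if s not in weak]
        for n, outs in graph.items()
        if n not in weak
    }


# Hypothetical miniature bubble using the read totals quoted in the caption.
toy_graph = {"r1": ["low", "high"], "low": ["r24"], "high": ["r24"], "r24": []}
toy_counts = {"low": 5, "high": 27}
toy_fixed = fix_bubble(toy_graph, ["low"], ["high"], toy_counts)

# A single-base difference spans one read ending just before it, the base
# itself, and one read starting just after it: 35 + 1 + 35 bases.
bubble_length = 35 * 2 + 1
```

Here the side supported by five reads is discarded in favor of the side supported by 27.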

References

    1. Audic S., Robert C., Campagna B., Parinello H., Claverie J.M., Raoult D., Drancourt M. Genome analysis of Minibacterium massiliensis highlights the convergent evolution of water-living bacteria. PLoS Genet. 2007;3:e138. doi: 10.1371/journal.pgen.0030138.
    2. Baba T., Takeuchi F., Kuroda M., Yuzawa H., Aoki K., Oguchi A., Nagai Y., Iwama N., Asano K., Naimi T., et al. Genome and virulence determinants of high virulence community-acquired MRSA. Lancet. 2002;359:1819–1827.
    3. Bentley D.R. Whole-genome re-sequencing. Curr. Opin. Genet. Dev. 2006;16:545–552.
    4. Brenner S., Johnson M., Bridgham J., Golda G., Lloyd D.H., Johnson D., Luo S.J., McCurdy S., Foy M., Ewan M., et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 2000;18:630–634.
    5. Chongtrakool P., Ito T., Ma X.X., Kondo Y., Trakulsomboon S., Tiensasitorn C., Chavalit T., Song J.H., Hiramatsu K. Staphylococcal cassette chromosome mec (SCCmec) typing of methicillin-resistant Staphylococcus aureus strains isolated in 11 Asian countries: A proposal for a new nomenclature for SCCmec elements. Antimicrob. Agents Chemother. 2006;50:1001–1012.
