Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2009 Feb;19(2):336-46.
doi: 10.1101/gr.079053.108. Epub 2008 Dec 3.

De novo fragment assembly with short mate-paired reads: Does the read length matter?

Affiliations

De novo fragment assembly with short mate-paired reads: Does the read length matter?

Mark J Chaisson et al. Genome Res. 2009 Feb.

Abstract

Increasing read length is currently viewed as the crucial condition for fragment assembly with next-generation sequencing technologies. However, introducing mate-paired reads (separated by a gap of length, GapLength) opens a possibility to transform short mate-pairs into long mate-reads of length approximately GapLength, and thus raises the question as to whether the read length (as opposed to GapLength) even matters. We describe a new tool, EULER-USR, for assembling mate-paired short reads and use it to analyze the question of whether the read length matters. We further complement the ongoing experimental efforts to maximize read length by a new computational approach for increasing the effective read length. While the common practice is to trim the error-prone tails of the reads, we present an approach that substitutes trimming with error correction using repeat graphs. An important and counterintuitive implication of this result is that one may extend sequencing reactions that degrade with length "past their prime" to where the error rate grows above what is normally acceptable for fragment assembly.

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
The positional profile of base-calling errors for Illumina reads for 2 million 50-nt-long reads from a human BAC. The error rate across reads is shown (solid line) along with the error rate for reads with a fixed number of errors. The erroneous nucleotides in each read are detected by mapping the read to the reference genome. The high error rate in position 6 is due to the bias in our particular data set rather than a systematic problem with the Illumina technology.
Figure 2.
Figure 2.
From de Bruijn graphs to repeat graphs. The de Bruijn graph of a sequence contains a vertex for every k-mer in the sequence, and an edge (u, v) for every pair of consecutive (overlapping) k-mers in the sequence (A). The condensed de Bruijn graph replaces all paths containing nonbranching vertices by a single edge labeled by the sequence that generated the path (B). When the condensed de Bruijn graph is constructed on a genome, it contains some small bulges and whirls representing repeats with slightly varying repeat copies (C). In the repeat graph, the bulges and whirls are removed (E). The de Bruijn graph of reads contains additional spurious bulges and whirls caused by sequencing errors in reads (D). The goal of the Eulerian assembly is to construct the repeat graph of reads (F) that approximates the repeat graph of the genome. Different studies use different terminology, e.g., the edges of these graphs are referred to as “blocks” in Zerbino and Birney (2008) and “unipaths” in Butler et al. (2008).
Figure 3.
Figure 3.
Choosing the multiplicity threshold for error correction. All k-mers appearing in the reads are classified as correct if they appear in the genome, and incorrect otherwise. For a multiplicity x, let correct(x)/incorrect(x) be the number of correct/incorrect k-mers with multiplicity x (the plots are shown for 50-base long Illumina reads from a human BAC and k = 20). As expected, most high-multiplicity k-mers are correct and most low-multiplicity k-mers are incorrect. A Poisson/Gaussian mixture model was fit to the distribution of all k-mer multiplicities in order to model the process of generating incorrect (Poisson) and correct k-mers (Gaussian). To show the fit of the model, the k-mer multiplicities were generated according to the estimated parameters λ = 0.95, μ = 25, and σ = 9.38 with a mixing parameter w = 0.95. One may find the multiplicity m with good separation between correct and incorrect k-mers by estimating the first local minimum from the distribution of k-mer counts or the minimum of the sum of the probabilities of the mixture model, a more smooth distribution. For multiplicity threshold m = 5, only 0.6% of correct 20-mers have multiplicity <5 and only 0.3% of incorrect 20-mers have multiplicity ≥5.
Figure 4.
Figure 4.
(A) A fragment of a made-up repeat graph formed by three divergent copies of a repeat. There are many possible paths from readstart to readend. To transform the mate-pair, readstar -GAP of length d-readend into a mate-read readstart-SEQUENCE of length d-readend, we compute the support for every path between readstart and readend and select a path with maximum support. In this example, the “red” path P 1 has greater support than the “blue” path P 2. (B) A fragment of the real repeat graph of E. coli (constructed from ECOLI data set) illustrating that transformation of mate-pairs into mate-reads may fail in some cases. Red edges represent unique (typically long) contigs, while black edges represent repeats.
Figure 5.
Figure 5.
Mapping reads to the paths in the repeat graph. (A) A read maps to a single edge. (B) A read maps to two paths, and the closest one is chosen. (C) A read may be mapped to two similar paths implying that trimming is required.
Figure 6.
Figure 6.
Comparison of EULER-USR and Velvet using both paired reads and the same reads with mate-pair information removed (ECOLI data set). The contigs are ordered in the decreasing order of sizes and the cumulative size of x longest contigs are shown on the y-axis.
Figure 7.
Figure 7.
Comparison of EULER-USR (threading) and Velvet. In each plot, the contigs are ordered in the decreasing order of size and the cumulative size of x longest contigs are shown on the y-axis (only contigs longer than 100 bases are shown). See Table 3 for the choice of parameters of all programs in these plots. For all assemblies of the BAC, the locations of the contigs closest to lengths 5000, 2000, 1000, and 500 bases are shown with a “+” mark.
Figure 8.
Figure 8.
Statistics of assembly for various read coverage (E. coli genome). The cumulative length of contigs in order of decreasing length is shown for the 1000 longest contigs. The cumulative length of contigs of the repeat graph on the genome is shown as a dashed line.

References

    1. Barski A., Cuddapah S., Cui K., Roh T.Y., Schones D.E., Wei G., Chepelev I., Zhao K. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129:823–837. - PubMed
    1. Butler J., MacCallum I., Kleber M., Shlyakhter I.A., Belmonte M.K., Lander E.S., Nusbaum C., Jaffe D.B. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 2008;18:810–820. - PMC - PubMed
    1. Chaisson M.J., Pevzner P.A. Short read fragment assembly of bacterial genomes. Genome Res. 2008;18:324–330. - PMC - PubMed
    1. Chaisson M.J., Tang H., Pevzner P.A. Fragment assembly with short reads. Bioinformatics. 2004;20:2067–2074. - PubMed
    1. Dohm J.C., Lottaz C., Borodina T., Himmelbauer H. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. 2007;17:1697–1706. - PMC - PubMed

Publication types

Substances

LinkOut - more resources