Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2009 Aug;16(8):1101-16.
doi: 10.1089/cmb.2009.0047.

Maximum likelihood genome assembly

Affiliations

Maximum likelihood genome assembly

Paul Medvedev et al. J Comput Biol. 2009 Aug.

Abstract

Whole genome shotgun assembly is the process of taking many short sequenced segments (reads) and reconstructing the genome from which they originated. We demonstrate how the technique of bidirected network flow can be used to explicitly model the double-stranded nature of DNA for genome assembly. By combining an algorithm for the Chinese Postman Problem on bidirected graphs with the construction of a bidirected de Bruijn graph, we are able to find the shortest double-stranded DNA sequence that contains a given set of k-long DNA molecules. This is the first exact polynomial time algorithm for the assembly of a double-stranded genome. Furthermore, we propose a maximum likelihood framework for assembling the genome that is the most likely source of the reads, in lieu of the standard maximum parsimony approach (which finds the shortest genome subject to some constraints). In this setting, we give a bidirected network flow-based algorithm that, by taking advantage of high coverage, accurately estimates the copy counts of repeats in a genome. Our second algorithm combines these predicted copy counts with matepair data in order to assemble the reads into contigs. We run our algorithms on simulated read data from Escherichia coli and predict copy counts with extremely high accuracy, while assembling long contigs.

PubMed Disclaimer

Figures

FIG. 1.
FIG. 1.
(A) An example of double stranded DNA. The sequence read from this DNA can be either ATTGCC or GGCAAT. (B) Three possible types of overlaps between two reads: each read can be in either of two orientations, but two of the cases (both to the left and both to the right) are symmetric. (C) The three corresponding types of bidirected edges. The left node corresponds to the lower read. Note that the arrow points into a node if and only if the overlap covers the start (5′) of the read.
FIG. 2
FIG. 2
(A) This is an example of a bidirected graph. By convention, we draw edges that are positive(negative) incident to a vertex using an out(in) arrow at that vertex. The sequence W, A, X, B, Y, C, Y, B, X, D, Z is a walk, while W, A, X, D, Z is not. (B) The corresponding edge incidence matrix, with the zero entries ommited. (C) This is the associated monotonized directed graph.
FIG. 3.
FIG. 3.
Given the k-molecule-spectrum {ATT/AAT, TGC/GCA, GCC/GGC, CCA/TGG, CAA/TTG, AAC/GTT}, the approach of Pevzner et al. (2001) builds the graph on the left, and searches for two “complementary” walks. The bidirected de Bruijn graph is on the right. One walk that includes all of the edges spells ATTGCCAAC, and its reverse walk spells GTTGGCAAT.
FIG. 4.
FIG. 4.
(A–C) These demonstrate the three cases of graph simplification described in Section 7. Case A is a chain, case B a loop attached to a chain, and case C is a split vertex. A join vertex case is symmetrical and is not shown. The three simplifications are shown to the right. In all cases, the new graph can “spell” the exact same strings as the initial graph. (D) This is a conflict node. By iterative application of cases A, B, and C, we generate a graph where all remaining vertices are of type D.
FIG. 5.
FIG. 5.
Resolution of a conflict vertex: we take a incoming and an outgoing edge from the conflict vertex v, and take a pair of reads along them, such as A and B. If these edges belong together in our assembly, then A and B must be close to each other in the source genome. Consequently their mates A′ and B′ should also be close to each other (assuming A and B have the same orientation) in the source genome. We locate A′ and B′ in our graph, and look for the shortest path between them. If the distance is less than twice the error of the insert size, we consider it a evidence for joining the corresponding edges at v.

Similar articles

Cited by

References

    1. Ahuja R.K. Magnanti T.L. Orlin J.B. Network flows: theory, algorithms, and applications. Prentice-Hall; Upper Saddle River, NJ: 1993.
    1. Appa G. Kotnyek B. A bidirected generalization of network matrices. Networks. 2006;47:185–198.
    1. Butler J. Maccallum I. Kleber M., et al. Allpaths: de novo assembly of whole-genome shotgun microreads. Genome Res. 2008;18:810–820. - PMC - PubMed
    1. Chaisson M. Pevzner P.A. Tang H. Fragment assembly with short reads. Bioinformatics. 2004;20:2067–2074. - PubMed
    1. Chaisson M.J. Pevzner P.A. Short read fragment assembly of bacterial genomes. Genome Res. 2008;18:324–330. doi: 10.1101/gr.7088808. - DOI - PMC - PubMed

Publication types

LinkOut - more resources