Phylogenetic comparative assembly

Peter Husemann et al. Algorithms Mol Biol. 2010 Jan 4;5:3.
doi: 10.1186/1748-7188-5-3.

Abstract

Background: Recent high-throughput sequencing technologies are capable of generating huge amounts of data for bacterial genome sequencing projects. Although current sequence assemblers successfully merge the overlapping reads, several contigs often remain that cannot be assembled any further. Closing all the gaps to obtain the whole genomic sequence is still costly and time-consuming.

Results: Here we propose an algorithm that takes several related genomes and their phylogenetic relationships into account to create a graph that contains the likelihood for each pair of contigs to be adjacent. Subsequently, this graph can be used to compute a layout graph that shows the most promising contig adjacencies in order to aid biologists in finishing the complete genomic sequence. The layout graph shows unique contig orderings where possible, and the best alternatives where necessary.
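As a rough illustration of how a weighted contig adjacency graph can be reduced to a layout, the sketch below greedily keeps the highest-scoring edges while allowing each contig at most two neighbours and rejecting edges that would close a cycle. This is a simplification for illustration only and not the authors' heuristic: the function name `greedy_layout`, the edge scores, and the contig names are hypothetical, and unlike the layout graph described above it does not retain alternative adjacencies.

```python
def greedy_layout(weights):
    """Pick high-scoring adjacencies so that every contig has at most two neighbours.

    weights: dict mapping a pair (contig_a, contig_b) to an adjacency score
             (hypothetical values; higher means more likely adjacent).
    Returns the accepted edges as (contig_a, contig_b, score) in acceptance order.
    """
    degree = {}   # number of accepted neighbours per contig
    parent = {}   # union-find forest used to reject cycle-closing edges

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    chosen = []
    # Visit edges from the highest to the lowest score.
    for (a, b), score in sorted(weights.items(), key=lambda kv: -kv[1]):
        if degree.get(a, 0) >= 2 or degree.get(b, 0) >= 2:
            continue  # one endpoint already has both neighbours
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # edge would close a cycle
        parent[root_a] = root_b
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
        chosen.append((a, b, score))
    return chosen


# Hypothetical scores for four contigs.
example = {
    ("c1", "c2"): 9.0,
    ("c2", "c3"): 7.5,
    ("c3", "c4"): 6.0,
    ("c2", "c4"): 5.0,
    ("c1", "c3"): 2.0,
}
layout = greedy_layout(example)
for a, b, score in layout:
    print(a, b, score)
```

With these example scores the accepted edges form the single path c1, c2, c3, c4. A bacterial chromosome is usually circular, so a fuller treatment would allow one final cycle-closing edge.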

Conclusions: Our new algorithm for contig ordering uses sequence similarity as well as phylogenetic information to estimate adjacencies of contigs. An evaluation of our implementation shows that it performs better than recent approaches while also being much faster.


Figures

Figure 1
Projection of a match. Projections π(m1) and π(m2) of the contigs c1 and c2 based on their matches m1 and m2. The distance d reflects the displacement of the projections.
Figure 2
Insertion distance. (a) An insertion in the reference genome leads to a positive distance, whereas (b) an insertion in a contig leads to a negative distance.
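The captions of Figures 1 and 2 can be made concrete with a small sketch. It assumes that a match maps a half-open interval of a contig onto a half-open interval of the reference in forward orientation, and that the projection extends the matched reference interval by the unmatched contig flanks. The `Match` fields and the functions `project` and `distance` are hypothetical names chosen for illustration, and the paper's exact definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """A forward-strand match between a contig and the reference (assumed model)."""
    contig_len: int    # total length of the contig
    contig_start: int  # matched interval on the contig, half-open
    contig_end: int
    ref_start: int     # matched interval on the reference, half-open
    ref_end: int


def project(m):
    """Reference interval the whole contig would occupy if placed by this match."""
    start = m.ref_start - m.contig_start
    end = m.ref_end + (m.contig_len - m.contig_end)
    return start, end


def distance(m1, m2):
    """Displacement d between the projection of contig 1 and that of contig 2.

    Positive: a gap lies between the projections (e.g. an insertion in the reference).
    Negative: the projections overlap (e.g. an insertion in a contig).
    """
    _, end1 = project(m1)
    start2, _ = project(m2)
    return start2 - end1


# Hypothetical coordinates.
m1 = Match(contig_len=1000, contig_start=100, contig_end=900, ref_start=5100, ref_end=5900)
m2 = Match(contig_len=800, contig_start=0, contig_end=800, ref_start=6200, ref_end=7000)
m3 = Match(contig_len=800, contig_start=300, contig_end=800, ref_start=6200, ref_end=6700)

print(project(m1))       # projection of contig 1
print(distance(m1, m2))  # gap between the projections
print(distance(m1, m3))  # overlapping projections
```

Here `m2` starts 200 bases after the projected end of the first contig, giving a positive distance, while the unmatched 300-base prefix of the contig in `m3` pushes its projection back over the first one, giving a negative distance.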
Figure 3
Contig adjacency discovery algorithm. Heuristic to compute the layout graph which shows the most promising contig adjacencies.
Figure 4
Phylogenetic tree. Phylogenetic tree of the employed Corynebacteria. For all species marked with an asterisk (*), the underlying contig data were available. The tree was calculated with EDGAR [18]; the image was generated with PHY.FI [23].
Figure 5
Synteny plots. Pairwise synteny plots of the contigs of C. urealyticum and four chosen complete genomes of the Corynebacteria genus. The contigs are stacked on the vertical axis in reference order, separated by horizontal lines. The ticks below each synteny plot indicate uncovered regions.
Figure 6
PGA with perfect reference. C. urealyticum contig connections generated by PGA when using the finished genome as reference sequence. Here, the best result (25 TP, 31 FP) achieved in 20 runs is displayed. The contig nodes are numbered in reference order.
Figure 7
PGA with multiple references. C. urealyticum contig connections generated by PGA when using all other genomes as reference sequences. Here, the best result (25 TP, 76 FP) achieved in 20 runs is displayed. The contig nodes are numbered in reference order.
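Because the contig nodes are numbered in reference order, counts such as the TP and FP figures quoted in these captions can be reproduced from a list of predicted connections. The sketch below assumes that a connection is a true positive exactly when its two contigs are consecutive in reference order, with the last and first contig also counting as neighbours on a circular chromosome. That criterion and the function name `count_tp_fp` are assumptions made for illustration; the paper's evaluation may define the counts differently.

```python
def count_tp_fp(edges, n_contigs, circular=True):
    """Count true and false positive connections between numbered contigs.

    edges:     iterable of (i, j) pairs of contig numbers in reference order,
               numbered 0 .. n_contigs - 1 (assumed numbering).
    circular:  if True, contig 0 and contig n_contigs - 1 count as neighbours.
    """
    tp = fp = 0
    for i, j in edges:
        gap = abs(i - j)
        adjacent = i != j and (gap == 1 or (circular and gap == n_contigs - 1))
        if adjacent:
            tp += 1
        else:
            fp += 1
    return tp, fp


# Hypothetical predictions for five contigs.
predicted = [(0, 1), (1, 2), (0, 4), (1, 3)]
print(count_tp_fp(predicted, n_contigs=5))
print(count_tp_fp(predicted, n_contigs=5, circular=False))
```

In the example, the connection between contigs 0 and 4 counts as correct only under the circular assumption.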
Figure 8
treecat with multiple references. C. urealyticum contig connections generated by treecat when using all other genomes as reference sequences. The contig nodes are numbered in reference order. Contigs smaller than 3.5 kb have gray nodes; repeating contigs, for which at least 95% of the sequence occurs more than once on a reference genome, have rectangular nodes. Edge weights are given on a logarithmic scale.

References

    1. Staden R. A strategy of DNA sequencing employing computer programs. Nucleic Acids Res. 1979;6(7):2601–2610. doi: 10.1093/nar/6.7.2601.
    2. Anderson S. Shotgun DNA sequencing using cloned DNase I-generated fragments. Nucleic Acids Res. 1981;9(13):3015–3027. doi: 10.1093/nar/9.13.3015.
    3. Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008;24(3):133–141.
    4. Pop M, Salzberg SL. Bioinformatics challenges of new sequencing technology. Trends Genet. 2008;24(3):142–149.
    5. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA. 1977;74(12):5463–5467. doi: 10.1073/pnas.74.12.5463.