Nucleic Acids Res. 2013 Apr;41(7):e75.
doi: 10.1093/nar/gkt003. Epub 2013 Jan 18.

Co-phylog: an assembly-free phylogenomic approach for closely related organisms

Huiguang Yi et al. Nucleic Acids Res. 2013 Apr.

Abstract

With the advent of high-throughput sequencing technologies, the rapid generation and accumulation of large amounts of sequencing data create a pressing demand for efficient algorithms for constructing whole-genome phylogenies. Existing phylogenomic methods all use assembled sequences, which are often unavailable owing to the difficulty of assembling short reads; this obstructs phylogenetic investigations of species without a reference genome. In this report, we present co-phylog, an assembly-free phylogenomic approach that creates a 'micro-alignment' at each 'object' in the sequence using the 'context' of the object, calculates pairwise distances and then reconstructs the phylogenetic tree from those distances. We explored the use of the parameters and the optimal working range of co-phylog, and assessed it using both simulated next-generation sequencing (NGS) data and real raw NGS data. We also compared co-phylog with traditional alignment-based and alignment-free methods and illustrated its advantages and limitations. In conclusion, we demonstrated that co-phylog is an efficient algorithm that delivers high-resolution, accurate phylogenies from unassembled whole-genome sequencing data, especially for closely related organisms, thereby significantly alleviating the computational burden in the genomic era.
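The context/object idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a structure such as C2,2O1 means two context bases on each side of a one-base object, that a context seen with more than one object within a genome is discarded, and that the distance is the fraction of shared contexts whose objects differ. The function names `context_object_pairs` and `co_distance` are hypothetical, and reverse complements and sequencing errors are ignored.

```python
def context_object_pairs(seq, left=2, right=2, obj=1):
    """Map each context (left flank, right flank) to the object between them.

    Assumption: a context that occurs with more than one object in the
    same sequence is treated as ambiguous and dropped.
    """
    k = left + obj + right
    pairs = {}
    ambiguous = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        context = (kmer[:left], kmer[left + obj:])
        o = kmer[left:left + obj]
        if context in ambiguous:
            continue
        if context in pairs and pairs[context] != o:
            # Same context, different object: discard this context.
            del pairs[context]
            ambiguous.add(context)
        else:
            pairs[context] = o
    return pairs


def co_distance(seq1, seq2, left=2, right=2, obj=1):
    """Fraction of shared contexts whose objects differ (assumed definition)."""
    p1 = context_object_pairs(seq1, left, right, obj)
    p2 = context_object_pairs(seq2, left, right, obj)
    shared = p1.keys() & p2.keys()
    if not shared:
        raise ValueError("no shared contexts; sequences may be too divergent")
    differing = sum(1 for c in shared if p1[c] != p2[c])
    return differing / len(shared)


if __name__ == "__main__":
    g1 = "ACGTACCTGA"
    g2 = "ACGTTCCTGA"  # one substitution relative to g1
    print(co_distance(g1, g2))  # 0.5: two shared contexts, one differing object
```

On such short toy sequences the value is dominated by the small number of shared contexts; with realistic genomes and longer contexts (for example C9,9O1), far more contexts are shared and the fraction is more stable.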

Figures

Figure 1.
The algorithm overview. (a) Some examples of structure S. (b) The k-tuple sets Hk,G1 and Hk,G2 generated from genome G1 and genome G2, respectively, given a structure S = C2,2O1. (c) C-gram–O-gram pairs generated from the corresponding k-tuple sets. (d) Context–object pairs generated from the corresponding C-gram–O-gram pairs. (e) Shared contexts and their corresponding objects in G1 and G2. (f) The computation of the context–object distance between G1 and G2.
Figure 2.
Comparison of the alignment-based tree and the co-phylog trees constructed with different structures on the 13 Brucella genomes. All the trees share the same list of organisms. The Ochrobactrum anthropi genome is used as the out-group taxon.
Figure 3.
(a) The benchmark tree constructed from a multiple-genome alignment and the trees constructed by the three methods, co-phylog (S = C9,9O1), CVtree and Kr, on the 26 Escherichia/Shigella genomes. The number near each node represents the bootstrap value (see Doc. S1 for details). (b) The symmetric differences between the benchmark tree and the trees constructed by the three methods, co-phylog, CVtree and Kr. (c) Correlation analyses between the p-distance and each of the three distances, co-distance, CVtree-distance and Kr-distance. These four types of distances are generated from the pairwise comparisons of the 26 Escherichia coli/Shigella genomes, using multiple-genome alignment, co-phylog, CVtree and Kr, respectively.
Figure 4.
Comparison between the 16S rDNA tree and the co-phylog tree, constructed on the 63 Enterobacteriaceae genomes. The number near each node represents the bootstrap value (see Supplementary Data for details).
Figure 5.
The change in the co-distance and in the log of the number of common contexts computed between two genomes evolved in silico, with gradually increasing evolutionary divergence (substitutions per codon), using two structures, S = C9,9O1 and C12,12O1.
Figure 6.
Comparison between the co-phylog tree constructed using the assembled genomes of the 29 E. coli organisms and the co-phylog tree constructed using their corresponding raw NGS data. The Escherichia fergusonii genome is used as the out-group taxon.

