Evol Bioinform Online. 2018 Feb 20:14:1176934318759299.
doi: 10.1177/1176934318759299. eCollection 2018.

SWPhylo - A Novel Tool for Phylogenomic Inferences by Comparison of Oligonucleotide Patterns and Integration of Genome-Based and Gene-Based Phylogenetic Trees

Xiaoyu Yu et al. Evol Bioinform Online.

Abstract

Modern phylogenetic studies may benefit from the analysis of complete genome sequences of various microorganisms. Evolutionary inferences based on genome-scale analysis are believed to be more accurate than the gene-based alternative. However, the computational complexity of current phylogenomic procedures, the unsuitability of standard phylogenetic tools for processing genome-wide data, and the lack of reliable substitution models that correlate with alignment-free phylogenomic approaches deter microbiologists from using these opportunities. For example, the super-matrix and super-tree approaches of phylogenomics use multiple integrated genomic loci or individual gene-based trees to infer an overall consensus tree. However, these approaches potentially multiply errors of gene annotation and sequence alignment, not to mention the computational complexity and laboriousness of the methods. In this article, we demonstrate that the annotation- and alignment-free comparison of genome-wide tetranucleotide frequencies, termed oligonucleotide usage patterns (OUPs), allows a fast and reliable inference of phylogenetic trees. These trees were congruent with the corresponding whole-genome super-matrix trees in terms of tree topology when compared with other known approaches, including 16S ribosomal RNA and GyrA protein sequence comparison, complete genome-based MAUVE, and CVTree methods. A Web-based program to perform the alignment-free OUP-based phylogenomic inferences was implemented at http://swphylo.bi.up.ac.za/. The applicability of the tool was tested on different taxa from subspecies to intergeneric levels. Resolution between closely related taxonomic units may be improved by providing the program with alignments of marker protein sequences, eg, GyrA.
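The idea of an alignment-free comparison of tetranucleotide frequencies can be illustrated with a minimal sketch. It assumes plain relative 4-mer frequencies and a Euclidean distance between them; the abstract does not specify the OUP statistic or distance measure that SWPhylo actually uses, so the function names `tetranucleotide_frequencies` and `pattern_distance` and the choice of distance are hypothetical.

```python
from collections import Counter
from itertools import product
import math

# All 256 tetranucleotides in a fixed order, so that frequency
# vectors from different genomes are directly comparable.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Relative frequency of each 4-mer in overlapping windows.

    Windows containing characters other than A, C, G, T
    (for example N) are ignored. This is an illustrative
    simplification, not the published OUP statistic.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def pattern_distance(seq_a, seq_b):
    """Euclidean distance between two frequency vectors (assumed metric)."""
    fa = tetranucleotide_frequencies(seq_a)
    fb = tetranucleotide_frequencies(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

A matrix of such pairwise distances between genomes could then be passed to a distance-based tree-building method such as Neighbour Joining; no gene annotation or sequence alignment is involved at any step.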

Keywords: Phylogenomics; bacterial evolution; computational algorithm; evolutionary model; oligonucleotide usage pattern.

Conflict of interest statement

Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1.
Distribution of symmetric distances between COG-based and genome-based trees compared with the reference trees: WGS (left part of the figure) and GyrA (right part of the figure). Columns in the histograms depict numbers of trees equally distant from the reference trees. The columns containing the OUP-, Mauve-, GyrA-, and 16S rRNA-based trees are marked in the graphs, respectively. COG indicates clusters of orthologous genes; OUP, oligonucleotide usage pattern; rRNA, ribosomal RNA; WGS, whole genome sequence.
Figure 2.
Topological similarity based on symmetric distances between the trees calculated for the selected taxonomic groups by different algorithms: GyrA protein distances, 16S rRNA distances (depicted as 16S), OUP distances, whole genome sequence alignment distances (WGS), MAUVE, and CVTree. Dendrograms were constructed by the Neighbour Joining algorithm from the matrix of distances between the trees calculated by the treedist symmetric approach. (A) Bacillus, (B) corynebacteria, (C) enterobacteria, (D) lactobacilli, (E) Pseudomonas, and (F) mycobacteria. OUP indicates oligonucleotide usage pattern; rRNA, ribosomal RNA; WGS, whole genome sequence.
Figure 3.
Oligonucleotide usage pattern phylogenetic tree using the Prochlorococcus marinus subspecies data set. The inferred tree clearly separated the different light-adapted strains (LL, low light; HL, high light) as reported elsewhere.
Figure 4.
Plot showing the percentage of relocations of operational taxonomic units between clades in the trees inferred by OUP compared with the reference trees for the artificial data set produced by SimBac. The X axis depicts the sample sizes of the generated sets of sequences; each sample size comprised 10 data sets. The Y axis shows the percentages of relocations in the corresponding trees. The borders of the grey area depict the maximal and minimal percentages of relocations identified for the sets of sequences of the same sample size. The average percentage of relocations calculated for all sets of sequences is shown by the bold line.
Figure 5.
Pairwise distance plot of oligonucleotide usage pattern distances (X axis) against GyrA sequence distances (Y axis) calculated for pairs of organisms of the taxonomic group mycobacteria. Each pair of organisms on the plot is depicted by a dot. The distribution of dots was fitted to 2 logistic curves reflecting different rates of genomic evolutionary changes.
Figure 6.
Emission patterns of the codon-specific residues influenced by the states of the context residues. The diagrams of the emission pattern deviations are organized by location of the mutating residue at the first, second, and third codon positions. X axes depict the positions of the context residues relative to the mutating residues. Data for the preceding and following 10 to 4 residues were summed up in the 2 outermost categories. Y axes depict vector distances between the global emission pattern and the patterns calculated for each category. Bandwidth depicts the values AVR ± 2.5 × STD.
Figure 7.
SWPhylo Web-based user interface at http://swphylo.bi.up.ac.za/.
Figure 8.
SWPhylo output graphs visualize clustering of the sampled genomes (the taxonomic group lactobacilli in this example) along different logistic curves that may reflect different rates of evolutionary changes in their genomes. (A) Fitting of the distribution of oligonucleotide usage pattern distances against protein distances to 3 logistic curves. Each line represents one logistic cluster. Goodness of fit is reported by the notations VG (very good), Good, Mod (moderate), Bad, and VB (very bad). (B) Assignments of the tested genomes to different logistic clusters (zones). Evolution of the microorganisms may be explained by a series of evolutionary leaps (non-gradual increases in mutation rates in household proteins), the number of which corresponds to the number of intermediate zones on the plot.
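The symmetric distance used in Figures 1 and 2 to compare tree topologies counts the bipartitions (splits) of the taxon set that occur in one tree but not in the other. The sketch below is an illustrative reimplementation of that idea, not the treedist program itself. It assumes trees are given as nested Python tuples of leaf names rather than Newick files, and the function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node (a str is a leaf)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions induced by internal branches.

    Each split is stored as an unordered pair of leaf sets, so the
    result does not depend on where the tree happens to be rooted.
    """
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Splits with a single leaf on one side occur in every tree
        # and carry no topological information, so they are skipped.
        if len(below) >= 2 and len(other) >= 2:
            found.add(frozenset([below, other]))
        return below

    visit(tree)
    return found


def symmetric_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must be built on the same set of taxa")
    return len(splits(tree_a) ^ splits(tree_b))


# Two five-taxon trees that disagree on the grouping of A, B, and C.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(symmetric_distance(t1, t2))  # prints 2
```

A distance of 0 means the two trees share the same unrooted topology; larger values indicate more conflicting internal branches. Branch lengths are ignored in this sketch.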
