PeerJ. 2019 Jun 4;7:e6984.
doi: 10.7717/peerj.6984. eCollection 2019.

CAM: an alignment-free method to recover phylogenies using codon aversion motifs

Justin B Miller et al. PeerJ. 2019.

Abstract

Background: Common phylogenomic approaches for recovering phylogenies are often time-consuming and require annotations for orthologous gene relationships that are not always available. In contrast, alignment-free phylogenomic approaches typically use structure and oligomer frequencies to calculate pairwise distances between species. We have developed an approach to quickly calculate distances between species based on codon aversion.

Methods: Utilizing a novel alignment-free character state, we present CAM, an alignment-free approach to recover phylogenies by comparing differences in codon aversion motifs (i.e., the set of unused codons within each gene) across all genes within a species. Synonymous codon usage is non-random and differs between organisms, between genes, and even within a single gene, and many genes do not use all possible codons. We report a comprehensive analysis of codon aversion within 229,742,339 genes from 23,428 species across all kingdoms of life, and we provide an alignment-free framework for its use in a phylogenetic construct. For each species, we first construct a set of codon aversion motifs spanning all genes within that species. We define the pairwise distance between two species, A and B, as one minus the number of shared codon aversion motifs divided by the total codon aversion motifs of the species, A or B, containing the fewest motifs. This approach allows us to calculate pairwise distances even when substantial differences in the number of genes or a high rate of divergence between species exists. Finally, we use neighbor-joining to recover phylogenies.
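The pairwise distance defined above can be illustrated with a minimal Python sketch. It is not the CAM source code: the function name `cam_distance`, the `Motif` type alias, and the decision to raise an error when a species has no motifs are assumptions made for illustration only.

```python
from typing import Set, Tuple

# A codon aversion motif: the alphabetized tuple of codons a gene never uses.
Motif = Tuple[str, ...]


def cam_distance(motifs_a: Set[Motif], motifs_b: Set[Motif]) -> float:
    """Return 1 - |A ∩ B| / min(|A|, |B|) for two species' motif sets."""
    smaller = min(len(motifs_a), len(motifs_b))
    if smaller == 0:
        # Assumption: the abstract does not say how an empty set is handled.
        raise ValueError("each species needs at least one motif")
    shared = len(motifs_a & motifs_b)
    return 1.0 - shared / smaller


# Toy data (invented, not from the paper): species A has 3 motifs,
# species B has 4, and they share 2.
species_a = {("AAA", "TTT"), ("CCC",), ("GGG", "TAG")}
species_b = {("AAA", "TTT"), ("CCC",), ("ACG",), ("TGA",)}
print(round(cam_distance(species_a, species_b), 3))  # 0.333
```

Dividing by the size of the smaller set means a species with few genes can still reach distance 0 from a larger relative if all of its motifs are shared, which is consistent with the claim that the measure tolerates large differences in gene count.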

Results: Using the Open Tree of Life and NCBI Taxonomy Database as expected phylogenies, our approach compares well, recovering phylogenies that largely match expected trees and are comparable to trees recovered using maximum likelihood and other alignment-free approaches. Our technique is much faster than maximum likelihood and similar in accuracy to other alignment-free approaches. Therefore, we propose that codon aversion be considered a phylogenetically conserved character that may be used in future phylogenomic studies.

Availability: CAM, documentation, and test files are freely available on GitHub at https://github.com/ridgelab/cam.

Keywords: Alignment-free; Codon aversion; Codon usage bias; Maximum likelihood; Phylogenetics; Phylogenomics; Phylogeny; Systematics; Taxonomy; Tree of life.

Conflict of interest statement

The authors declare there are no competing interests.

Figures

Figure 1. Flow charts for calculating the distance matrix and comparing the recovered phylogenies.
(A) Calculate Distance Matrix: Start with two FASTA files of the DNA coding sequences of two species. For each species, find the unused codons within each gene, alphabetize them, and make those codons into a tuple. Add the tuple to an unordered set for that species. The distance is one minus the number of tuples in the intersection of the two sets divided by the number of tuples in the smaller of the two original sets. (B) Recover and Compare Phylogenies: From the distance matrix, use neighbor-joining to recover a phylogeny. We do not use a model of evolution to compute distances because distance is a function of the number of shared codon aversion motifs within a species. This technique allows a fair comparison of diverse or unknown species. Using the compare method within the Environment for Tree Exploration (ETE3), we then compare the unrooted tree with the Open Tree of Life (OTL) and the NCBI taxonomy. Finally, we report the percentage of the phylogenies that overlap.
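Step (A) of the flow chart, going from a FASTA file to a per-species set of motifs, might look like the following Python sketch. The helper names (`read_fasta`, `aversion_motif`, `species_motifs`) are hypothetical, and several details are assumptions the caption does not settle: all 64 codons (including stop codons) are considered, a trailing partial codon is ignored, and codons containing ambiguous bases are skipped.

```python
from itertools import product
from typing import Iterator, Set, Tuple

# Assumption: the motif is taken over all 64 DNA codons, stop codons included.
ALL_CODONS = frozenset("".join(c) for c in product("ACGT", repeat=3))


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def aversion_motif(cds: str) -> Tuple[str, ...]:
    """Return the alphabetized tuple of codons that never occur in-frame."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    used = {cds[i:i + 3] for i in range(0, usable, 3)}
    # Codons with ambiguous bases are not in ALL_CODONS, so they drop out here.
    return tuple(sorted(ALL_CODONS - used))


def species_motifs(fasta_text: str) -> Set[Tuple[str, ...]]:
    """Collect one motif per gene into an unordered set for the species."""
    return {aversion_motif(seq) for _, seq in read_fasta(fasta_text)}


# Toy input (invented): g1 and g2 use the same codons, so they share a motif.
toy = ">g1\nATGAAA\n>g2\nATG\nAAA\n>g3\nATGCCC\n"
print(len(species_motifs(toy)))  # 2
```

Because the motifs are stored in a set, genes with identical unused-codon tuples collapse to a single entry, so the set size counts distinct motifs rather than genes.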
Figure 2. A flow chart depicting the process getOTLtree takes to infer a subtree phylogeny from the OTL.
All steps are done with a single command at runtime.
