Gigascience. 2019 Mar 1;8(3):giy148.
doi: 10.1093/gigascience/giy148.

Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences

Chris-Andre Leimeister et al. Gigascience.

Abstract

Word-based or 'alignment-free' sequence comparison has become an active research area in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Here, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees for dozens of whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through GitHub: https://github.com/jschellh/ProtSpaM.

Keywords: Kimura; Wolbachia; alignment-free; amino-acid substitutions; distance method; micro-alignment; phylogeny; protein comparison; proteome; spaced words.

Figures

Figure 1:
Spaced-word histograms (spamograms) for different datasets. (A) and (B) are based on simulated insertion and deletion (indel)-free protein sequences with a total length of 1.6 × 10^6 amino-acid residues each and with 0.3 (A) and 0.75 (B) substitutions per position, respectively. (C) and (D) are from whole-proteome comparisons of plants, with (C) comparing Eucalyptus grandis with Capsella rubella and (D) comparing Gossypium raimondii with Carica papaya.
Figure 2:
Distances calculated by Prot-SpaM and four other alignment-free methods for pairs of simulated protein sequences, plotted against their distances calculated with the Kimura model. Error bars denote standard deviations. Note that Prot-SpaM estimates phylogenetic distances in terms of substitutions that have happened since two sequences evolved from their last common ancestor. The programs kmacs, CVTree, FFP, and ACS, by contrast, do not estimate distances in a rigorous way but rather use ad hoc measures of sequence dissimilarity that are not linear functions of the real distances. Also, the absolute values of these distance measures are rather arbitrary for these four other programs. We therefore normalized the distances calculated by kmacs, CVTree, FFP, and ACS such that they have a value of one for sequence pairs with a Kimura distance of one.
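
The normalization described in this caption amounts to dividing every raw dissimilarity value of a program by the value that program reports at a Kimura distance of one. The following is a minimal sketch of that rescaling, not the authors' plotting code. The function name `normalize_at_unit_distance`, the tolerance parameter, and the sample numbers are hypothetical, and the sketch assumes that one simulated sequence pair sits exactly at a Kimura distance of 1.0.

```python
def normalize_at_unit_distance(kimura, raw, tol=1e-9):
    """Rescale raw dissimilarities so the value at Kimura distance 1.0 becomes 1.0.

    kimura: reference distances of the simulated sequence pairs
    raw:    ad hoc dissimilarity values for the same pairs, in the same order
    Assumption: one pair has a Kimura distance of exactly 1.0 (within tol).
    """
    for k, r in zip(kimura, raw):
        if abs(k - 1.0) <= tol:
            if r == 0:
                raise ValueError("raw value at Kimura distance 1.0 is zero")
            return [x / r for x in raw]
    raise ValueError("no sequence pair with a Kimura distance of 1.0")


# Hypothetical values, chosen only to illustrate the rescaling.
kimura = [0.25, 0.5, 1.0, 1.5]
raw = [0.5, 1.0, 4.0, 5.0]
normalized = normalize_at_unit_distance(kimura, raw)
print(normalized)
```

The rescaling only aligns the curves at one point so that their shapes can be compared; it does not make a non-linear dissimilarity measure linear in the real distance.
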
Figure 3:
Distances calculated by Prot-SpaM for pairs of simulated protein sequences with a single binary pattern (m = 1, left) and with the default multiple-pattern option (m = 5, right). We performed 1,000 program runs for each value of m. The plot shows the average of the calculated distances; standard deviations are shown as error bars.
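
To make the term "binary pattern" concrete, the sketch below shows one way a spaced-word comparison can work: '1' marks a match position where two residues must be identical, '0' marks a don't-care position where they may differ. This page does not spell out the procedure, so the details are assumptions: the fraction of mismatches is counted at the don't-care positions, the Kimura correction d = -ln(1 - p - 0.2 p^2) turns that fraction into a distance, and the filtering step that gives Filtered Spaced Word Matches its name is left out. The function names, the pattern, and the toy sequences are hypothetical.

```python
import math


def mismatch_fraction(seq_a, seq_b, pattern):
    """Fraction of mismatching don't-care positions over all spaced-word matches.

    pattern is a string of '1' (match position) and '0' (don't-care position).
    Assumption: every pair of windows that agrees at all match positions counts;
    no score-based filtering of chance matches is applied here.
    """
    match_pos = [k for k, c in enumerate(pattern) if c == "1"]
    dont_care = [k for k, c in enumerate(pattern) if c == "0"]
    width = len(pattern)

    # Index the windows of seq_b by their residues at the match positions.
    index = {}
    for j in range(len(seq_b) - width + 1):
        key = tuple(seq_b[j + k] for k in match_pos)
        index.setdefault(key, []).append(j)

    mismatches = 0
    compared = 0
    for i in range(len(seq_a) - width + 1):
        key = tuple(seq_a[i + k] for k in match_pos)
        for j in index.get(key, []):
            for k in dont_care:
                compared += 1
                if seq_a[i + k] != seq_b[j + k]:
                    mismatches += 1
    if compared == 0:
        raise ValueError("no spaced-word matches found")
    return mismatches / compared


def kimura_distance(p):
    """Kimura's empirical protein distance for a mismatch fraction p."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        raise ValueError("mismatch fraction too large for the Kimura correction")
    return -math.log(inner)


# Toy example with a single pattern (comparable to m = 1 in the caption).
p = mismatch_fraction("ACDAED", "ACDAFD", "101")
d = kimura_distance(p)
print(p, d)
```

In this toy input, several of the window pairs that agree at the match positions are chance hits between unrelated positions, which inflates the mismatch fraction. A real implementation would need some way of discarding such background matches before estimating distances.
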
Figure 4:
Reference tree for our dataset Wolbachia I (top) and tree calculated with Prot-SpaM using whole-proteome sequences of the same taxa (bottom) (see main text for details). Topological differences between the two trees are shown in red in the Prot-SpaM tree.
Figure 5:
Reference tree (A) from [48] and tree calculated with Prot-SpaM with default parameters (B) for a set of 29 Escherichia coli and Shigella strains. Differences in the topologies between the two trees are marked in red.
Figure 6:
Phylogenetic trees for a large set of microbial taxa studied by Lang et al. [51]. (A) Maximum-likelihood tree constructed by Lang et al. based on a super alignment of 24 selected genes. (B) Tree constructed with our approach, as described here, for 813 taxa for which the proteomes are available in GenBank. (C) Tree constructed with our approach based on the proteins corresponding to the 24 genes selected by Lang et al. (D) Tree reconstructed using our program FSWM [33] on the 841 whole-genome sequences.
Figure 7:
Phylogenetic trees of plant taxa. (A) Reference tree from [50] and trees constructed with (B) the approach described here and by (C) ACS [21], (D) FFP [8], and (E) kmacs [22]. The original dataset contained 14 taxa, but proteomes could be downloaded through GenBank for only 11 of them. For completeness, we show the reference tree for all 14 taxa.

References

    1. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–13.
    2. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–74.
    3. Liu L, Xi Z, Wu S, et al. Estimating phylogenetic trees from genome-scale data. Annals of the New York Academy of Sciences. 2015;1360:36–53.
    4. Bininda-Emonds ORP. The evolution of supertrees. Trends in Ecology and Evolution. 2004;19:315–22.
    5. Chor B, Horn D, Levy Y, et al. Genomic DNA k-mer spectra: models and modalities. Genome Biology. 2009;10:R108.
