Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences

Gigascience. 2019 Mar 1;8(3):giy148. doi: 10.1093/gigascience/giy148.

Chris-Andre Leimeister¹, Jendrik Schellhorn¹, Svenja Dörrer¹, Michael Gerth², Christoph Bleidorn³,⁴, Burkhard Morgenstern¹,⁵

Affiliations

¹ University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany.
² Institute for Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, L69 7ZB Liverpool, UK.
³ University of Göttingen, Department of Animal Evolution and Biodiversity, Untere Karspüle 2, 37073 Göttingen, Germany.
⁴ Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), 28006 Madrid, Spain.
⁵ Göttingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077 Göttingen.

PMID: 30535314
PMCID: PMC6436989
DOI: 10.1093/gigascience/giy148

Chris-Andre Leimeister et al. Gigascience. 2019.

Abstract

Word-based or 'alignment-free' sequence comparison has become an active research area in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Here, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees for dozens of whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through GitHub: https://github.com/jschellh/ProtSpaM.

Keywords: Kimura; Wolbachia; alignment-free; amino-acid substitutions; distance method; micro-alignment; phylogeny; protein comparison; proteome; spaced words.

Figures

**Figure 1:**
Spaced-word histograms (spamograms) for different datasets. **(A)** and **(B)** are based on simulated insertion and deletion (indel)-free protein sequences with a total length of 1.6 × 10⁶ amino-acid residues each and with 0.3 **(A)** and 0.75 **(B)** substitutions per position, respectively. **(C)** and **(D)** are from whole-proteome comparisons of plants, **(C)** comparing *Eucalyptus grandis* with *Capsella rubella* and **(D)** comparing *Gossypium raimondii* with *Carica papaya*.

**Figure 2:**
Distances estimated by Prot-SpaM and four other alignment-free methods for pairs of simulated protein sequences, plotted against their distances calculated with the Kimura model. Error bars denote standard deviations. Note that Prot-SpaM estimates phylogenetic distances in terms of substitutions that have happened since two sequences evolved from their last common ancestor. The programs kmacs, CVTree, FFP, and ACS, by contrast, do not estimate distances in a rigorous way but rather use *ad hoc* measures of sequence dissimilarity that are not linear functions of the real distances. Also, the absolute values of these distance measures are rather arbitrary for these four other programs. We therefore normalized the distances calculated by kmacs, CVTree, FFP, and ACS such that they have a value of one for sequence pairs with a *Kimura* distance of one.
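
The rescaling described in this caption, together with Kimura's widely used approximate correction for protein distances, d = -ln(1 - p - 0.2p²), can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the page does not state which form of the Kimura model was used, and scaling by the raw value at the sample point closest to a Kimura distance of one is an assumption.

```python
import math

def kimura_distance(p):
    """Kimura's approximate protein distance from the fraction p of differing sites.

    Uses d = -ln(1 - p - 0.2 * p^2); undefined once the argument is <= 0.
    """
    arg = 1.0 - p - 0.2 * p * p
    if arg <= 0.0:
        raise ValueError("p is too large for the Kimura correction")
    return -math.log(arg)

def normalize_to_reference(kimura, raw, ref=1.0):
    """Rescale raw dissimilarities so the value at Kimura distance `ref` becomes 1.

    Assumption: the raw value at the sample point closest to `ref`
    is taken as the scale factor.
    """
    idx = min(range(len(kimura)), key=lambda i: abs(kimura[i] - ref))
    scale = raw[idx]
    if scale == 0:
        raise ValueError("cannot normalize by a zero distance")
    return [r / scale for r in raw]
```

A pure rescaling of this kind only changes the units of an *ad hoc* measure so that the curves can share one plot; it does not make a non-linear measure linear in the true distance.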

**Figure 3:**
Distances calculated by Prot-SpaM for pairs of simulated protein sequences with a single binary pattern (m = 1, left) and with the default multiple-pattern option (m = 5, right). We performed 1,000 program runs for each value of m. The plot shows the average of the calculated distances; standard deviations are shown as error bars.
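
To make the notion of a binary pattern concrete, the sketch below finds spaced-word matches between two sequences and counts mismatches at the don't-care positions. It is a simplified illustration under stated assumptions, not Prot-SpaM itself: the function names are hypothetical, a pattern is written as a string in which `1` marks a match position and `0` a don't-care position, and the filtering step implied by the name Filtered Spaced Word Matches is omitted.

```python
from collections import defaultdict

def spaced_word_matches(seq_a, seq_b, pattern):
    """Yield (i, j) where seq_a[i:] and seq_b[j:] agree at every '1' position of pattern."""
    match_pos = [k for k, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    # Index every spaced word of seq_a by its residues at the match positions.
    index = defaultdict(list)
    for i in range(len(seq_a) - span + 1):
        key = "".join(seq_a[i + k] for k in match_pos)
        index[key].append(i)
    # Look up every spaced word of seq_b in that index.
    for j in range(len(seq_b) - span + 1):
        key = "".join(seq_b[j + k] for k in match_pos)
        for i in index.get(key, ()):
            yield i, j

def dont_care_mismatch_fraction(seq_a, seq_b, pattern):
    """Fraction of mismatching residue pairs at the '0' positions of all matches."""
    dc_pos = [k for k, c in enumerate(pattern) if c == "0"]
    total = mismatches = 0
    for i, j in spaced_word_matches(seq_a, seq_b, pattern):
        for k in dc_pos:
            total += 1
            mismatches += seq_a[i + k] != seq_b[j + k]
    return mismatches / total if total else float("nan")
```

For example, with the pattern `"101"` the sequences `"ACDEF"` and `"AQDEF"` share two spaced-word matches, at positions (0, 0) and (2, 2), and one of the two don't-care pairs differs. In this sketch, a multiple-pattern run would simply repeat the computation with m different patterns and pool the counts.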

**Figure 4:**
Reference tree for our dataset *Wolbachia I* (top) and tree calculated with Prot-SpaM using whole-proteome sequences of the same taxa (bottom) (see main text for details). Topological differences between the two trees are shown in red in the Prot-SpaM tree.

**Figure 5:**
Reference tree **(A)** from [48] and tree calculated with Prot-SpaM with default parameters **(B)** for a set of 29 *Escherichia coli* and *Shigella* strains. Differences in the topologies between the two trees are marked in red.

**Figure 6:**
Phylogenetic trees for a large set of microbial taxa studied by Lang et al. [51]. **(A)** Maximum-likelihood tree constructed by Lang et al. based on a super alignment of 24 selected genes. **(B)** Tree constructed with our approach, as described here, for 813 taxa for which the proteomes are available in GenBank. **(C)** Tree constructed with our approach based on the proteins corresponding to the 24 genes selected by Lang et al. **(D)** Tree reconstructed using our program FSWM [33] on the 841 whole-genome sequences.

**Figure 7:**
Phylogenetic trees of plant taxa. **(A)** Reference tree from [50] and trees constructed with **(B)** the approach described here and with **(C)** ACS [21], **(D)** FFP [8], and **(E)** kmacs [22]. The original dataset contained 14 taxa, but proteomes could be downloaded through GenBank for only 11 of them. For completeness, we show the reference tree for all 14 taxa.

Cited by

AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data.
Silva JM, Qi W, Pinho AJ, Pratas D. Gigascience. 2022 Dec 28;12:giad101. doi: 10.1093/gigascience/giad101. Epub 2023 Dec 13. PMID: 38091509.
SWeeP: representing large biological sequences datasets in compact vectors.
De Pierri CR, Voyceik R, Santos de Mattos LGC, Kulik MG, Camargo JO, Repula de Oliveira AM, de Lima Nichio BT, Marchaukoski JN, da Silva Filho AC, Guizelini D, Ortega JM, Pedrosa FO, Raittz RT. Sci Rep. 2020 Jan 9;10(1):91. doi: 10.1038/s41598-019-55627-4. PMID: 31919449.
Evolutionary Insight into the Trypanosomatidae Using Alignment-Free Phylogenomics of the Kinetoplast.
Kaufer A, Stark D, Ellis J. Pathogens. 2019 Sep 18;8(3):157. doi: 10.3390/pathogens8030157. PMID: 31540520.
The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances.
Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B. PLoS One. 2020 Feb 10;15(2):e0228070. doi: 10.1371/journal.pone.0228070. eCollection 2020. PMID: 32040534.
CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model.
Wang T, Yu ZG, Li J. Front Microbiol. 2024 Mar 20;15:1339156. doi: 10.3389/fmicb.2024.1339156. eCollection 2024. PMID: 38572227.

References

1. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–13.
2. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–74.
3. Liu L, Xi Z, Wu S, et al. Estimating phylogenetic trees from genome-scale data. Annals of the New York Academy of Sciences. 2015;1360:36–53.
4. Bininda-Emonds ORP. The evolution of supertrees. Trends in Ecology and Evolution. 2004;19:315–22.
5. Chor B, Horn D, Levy Y, et al. Genomic DNA k-mer spectra: models and modalities. Genome Biology. 2009;10:R108.
