Clustering Genes of Common Evolutionary History

Mol Biol Evol. 2016 Jun;33(6):1590-605.

doi: 10.1093/molbev/msw038. Epub 2016 Feb 17.

Kevin Gori¹, Tomasz Suchan², Nadir Alvarez², Nick Goldman³, Christophe Dessimoz⁴

Affiliations

¹ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Campus, Hinxton, United Kingdom.
² Department of Ecology and Evolution, Biophore Building, UNIL-Sorge, University of Lausanne, Lausanne, Switzerland.
³ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Campus, Hinxton, United Kingdom. goldman@ebi.ac.uk, c.dessimoz@ucl.ac.uk
⁴ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Campus, Hinxton, United Kingdom; Department of Ecology and Evolution, Biophore Building, UNIL-Sorge, University of Lausanne, Lausanne, Switzerland; Department of Genetics, Evolution & Environment, University College London, London, United Kingdom; Department of Computer Science, University College London, London, United Kingdom; Centre for Integrative Genomics, University of Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics, Biophore, Lausanne, Switzerland. goldman@ebi.ac.uk, c.dessimoz@ucl.ac.uk

PMID: 26893301
PMCID: PMC4868114
DOI: 10.1093/molbev/msw038

Kevin Gori et al. Mol Biol Evol. 2016 Jun.


Abstract

Phylogenetic inference can potentially result in a more accurate tree using data from multiple loci. However, if the loci are incongruent (due to events such as incomplete lineage sorting or horizontal gene transfer), it can be misleading to infer a single tree. To address this, many previous contributions have taken a mechanistic approach, by modeling specific processes. Alternatively, one can cluster loci without assuming how these incongruencies might arise. Such "process-agnostic" approaches typically infer a tree for each locus and cluster these. There are, however, many possible combinations of tree distance and clustering methods; their comparative performance in the context of tree incongruence is largely unknown. Furthermore, because standard model selection criteria such as AIC cannot be applied to problems with a variable number of topologies, the issue of inferring the optimal number of clusters is poorly understood. Here, we perform a large-scale simulation study of phylogenetic distances and clustering methods to infer loci of common evolutionary history. We observe that the best-performing combinations are distances accounting for branch lengths followed by spectral clustering or Ward's method. We also introduce two statistical tests to infer the optimal number of clusters and show that they strongly outperform the silhouette criterion, a general-purpose heuristic. We illustrate the usefulness of the approach by 1) identifying errors in a previous phylogenetic analysis of yeast species and 2) identifying topological incongruence among newly sequenced loci of the globeflower fly genus Chiastocheta. We release treeCl, a new program to cluster genes of common evolutionary history (http://git.io/treeCl).
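As a rough illustration of why distances that account for branch lengths can carry more signal than topology alone, the sketch below contrasts a topology-only distance (Robinson-Foulds) with a branch-score distance. It is not treeCl's implementation: the dictionary representation of a tree, the function names, and the toy trees are assumptions made for this example, and only internal splits are compared.

```python
from math import sqrt

# Assumed representation: a tree is a dict mapping each internal split
# (the frozenset of taxa on its smaller side) to that branch's length.

def robinson_foulds(tree_a, tree_b):
    """Topology-only distance: number of splits found in exactly one tree."""
    return len(set(tree_a) ^ set(tree_b))

def branch_score(tree_a, tree_b):
    """Branch-length-aware distance: Euclidean distance between split-length
    vectors, with a split missing from a tree counted as length 0."""
    splits = set(tree_a) | set(tree_b)
    return sqrt(sum((tree_a.get(s, 0.0) - tree_b.get(s, 0.0)) ** 2 for s in splits))

def distance_matrix(trees, metric):
    """Symmetric matrix of pairwise distances between per-locus trees."""
    return [[metric(a, b) for b in trees] for a in trees]

# Hypothetical per-locus trees on taxa A-E.
AB, AC, DE = frozenset("AB"), frozenset("AC"), frozenset("DE")
locus1 = {AB: 0.10, DE: 0.20}
locus2 = {AB: 0.40, DE: 0.60}   # same topology as locus1, longer branches
locus3 = {AC: 0.10, DE: 0.20}   # different topology

trees = [locus1, locus2, locus3]
rf = distance_matrix(trees, robinson_foulds)
bs = distance_matrix(trees, branch_score)
print(rf[0][1], rf[0][2])   # RF sees no difference between locus1 and locus2
print(round(bs[0][1], 3), round(bs[0][2], 3))
```

Either matrix can then be handed to a clustering method; the branch-score matrix distinguishes loci that share a topology but differ in rate or branch lengths, which the Robinson-Foulds matrix cannot.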

Keywords: clustering; incomplete lineage sorting; incongruence; nonorthology; phylogeny; process-agnostic.


Figures

**Fig. 1.**
Overview of the clustering process. From left to right: input alignments are read; trees are inferred from the alignments; intertree distances are computed and used as the basis for clustering. Further procedures are used to re-estimate one tree for each cluster and to choose the optimal number of clusters—see text for details.
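The clustering step in this overview needs only the matrix of intertree distances. A minimal sketch of Ward's method, one of the clustering methods named in the abstract, is given below using the Lance-Williams update on squared distances. The function name and the toy matrix are hypothetical, and Ward's criterion formally assumes Euclidean distances, which tree distances need not satisfy.

```python
def ward_clusters(dist, k):
    """Agglomerative clustering with Ward's criterion.

    dist: symmetric matrix (list of lists) of pairwise distances.
    k:    number of clusters to stop at.
    Returns a sorted list of k sorted lists of item indices.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    # Squared distances between active clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > k:
        # Merge the pair of clusters with the smallest Ward distance.
        (a, b), d_ab = min(d2.items(), key=lambda item: item[1])
        n_a, n_b = len(members[a]), len(members[b])
        new = next_id
        next_id += 1
        for c in members:
            if c in (a, b):
                continue
            n_c = len(members[c])
            d_ac = d2[(min(a, c), max(a, c))]
            d_bc = d2[(min(b, c), max(b, c))]
            # Lance-Williams update for Ward's method (squared distances).
            d2[(c, new)] = ((n_a + n_c) * d_ac + (n_b + n_c) * d_bc
                            - n_c * d_ab) / (n_a + n_b + n_c)
        # Drop every entry that involves one of the merged clusters.
        d2 = {pair: v for pair, v in d2.items() if a not in pair and b not in pair}
        members[new] = members.pop(a) + members.pop(b)
    return sorted(sorted(m) for m in members.values())

# Toy distance matrix: items 0 and 1 are close, as are items 2 and 3.
toy = [[0, 1, 10, 11],
       [1, 0, 9, 10],
       [10, 9, 0, 1],
       [11, 10, 1, 0]]
print(ward_clusters(toy, 2))
```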

**Fig. 2.**
The relative performances of combinations of distance metric (varying over columns of panels) and clustering methods (shown by the colors of the lines), as measured by the variation of information metric (y-axes; higher values show a larger departure from the correct solution). Lines show the mean value obtained from 1,000 replicates, and the error bars show the standard error of the mean. Rows correspond to the experiments with a partition of uniformly sized clusters (A–C) and those with a partition of clusters of skewed sizes (D–F). In each individual panel, the x-axis represents the number of NNI rearrangements separating the underlying clusters, so that increasing values along this axis correlate with the clustering problem becoming easier.
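The variation of information between two partitions X and Y is H(X) + H(Y) - 2 I(X; Y); it is zero when the partitions are identical up to relabeling and grows as they diverge. A small sketch of the calculation follows. The function name and the example labelings are illustrative, and natural logarithms are assumed; the caption does not state which base was used.

```python
from collections import Counter
from math import log

def variation_of_information(labels_a, labels_b):
    """Variation of information (in nats) between two clusterings,
    each given as a sequence of cluster labels, one per item."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    size_a = Counter(labels_a)
    size_b = Counter(labels_b)
    overlap = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in overlap.items():
        # -p_ab * [log(p_ab / p_a) + log(p_ab / p_b)], with the 1/n factors cancelled
        vi -= (n_ab / n) * (log(n_ab / size_a[a]) + log(n_ab / size_b[b]))
    return vi

truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]    # same grouping, different label names
crossed = [0, 1, 0, 1]      # every inferred cluster mixes both true clusters
print(variation_of_information(truth, relabeled))
print(variation_of_information(truth, crossed))
```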

**Fig. 3.**
Comparison of the criteria used to determine the number of clusters on a single problem instance—in this example, data simulated for 60 loci belonging to 4 clusters, each of size 15, with the clusters’ trees separated by 1 SPR. As the proposed number of clusters increases, the likelihood increases, which is expected because of the greater number of free parameters in the model. (A) Permutation test: the improvement in likelihood for each additional cluster (red curve) is significantly greater than that observed for permuted data sets (green dots show the distribution of values over 100 permutations) until the comparison between four and five clusters is reached, correctly implying that the use of four clusters is optimal. (B) Parametric bootstrap test: again, the improvement for each additional cluster (red curve) is significantly greater than that for data sets simulated for one fewer cluster (blue dots) until the true number of clusters (four) has been reached. (C) Silhouette score: the general-purpose silhouette stopping criterion has its maximum at the true value of 4. We note that in this instance, comprising a single data set from one simulation design, the three methods agree on the true answer.
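For reference, the silhouette score used as the general-purpose baseline can be computed directly from a distance matrix and a cluster assignment; the number of clusters with the highest mean score is then chosen. The sketch below follows the standard definition and is not taken from treeCl; the function name and toy matrix are hypothetical.

```python
def silhouette_score(dist, labels):
    """Mean silhouette width for a clustering.

    dist:   symmetric matrix (list of lists) of pairwise distances.
    labels: cluster label for each item; at least two clusters are required.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(dist[i][j] for j in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)

toy = [[0, 1, 10, 11],
       [1, 0, 9, 10],
       [10, 9, 0, 1],
       [11, 10, 1, 0]]
good = silhouette_score(toy, [0, 0, 1, 1])
bad = silhouette_score(toy, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```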

**Fig. 4.**
Aggregate results for 400 difficult problem instances (left) and 400 moderate instances (right). The true number of clusters is 4. In both sets, our new stopping criteria (permutation and bootstrap) perform better than the general-purpose silhouette method.

**Fig. 5.**
(A) Distance of the spectral clustering of geodesic distances from the “true” clustering for varying levels of taxon occupancy. Just as with complete groups, partial groups converge to the correct assignment as the distance between clusters increases. When clusters differ from the underlying species tree by three SPRs or more, the effect of incomplete occupancy on performance is very slight. (B) Effect of incomplete taxon occupancy on cluster number selection criteria. Nonparametric permutation and parametric bootstrap recover the true number of clusters (four) in more than 90% of cases. The clusters were separated by three SPRs, and each locus had 40% mean taxon occupancy, which corresponds to the point on panel (A) indicated by the gray arrow.
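To make the occupancy figure concrete, the sketch below computes mean taxon occupancy under the assumption that a locus's occupancy is the fraction of the full taxon set present in its alignment. That definition, the function name, and the toy data are assumptions for illustration and are not quoted from the paper.

```python
def mean_taxon_occupancy(loci_taxa, all_taxa):
    """Average, over loci, of the fraction of all taxa present at each locus."""
    all_taxa = set(all_taxa)
    fractions = [len(set(taxa) & all_taxa) / len(all_taxa) for taxa in loci_taxa]
    return sum(fractions) / len(fractions)

# Hypothetical data set: five taxa, two loci with incomplete sampling.
taxa = ["A", "B", "C", "D", "E"]
loci = [["A", "B"], ["A", "B", "C", "D"]]
print(round(mean_taxon_occupancy(loci, taxa), 2))
```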

**Fig. 6.**
Phylogenetic trees inferred from the three clusters found in the yeast analysis with treeCl. The tree on the left is that inferred from the largest cluster of 307 loci. This matches the established species tree for these 18 species of yeast. The taxa highlighted in red (*Saccharomyces kudriavzevii*) and blue (*Saccharomyces kluyveri*) are those that are found on long branches in the trees inferred from clusters 2 and 3 (shown respectively right, upper, and right, lower). In these trees, the branches leading to *S. kudriavzevii* (in cluster 2) and *S. kluyveri* (in cluster 3) have been truncated so as to fit reasonably on the plot. Their full lengths are as indicated. Otherwise, branch lengths can be determined by the scale bars shown (all equal scales). Branch support measures were calculated using approximate Bayes (aBayes). Where aBayes branch supports are less than the maximum possible value of 100%, their values are indicated by a number to the right of the branch.

**Fig. 7.**
Visualization of application of treeCl to the yeast data set. The scatterplot shows the embedding, by MDS, of the geodesic distances between the 344 trees. Three clusters were found by spectral clustering: red circles indicate the largest cluster, with 307 members; the 37 remaining loci are indicated by blue triangles (cluster 2) and green squares (cluster 3). Loci belonging to the first, largest cluster are tightly grouped and yield the correct species phylogeny, whereas trees belonging to the second and third clusters are disparate and all have odd and inconsistent phylogenies as a result of incorrectly called orthology (see text for full details).

**Fig. 8.**
Likelihood improvement gained when partitioning the *Chiastocheta* data into increasing numbers of clusters (red points). Resampled distributions (boxplots) were generated using the permutation procedure. The number of clusters selected by the stopping criterion is indicated by the vertical dashed line. For two to eight clusters, the improvement is statistically significant; increasing to nine clusters is not.
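The stopping rule described in this caption can be sketched as a sequential test: keep adding clusters while the observed likelihood improvement is significantly larger than the improvements seen in permuted data, and stop at the first non-significant step. The code below is one possible reading, not treeCl's code; the function name, the significance level of 0.05, the add-one empirical p-value, and all numbers are assumptions.

```python
def select_num_clusters(observed_gain, null_gains, alpha=0.05):
    """Pick the number of clusters by sequential permutation tests.

    observed_gain[k]: likelihood improvement from moving from k-1 to k clusters.
    null_gains[k]:    the same improvement measured on each permuted data set.
    Returns the largest k reached before the first non-significant step.
    """
    k = 1
    while (k + 1) in observed_gain:
        obs = observed_gain[k + 1]
        null = null_gains[k + 1]
        # Empirical one-sided p-value with the usual add-one correction.
        p_value = (1 + sum(1 for g in null if g >= obs)) / (1 + len(null))
        if p_value >= alpha:
            break
        k += 1
    return k

# Made-up numbers: large gains up to four clusters, a negligible gain for a fifth.
observed = {2: 50.0, 3: 30.0, 4: 20.0, 5: 1.0}
null = {k: [i * 0.1 for i in range(100)] for k in observed}
print(select_num_clusters(observed, null))
```

With 100 permutations the smallest attainable p-value under this formula is 1/101, so a 0.05 threshold is reachable; a stricter threshold would need more permutations.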

**Fig. 9.**
Trees obtained when clustering RAD-seq data from globeflower flies of the genus *Chiastocheta*. The trees are drawn to scale, and are rooted at their midpoint, as the outgroup is unknown. Leaves are colored according to species membership. Branch support is indicated as follows: branches with support values below 0.9 are collapsed into multifurcations; those with support in the range 0.9–0.95 are colored gray; those with support >0.95 are colored black. Support values are calculated using approximate Bayes (Anisimova et al. 2011).


Cited by

DiscoSnp-RAD: de novo detection of small variants for RAD-Seq population genomics.
Gauthier J, Mouden C, Suchan T, Alvarez N, Arrigo N, Riou C, Lemaitre C, Peterlongo P. PeerJ. 2020 Jun 10;8:e9291. doi: 10.7717/peerj.9291. eCollection 2020. PMID: 32566401.
Principal component analysis and the locus of the Fréchet mean in the space of phylogenetic trees.
Nye TMW, Tang X, Weyenberg G, Yoshida R. Biometrika. 2017 Dec;104(4):901-922. doi: 10.1093/biomet/asx047. Epub 2017 Sep 27. PMID: 29422694.
A Semi-Automated SNP-Based Approach for Contaminant Identification in Biparental Polyploid Populations of Tropical Forage Grasses.
Martins FB, Moraes ACL, Aono AH, Ferreira RCU, Chiari L, Simeão RM, Barrios SCL, Santos MF, Jank L, do Valle CB, Vigna BBZ, de Souza AP. Front Plant Sci. 2021 Oct 22;12:737919. doi: 10.3389/fpls.2021.737919. eCollection 2021. PMID: 34745171.
SWPhylo - A Novel Tool for Phylogenomic Inferences by Comparison of Oligonucleotide Patterns and Integration of Genome-Based and Gene-Based Phylogenetic Trees.
Yu X, Reva ON. Evol Bioinform Online. 2018 Feb 20;14:1176934318759299. doi: 10.1177/1176934318759299. eCollection 2018. PMID: 29511354.
Unsupervised Clustering of Airway Tree Structures on High-Resolution CT: The MESA Lung Study.
Wysoczanski A, Angelini ED, Smith BM, Hoffman EA, Hiura GT, Sun Y, Barr RG, Laine AF. Proc IEEE Int Symp Biomed Imaging. 2021 Apr;2021:1568-1572. doi: 10.1109/isbi48211.2021.9434172. Epub 2021 May 25. PMID: 39399779.


References

1. Abby SS, Tannier E, Gouy M, Daubin V. 2010. Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests. BMC Bioinformatics 11:324.
2. Akaike H. 1974. A new look at the statistical model identification. IEEE Trans Automat Control. 19:716–723.
3. Ané C, Larget B, Baum DA, Smith SD, Rokas A. 2007. Bayesian estimation of concordance among gene trees. Mol Biol Evol. 24:412–426.
4. Anisimova M, Gil M, Dufayard JF, Dessimoz C, Gascuel O. 2011. Survey of branch support methods demonstrates accuracy, power, and robustness of fast likelihood-based approximation schemes. Syst Biol. 60:685–699.
5. Antoniak CE. 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann Stat. 2:1152–1174.
