Mol Biol Evol. 2021 Aug 23;38(9):4025-4038.
doi: 10.1093/molbev/msab151.

Phylogenomic Subsampling and the Search for Phylogenetically Reliable Loci

Nicolás Mongiardino Koch. Mol Biol Evol.

Abstract

Phylogenomic subsampling is a procedure by which small sets of loci are selected from large genome-scale data sets and used for phylogenetic inference. This step is often motivated either by computational limitations associated with the use of complex inference methods or by the need to test the robustness of phylogenetic results by discarding loci that are deemed potentially misleading. Although many alternative methods of phylogenomic subsampling have been proposed, little effort has gone into comparing their behavior across different data sets. Here, I calculate multiple gene properties for a range of phylogenomic data sets spanning animal, fungal, and plant clades, uncovering a remarkable predictability in their patterns of covariance. I also show how these patterns provide a means for ordering loci by both their rate of evolution and their relative phylogenetic usefulness. This method of retrieving phylogenetically useful loci is found to be among the top performing when compared with alternative subsampling protocols. Relatively common approaches such as minimizing potential sources of systematic bias or increasing the clock-likeness of the data are found to fare worse than selecting loci at random. Likewise, the general utility of rate-based subsampling is found to be limited: loci evolving at both low and high rates are among the least effective, and even those evolving at optimal rates can still widely differ in usefulness. This study shows that many common subsampling approaches introduce unintended effects in off-target gene properties and proposes an alternative multivariate method that simultaneously optimizes phylogenetic signal while controlling for known sources of bias.
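
The idea of ordering loci along axes of covarying gene properties can be illustrated with a small principal component analysis. The following is a minimal sketch only, not the author's pipeline: the three property names, the six loci, and all values are hypothetical, and the components are extracted by plain power iteration on the correlation matrix so that the example needs nothing beyond the Python standard library. Because the sign of a principal component is arbitrary, each axis is oriented by an assumed reference property before loci are sorted.

```python
import math

# Hypothetical toy table: one row per locus, one column per gene property.
# Names and values are illustrative assumptions, not data from the study.
PROPERTIES = ["rate", "saturation", "rf_similarity"]
LOCI = {
    "g1": [0.2, 0.10, 0.55],
    "g2": [0.5, 0.22, 0.80],
    "g3": [0.9, 0.35, 0.60],
    "g4": [1.4, 0.52, 0.85],
    "g5": [2.0, 0.70, 0.50],
    "g6": [2.8, 0.95, 0.75],
}

def standardize(rows):
    """Center each column on 0 and scale it to unit sample variance."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def leading_eigenpair(m, iters=500):
    """Power iteration: dominant eigenvalue and unit-length eigenvector."""
    v = [1.0 + 0.1 * i for i in range(len(m))]
    for _ in range(iters):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(v, matvec(m, v)))
    return lam, v

def deflate(m, lam, v):
    """Remove an extracted component so the next one becomes dominant."""
    p = len(m)
    return [[m[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]

def orient(v, index):
    """Flip the axis if needed so the reference property loads positively."""
    return v if v[index] >= 0 else [-x for x in v]

names = list(LOCI)
z = standardize([LOCI[name] for name in names])
corr = correlation_matrix(z)
lam1, pc1 = leading_eigenpair(corr)
lam2, pc2 = leading_eigenpair(deflate(corr, lam1, pc1))
pc1 = orient(pc1, PROPERTIES.index("rate"))
pc2 = orient(pc2, PROPERTIES.index("rf_similarity"))

# Locus scores are projections of the standardized rows onto each axis.
score1 = dict(zip(names, matvec(z, pc1)))
score2 = dict(zip(names, matvec(z, pc2)))
by_pc1 = sorted(names, key=score1.get)                # slow to fast (assumed)
by_pc2 = sorted(names, key=score2.get, reverse=True)  # assumed most useful first
print("PC 1 order:", by_pc1)
print("PC 2 order:", by_pc2)
```

In this toy table the first axis is dominated by the two rate-like properties, so sorting on it approximates a rate ordering, while the second axis mostly tracks the assumed usefulness proxy. Which properties load on which axis in a real data set has to be read from the loadings rather than assumed.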

Keywords: molecular evolution; phylogenetic inference; phylogenetic signal; phylogenomics; systematic biases.

Figures

Fig. 1.
Gene properties covary in predictable ways, revealing underlying patterns of evolution that are shared by all phylogenomic data sets. The dendrogram shows that the eigenvectors of PC axes can be clustered into two major groups, labeled as patterns A and B. While pattern A is generally captured by PC 1 (green icons) and pattern B by PC 2 (orange icons), the hexapod and phasmatodean data sets are inverted. The histograms on the bottom show the distribution of loadings across variables. Results using k-means clustering are shown in supplementary figure S3, Supplementary Material online.
Fig. 2.
Rate of evolution is the primary factor driving differences in gene properties. Scores of loci along PCs 1 (A) and 2 (B) were correlated against the log-transformed harmonic means of site rates. Blue lines correspond to LOESS regressions, and Spearman’s rank correlation coefficients (ρ) are shown in each plot. Clade icons are as in figure 1; the deviating hexapod and phasmatodean data sets are highlighted in red. Results using a tree-based estimate of evolutionary rates are shown in supplementary figure S4, Supplementary Material online.
Fig. 3.
Comparison of the performance of alternative subsampling strategies. (A) Distribution of ranks attained by different strategies (lower ranks represent better results). Two criteria for selecting adequate strategies are highlighted: those whose median ranks are lower than randomly chosen loci (grey background), and those that outperform these in more than half of the data sets (yellow bars). The proportion of times a given strategy ranks better than random loci is shown at the bottom. Results correspond to matrices of 250 loci; those for 50 loci are shown in supplementary figure S7, Supplementary Material online. (B) NMDS of pair-wise distances between strategies, representing the average frequency with which they share loci (smaller distances represent higher probabilities of targeting the same loci). Average RF similarity (orange lines) is overlaid as a smooth surface. PC 2 defines an axis that traverses the RF similarity gradient, whereas PC 1 (and other rate proxies) sample genes along a perpendicular axis that follows an isocline.
Fig. 4.
Detection of outlier genes using multiple gene properties in two exemplary data sets, Lepidoptera (left) and Pseudoscorpiones (right). Plots show the PC axes built from the entire data sets, with the genes considered outliers shown in red. The topology of the largest outlier (highlighted with a black border) is plotted.
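
The comparison described in the caption of figure 2, where locus scores along a PC axis are correlated against the log-transformed harmonic mean of site rates using Spearman's rank correlation, can be sketched as follows. This is an illustrative example under stated assumptions: the per-site rates and PC scores are hypothetical values, the natural logarithm is assumed because the caption does not state a base, and the rank correlation is written out by hand rather than taken from the software used in the study.

```python
import math
from statistics import harmonic_mean

# Hypothetical per-site rate estimates for four loci (illustrative values only).
SITE_RATES = {
    "g1": [0.1, 0.2, 0.4],
    "g2": [0.5, 1.0, 1.0],
    "g3": [1.0, 2.0, 4.0],
    "g4": [2.0, 4.0, 4.0],
}
# Hypothetical scores of the same loci along one PC axis.
PC_SCORES = {"g1": -1.8, "g2": 0.3, "g3": -0.2, "g4": 1.7}

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ssx * ssy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

loci = list(SITE_RATES)
# One rate summary per locus: log of the harmonic mean of its site rates.
log_rates = [math.log(harmonic_mean(SITE_RATES[name])) for name in loci]
scores = [PC_SCORES[name] for name in loci]
rho = spearman(log_rates, scores)
print(f"Spearman's rho = {rho:.2f}")
```

Spearman's coefficient depends only on ranks, so the log transformation does not change its value here; the transformation matters for the plotted axis and for the LOESS fit, neither of which is reproduced in this sketch.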
