Mol Biol Evol. 2025 Mar 5;42(3):msaf032.
doi: 10.1093/molbev/msaf032.

A Tale of Too Many Trees: A Conundrum for Phylogenetic Regression

Richard Adams et al. Mol Biol Evol.

Abstract

Just exactly which tree(s) should we assume when testing evolutionary hypotheses? This question has plagued comparative biologists for decades. Though all phylogenetic comparative methods require input trees, we seldom know with certainty whether even a perfectly estimated tree (if this is possible in practice) is appropriate for our studied traits. Yet, we also know that phylogenetic conflict is ubiquitous in modern comparative biology, and we are still learning about its dangers when testing evolutionary hypotheses. Here, we investigate the consequences of tree-trait mismatch for phylogenetic regression in the presence of gene tree-species tree conflict. Our simulation experiments reveal excessively high false positive rates for mismatched models with both small and large trees, simple and complex traits, and known and estimated phylogenies. In some cases, we find evidence of a directionality of error: assuming a species tree for traits that evolved according to a gene tree sometimes fares worse than the opposite. We also explored the impacts of tree choice using an expansive, cross-species gene expression dataset as an arguably "best-case" scenario in which one may have a better chance of matching tree with trait. Offering a potential path forward, we found promise in the application of a robust estimator as a potential, albeit imperfect, solution to some issues raised by tree mismatch. Collectively, our results emphasize the importance of careful study design for comparative methods, highlighting the need to fully appreciate the role of accurate and thoughtful phylogenetic modeling.

Keywords: Brownian motion; comparative biology; continuous traits; phylogeny.

Conflict of interest statement

Conflict of Interest: The authors of this study do not report any conflicts of interest.

Figures

Fig. 1.
Illustrating the phylogenetic conundrum. Examples showing species tree and gene tree pairs (a and b) and their associated data generating models (center) for scenarios in which both traits are generated according to the species tree S (top row) or the gene tree G (bottom row). Thus, the true generating tree is shown on the left in (a) and (b). Branch colors (a and b) illustrate values of the response trait y when mapped to the respective tree using the contMap function from phytools. Two examples (random replicates) of matched phylogenetic regression are shown for SS (c) and GG (d), in which the same tree was used for both generating the trait data and computing PICs, and two examples (random replicates) of mismatched regression are shown for SG (e) and GS (f), in which different trees were used for generating the trait data and computing PICs.
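The matched and mismatched designs described in this caption can be sketched in a few lines. The following is a minimal illustration, not the authors' simulation pipeline: the two 4-taxon trees, their branch lengths, the helper names, and the number of replicates are all hypothetical choices. It simulates two independent Brownian-motion traits on a generating tree, computes phylogenetically independent contrasts (PICs) on an assumed tree, regresses the contrasts through the origin, and counts how often the slope is declared significant at the 5% level.

```python
import math
import random

def join(left, left_len, right, right_len):
    """Build an internal node from two (child, branch length) pairs."""
    return ((left, left_len), (right, right_len))

# Two hypothetical ultrametric 4-taxon trees that differ only in topology.
SPECIES_TREE = join(join(join("A", 1.0, "B", 1.0), 1.0, "C", 2.0), 1.0, "D", 3.0)
GENE_TREE = join(join(join("A", 1.0, "C", 1.0), 1.0, "B", 2.0), 1.0, "D", 3.0)

def simulate_bm(node, value, rng, tips):
    """Evolve a trait by Brownian motion (rate 1) from `value` down to the tips."""
    if isinstance(node, str):
        tips[node] = value
        return tips
    for child, length in node:
        simulate_bm(child, value + rng.gauss(0.0, math.sqrt(length)), rng, tips)
    return tips

def contrasts(node, trait):
    """Return (node value, extra branch length, list of standardized contrasts)."""
    if isinstance(node, str):
        return trait[node], 0.0, []
    (c1, l1), (c2, l2) = node
    x1, e1, k1 = contrasts(c1, trait)
    x2, e2, k2 = contrasts(c2, trait)
    v1, v2 = l1 + e1, l2 + e2
    value = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)
    return value, v1 * v2 / (v1 + v2), k1 + k2 + [(x1 - x2) / math.sqrt(v1 + v2)]

def slope_t_statistic(x, y):
    """t statistic for the slope of a least-squares regression through the origin."""
    sxx = sum(a * a for a in x)
    slope = sum(a * b for a, b in zip(x, y)) / sxx
    rss = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    return slope / math.sqrt(rss / (len(x) - 1) / sxx)

def false_positive_rate(generating_tree, assumed_tree, reps=2000, seed=1):
    """Share of replicates with a 'significant' slope although the traits are independent."""
    rng = random.Random(seed)
    t_crit = 4.303  # two-sided 5% critical value of Student's t with 2 df (3 contrasts)
    hits = 0
    for _ in range(reps):
        x = simulate_bm(generating_tree, 0.0, rng, {})
        y = simulate_bm(generating_tree, 0.0, rng, {})
        cx = contrasts(assumed_tree, x)[2]
        cy = contrasts(assumed_tree, y)[2]
        hits += abs(slope_t_statistic(cx, cy)) > t_crit
    return hits / reps

if __name__ == "__main__":
    print("generated on S, assumed S:", false_positive_rate(SPECIES_TREE, SPECIES_TREE))
    print("generated on G, assumed S:", false_positive_rate(GENE_TREE, SPECIES_TREE))
```

With the matched tree, the rejection rate should sit near the nominal 0.05. How far the mismatched rate departs from it depends on the trees chosen, so this toy case should not be read as reproducing the paper's results.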
Fig. 2.
Tree mismatch exacerbates evidence of false trait associations with phylogenetic regression. Estimates of the false positive rate (a to c), P-value distributions for GS (d to f), P-value distributions for SG (g to i), and means and standard deviations of Robinson–Foulds (j to l) and Hellinger (m to o) distances between gene trees and species trees from simulations including 10 species (top row), 100 species (middle row), and 1,000 species (bottom row) for birth–death simulations with birth rate λ, death rate λ/2, and root age of 10 coalescent units. The two traits were statistically independent (β=0) for all simulations. Dashed horizontal lines mark the commonly used false positive rate α=0.05 in (a) to (c), median P-values taken from matched GG scenarios in (d) to (f), and median P-values from matched SS scenarios in (g) to (i). The y-axis ranges from 0 to 1 in all panels.
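For readers unfamiliar with the Robinson–Foulds distance mentioned in this caption, here is a minimal sketch of the standard definition: the number of nontrivial bipartitions (splits) present in one tree but not the other. The nested-tuple tree encoding, the function names, and the two example topologies are assumptions made for illustration. Normalizing by 2(n − 3) assumes fully resolved trees on the same n ≥ 4 tips, and the caption does not say whether the paper normalizes in this way.

```python
def collect_clades(tree, found):
    """Return the tip set below `tree`, recording every internal clade in `found`."""
    if isinstance(tree, str):
        return frozenset([tree])
    tips = frozenset().union(*(collect_clades(child, found) for child in tree))
    found.add(tips)
    return tips

def splits(tree):
    """Nontrivial bipartitions of the tips induced by the internal branches."""
    found = set()
    all_tips = collect_clades(tree, found)
    return {
        frozenset([clade, all_tips - clade])
        for clade in found
        if 1 < len(clade) < len(all_tips) - 1
    }

def robinson_foulds(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

def normalized_rf(tree_a, tree_b):
    """RF distance divided by its maximum, 2(n - 3), for binary trees on n tips."""
    n_tips = len(collect_clades(tree_a, set()))
    return robinson_foulds(tree_a, tree_b) / (2 * (n_tips - 3))

# Hypothetical 5-taxon example: the gene tree swaps the positions of B and C.
SPECIES = ((("A", "B"), "C"), ("D", "E"))
GENE = ((("A", "C"), "B"), ("D", "E"))

if __name__ == "__main__":
    print(robinson_foulds(SPECIES, GENE))  # 2
    print(normalized_rf(SPECIES, GENE))    # 0.5
```

The two trees share the split ABC|DE and disagree on AB|CDE versus AC|BDE, which gives a distance of 2 out of a possible 4.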
Fig. 3.
Tree mismatch misleads phylogenetic regression for traits with more complex architectures. Estimates of the false positive rates from simulations including 10 species (top row), 100 species (middle row), and 1,000 species (bottom row) for birth–death simulations with birth rate λ, death rate λ/2, and root age of 10 coalescent units for mismatched species tree regression (red lines), mismatched gene tree regression (pink lines), and matched gene tree sets (black lines). Results shown for traits encoded by two loci (a to c), five loci (d to f), 10 loci (g to i), and 100 loci (j to l). The two traits were statistically independent (β=0) for all simulations. Horizontal dashed lines mark the commonly used false positive rate of α=0.05.
Fig. 4.
A case study of the impacts of both tree mismatch and tree estimation error on phylogenetic regression. Depicted are false positive rates of the two mismatched scenarios (GS and SG) and the two matched scenarios (GG and SS) when regression was performed with known trees (a) and estimated trees (b) for n=10 species. Difference between log-scaled P-values obtained with known and estimated trees (c).
Fig. 5.
Tree mismatch influences power to detect true trait associations. Estimates of true positive rates for 10 species (top row), 100 species (middle row), and 1,000 species (bottom row) for birth–death simulations with birth rate λ, death rate λ/2, and root age of 10 coalescent units. Results are shown for β=0.25 (a to c), β=0.50 (d to f), β=0.75 (g to i), and β=1.0 (j to l).
Fig. 6.
Tree choice matters when testing female–male expression associations across species. Results shown across 22 autosomes for heart tissue expression measurements of 4,068 genes with measurable expression across species, with computed distance statistics dS (inner track a), dN (middle track b), and dA (outer track c) based on L2-based phylogenetic regression. Empirical case studies comparing phylogenetic regression based on the species tree (d), nucleotide gene tree (e), and amino acid gene tree (f) are shown for analyses with anomalously high dS, dN, and dA, respectively. Colors of points in the circos plot (a to c) indicate relative level of divergent P-values, with blue indicating not significant, black indicating P-value <0.05, and red indicating strong outliers with P-value <1.229×10⁻⁵ after applying Bonferroni correction (Bonferroni 1936). Points depicted as gray stars indicate evidence of singular phylogenetic outliers found in specific analyses (d to f).
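The Bonferroni threshold quoted in this caption matches the usual α = 0.05 divided by the 4,068 genes tested. The short check below assumes those are the two inputs; the variable names are illustrative.

```python
alpha = 0.05      # family-wise significance level
n_tests = 4068    # number of genes tested, as stated in the caption

threshold = alpha / n_tests
print(f"{threshold:.3e}")  # 1.229e-05
```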
Fig. 7.
Venn diagrams displaying the percentage of overlap in statistically significant genes for brain (a), heart (b), and kidney (c) expression levels in a mammalian dataset based on phylogenetic regression applied by assuming the species tree (left circles), nucleotide gene tree (top circles), or amino acid gene tree (right circles). Colors indicate the relative percentage of statistically significant genes across analyses.
Fig. 8.
Violin plots summarizing the distributions of model fit measured by log-likelihood for phylogenetic regression applied to gene expression from a mammalian dataset. Results shown across tissues (heart, brain, and kidney) and the three regression strategies that assume either the species tree, nucleotide gene tree, or amino acid gene tree.
Fig. 9.
Can robust estimators help? Results showing estimated false positive rates when using known trees with 10 species (a), 100 species (b), and 1,000 species (c) for robust L1-based regression (dashed lines) alongside standard L2-based regression (solid lines) under birth–death simulations with birth rate λ, death rate λ/2, and root age of 10 coalescent units. Estimated false positive rates are also shown for L1- and L2-based regression with estimated trees for our simulation case study with n=10 species (d). Horizontal solid gray lines mark the typically accepted false positive rate of 0.05.
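To give a sense of why an L1-based fit can be less sensitive to outlying contrasts than an L2-based fit, the sketch below compares the two for a regression through the origin. This is one simple form of least-absolute-deviations regression, chosen for illustration; it is not necessarily the robust estimator used in the paper, and the data are made up. For a no-intercept model, the slope minimizing the sum of absolute residuals is the weighted median of the ratios y/x with weights |x|.

```python
def l2_slope(x, y):
    """Least-squares slope for a regression through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def l1_slope(x, y):
    """Least-absolute-deviations slope: weighted median of y/x with weights |x|."""
    pairs = sorted((b / a, abs(a)) for a, b in zip(x, y) if a != 0)
    half = sum(weight for _, weight in pairs) / 2.0
    running = 0.0
    for ratio, weight in pairs:
        running += weight
        if running >= half:
            return ratio

# Hypothetical contrasts: four points on the line y = x plus one large outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 4.0, 50.0]

if __name__ == "__main__":
    print(l2_slope(x, y))  # about 5.09, pulled upward by the outlier
    print(l1_slope(x, y))  # 1.0
```

A single extreme point moves the L2 slope from 1 to roughly 5, while the L1 slope stays at 1.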
Fig. 10.
Can robust estimators help with tree mismatch? Empirical examples from the mammalian gene expression data contrasting differences between standard L2-based (black lines) and robust L1-based (blue dashed lines) regression using the species tree (top row), nucleotide gene trees (middle row), and amino acid gene trees (bottom row).

References

    1. Adams DC. Comparing evolutionary rates for different phenotypic traits on a phylogeny using likelihood. Syst Biol. 2013:62(2):181–192. 10.1093/sysbio/sys083.
    2. Adams RH, Blackmon H, DeGiorgio M. Of traits and trees: probabilistic distances under continuous trait models for dissecting the interplay among phylogeny, model, and data. Syst Biol. 2021:70(4):660–680. 10.1093/sysbio/syab009.
    3. Adams RH, Blackmon H, Reyes-Velasco J, Schield DR, Card DC, Andrew AL, Waynewood N, Castoe TA. Microsatellite landscape evolutionary dynamics across 450 million years of vertebrate genome evolution. Genome. 2016:59(5):295–310. 10.1139/gen-2015-0124.
    4. Adams RH, Cain Z, Assis R, DeGiorgio M. Robust phylogenetic regression. Syst Biol. 2024:73(1):140–157. 10.1093/sysbio/syad070.
    5. Adams RH, Schield DR, Card DC, Castoe TA. Assessing the impacts of positive selection on coalescent-based species tree estimation and species delimitation. Syst Biol. 2018:67(6):1076–1090. 10.1093/sysbio/syy034.