Review
Nat Rev Genet. 2025 Jun;26(6):395-405.
doi: 10.1038/s41576-024-00803-0. Epub 2025 Jan 8.

A phylogenetic approach to comparative genomics

Anna E Dewar et al. Nat Rev Genet. 2025 Jun.

Abstract

Comparative genomics, whereby the genomes of different species are compared, has the potential to address broad and fundamental questions at the intersection of genetics and evolution. However, species, genomes and genes cannot be considered as independent data points within statistical tests. Closely related species tend to be similar because they share genes by common descent, which must be accounted for in analyses. This problem of non-independence may be exacerbated when examining genomes or genes but can be addressed by applying phylogeny-based methods to comparative genomic analyses. Here, we review how controlling for phylogeny can change the conclusions of comparative genomics studies. We address common questions on how to apply these methods and illustrate how they can be used to test causal hypotheses. The combination of rapidly expanding genomic datasets and phylogenetic comparative methods is set to revolutionize the biological insights possible from comparative genomic studies.
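The non-independence problem can be made concrete with a toy calculation. The sketch below assumes two hypothetical lineages of eight species each (16 in total); the coordinates, the lineage names and the `pearson` helper are invented for illustration and are not data from the review. Within each lineage X and Y are uncorrelated by construction, yet pooling the lineages gives a strong correlation, because the lineages differ in their mean values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical within-lineage scatter: offsets with zero covariance.
offsets = [(-1, -1), (-1, 1), (1, -1), (1, 1), (0, -1), (0, 1), (-1, 0), (1, 0)]

# Two hypothetical lineages that differ only in their mean trait values.
green = [(2 + dx, 2 + dy) for dx, dy in offsets]
pink = [(8 + dx, 8 + dy) for dx, dy in offsets]

def corr(points):
    xs, ys = zip(*points)
    return pearson(xs, ys)

r_green = corr(green)       # no relationship within the green lineage
r_pink = corr(pink)         # no relationship within the pink lineage
r_all = corr(green + pink)  # strong correlation when species are pooled

print(f"green: {r_green:.3f}, pink: {r_pink:.3f}, pooled: {r_all:.3f}")
```

Here the pooled correlation reflects a single difference between two lineages rather than 16 independent observations, which is the situation phylogeny-based methods are designed to handle.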

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Figure 1. Species are not independent data points.
A. (i) There seems to be a strong positive correlation between X and Y, where each dot is a species (n = 16 species). (ii) In an extreme scenario, those species come from two separate, monophyletic lineages, represented by green and pink dots. When mapped onto the original scatterplot, the two lineages form largely separate clusters. Within each of those clusters, there is no relationship between X and Y. The original correlation was just an artefact of the mean values of X and Y being larger in the pink lineage than in the green lineage. Inspired by Figures 5-7 in Felsenstein, 1985.
B. A simple hypothetical genomic example. Each tip is a bacterial species; red and blue cells correspond to pathogenic and non-pathogenic species, respectively, and species that carry toxA are indicated by cells with the gene present. Using the species at the tips of the tree as independent data points, we might spuriously conclude that toxA facilitates the evolution of pathogenicity: 86% of pathogenic species carry toxA (6/7 species), compared with only 25% of non-pathogenic species (1/4 species) (chi-squared = 4.05, p < 0.05). However, this significant correlation is an artefact of shared evolutionary history. Pathogenicity evolved only twice in the phylogeny: once in a lineage with toxA and once in a lineage without toxA. Rather than being independent data points, the cluster of pathogenic, toxA-carrying species is analogous to technical pseudoreplicates in an empirical study.
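The naive test in panel B can be reproduced from the counts given in the caption. The following is a minimal sketch, assuming a Pearson chi-squared test on the 2x2 table without continuity correction (the caption does not say which variant was used, but this one matches the reported statistic); the function name `chi_squared_2x2` is hypothetical.

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the table
        [[a, b],
         [c, d]]
    Returns the statistic and its p-value for 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Counts from the caption: rows are pathogenic / non-pathogenic species,
# columns are toxA present / toxA absent.
chi2, p = chi_squared_2x2(6, 1, 1, 3)
print(f"chi-squared = {chi2:.2f}, p = {p:.3f}")
```

This calculation treats all 11 tips as independent observations, which is exactly the assumption the caption warns against: pathogenicity arose only twice on the tree, so the effective number of independent events is far smaller than the number of species.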
Figure 2. Phylogenetic bias of genome sequencing.
Order level visualisation of the GTDB bacterial phylogeny (v.214); the size of dots corresponds to the number of complete bacterial genomes represented in both the RefSeq and the GTDB database from each taxonomic order. Labels correspond to the five orders with the most genome sequences.
Figure 3. Testing causal hypotheses.
A. Imagine that bacterial species that carry the gene toxB are more likely to be pathogens. Can we test whether causality runs in that direction, with toxB favouring the transition to pathogenicity?
B. Yes, we can use transition rate methods. For the two binary traits, pathogenicity and toxB presence, there are four possible states, each represented by a cell. The number of evolutionary transitions between these states across the phylogeny is indicated by arrows: larger arrows correspond to evolutionary changes that occur more frequently. Almost all transitions to pathogenicity (blue → red) occur when the species already has the toxB gene. Most often, non-pathogens first acquire the toxB gene and then evolve pathogenicity, suggesting that toxB does help pathogenicity to evolve.
Figure 4. Statistical significance and biological importance.
The effect size (R²; proportion of variance explained) required to produce a statistically significant result decreases with the number of data points (sample size). For an unpaired t-test, the x-axis is the number of data points in each of the two groups compared (N1 = N2 = N) and the y-axis is the minimum R² value that would be significant at a given p-value threshold for that N. The lines show the numerical relationship between the axes for each of five p-value thresholds, all of which are conventionally considered statistically significant. The area corresponding to less than 5% of the variance explained is shaded in grey (R² < 0.05). Larger datasets can detect smaller effects, but very large datasets will flag almost all effects as significant, even when they explain far less than 5% of the variance.
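The relationship described in this caption can be sketched numerically. The code below assumes an equal-variance two-sample t-test, for which R² = t² / (t² + df) with df = 2N - 2; the caption does not state how the figure was computed, so this is one plausible reconstruction rather than the authors' procedure. The function names are hypothetical, and the critical t value is found by numerical integration and bisection, so results are approximate.

```python
import math

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value for Student's t, via Simpson integration of the density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    t = abs(t)
    h = t / steps  # steps is even, as Simpson's rule requires
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central)

def critical_t(alpha, df):
    """Smallest |t| whose two-sided p-value is at most alpha (bisection)."""
    lo, hi = 0.0, 1.0
    while t_two_sided_p(hi, df) > alpha:
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_two_sided_p(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

def min_r2(n_per_group, alpha):
    """Minimum R² that reaches significance at alpha with N1 = N2 = n_per_group."""
    df = 2 * n_per_group - 2
    t = critical_t(alpha, df)
    return t * t / (t * t + df)

for n in (10, 100, 1000):
    print(n, round(min_r2(n, 0.05), 4))
```

Under these assumptions, the minimum significant R² at p = 0.05 falls from roughly 0.2 with 10 data points per group to well under 0.05 with 1,000 per group, in line with the caption's point that statistical significance in large datasets says little about biological importance.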
