F1000Res. 2024 Jul 10:12:945.
doi: 10.12688/f1000research.139356.2. eCollection 2023.

Scoutknife: A naïve, whole genome informed phylogenetic robusticity metric

James Fleming et al. F1000Res.

Abstract

Background: The phylogenetic bootstrap, first proposed by Felsenstein in 1985, is a critically important statistical method in assessing the robusticity of phylogenetic datasets. Core to its concept was the use of pseudo sampling - assessing the data by generating new replicates derived from the initial dataset that was used to generate the phylogeny. In this way, phylogenetic support metrics could overcome the lack of perfect, infinite data. With infinite data, however, it is possible to sample smaller replicates directly from the data to obtain both the phylogeny and its statistical robusticity in the same analysis. Due to the growth of whole genome sequencing, the depth and breadth of our datasets have greatly expanded and are set to only expand further. With genome-scale datasets comprising thousands of genes, we can now obtain a proxy for infinite data. Accordingly, we can potentially abandon the notion of pseudo sampling and instead randomly sample small subsets of genes from the thousands of genes in our analyses.

Methods: We introduce Scoutknife, a jackknife-style subsampling implementation that generates 100 datasets by randomly sampling a small number of genes from an initial large-gene dataset to jointly establish both a phylogenetic hypothesis and assess its robusticity. We assess its effectiveness by using 18 previously published datasets and 100 simulation studies.

Results: We show that Scoutknife is conservative and informative as to conflicts and incongruence across the whole genome, without the need for subsampling based on traditional model selection criteria.

Conclusions: Scoutknife reliably achieves comparable results to selecting the best genes on both real and simulation datasets, while being resistant to the potential biases caused by selecting for model fit. As the amount of genome data grows, it becomes an even more exciting option to assess the robusticity of phylogenetic hypotheses.

Keywords: bootstrapping; phylogenetics; software.

Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. A figure showing a bootstrap pseudosampling process (Panel A) and a Scoutknife sampling process (Panel C), with the theoretical unlimited data jackknife sample in the middle (Panel B).
Note that Scoutknife bears more similarity to unlimited data sampling than a traditional bootstrap. Scoutknife may not take the same gene twice within the same sample but may take the same gene multiple times between samples – see Scoutknife replicate #1 and #2, which both sample gene #97. The structure of this figure is based upon Hillis et al. (1996), Chapter 11, page 508, Figure 33.
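
The sampling contrast in Figure 1 can be illustrated with a minimal sketch. This is not the Scoutknife implementation: the function names, the 1,000-gene pool and the choice of 50 genes per sample are assumptions made for illustration, and only the count of 100 replicates comes from the abstract. The bootstrap draws with replacement, whereas the Scoutknife-style draw never repeats a gene within a sample but can reuse it in another sample.

```python
import random


def bootstrap_replicate(items, rng):
    """Bootstrap pseudosample: same size as the input, drawn WITH replacement."""
    return [rng.choice(items) for _ in items]


def scoutknife_style_replicates(gene_ids, genes_per_sample, n_replicates=100, seed=0):
    """Draw small gene subsets, each WITHOUT replacement within the sample.

    Samples are drawn independently, so a gene may recur between samples.
    genes_per_sample is a hypothetical parameter, not a documented setting.
    """
    rng = random.Random(seed)
    return [rng.sample(gene_ids, genes_per_sample) for _ in range(n_replicates)]


# Hypothetical gene pool standing in for a genome-scale dataset.
genes = [f"gene_{i}" for i in range(1, 1001)]
replicates = scoutknife_style_replicates(genes, genes_per_sample=50)
```

Each replicate would then be passed to tree inference separately, and the resulting trees summarised as a consensus.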
Figure 2. A dual bar chart showing proportion of non-conflicting nodes (in blue) and explicitly agreeing nodes (in orange) for each dataset.
The two datasets discussed further in the text, Araneae and Lepidoptera, are highlighted in light blue (for non-conflict) and red (for explicit agreement) respectively.
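
One common way to separate "explicitly agreeing" from "non-conflicting" nodes is sketched below. It assumes rooted trees represented as sets of clades, and it assumes the usual compatibility rule (two clades are compatible when they are nested or disjoint); the paper's exact definitions may differ, and all names here are hypothetical.

```python
def compatible(a, b):
    """Two clades are compatible if one contains the other or they share no taxa."""
    return a <= b or b <= a or a.isdisjoint(b)


def node_proportions(query_clades, reference_clades):
    """Return (proportion agreeing, proportion non-conflicting) for the query nodes."""
    n = len(query_clades)
    # Agreeing: the identical clade is present in the reference tree.
    agree = sum(c in reference_clades for c in query_clades)
    # Non-conflicting: the clade is compatible with every reference clade.
    no_conflict = sum(
        all(compatible(c, r) for r in reference_clades) for c in query_clades
    )
    return agree / n, no_conflict / n


# Toy example: the reference leaves taxon C unresolved, the query resolves it.
reference = {frozenset("AB"), frozenset("DE")}
query = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
agree, no_conflict = node_proportions(query, reference)
```

In this toy case two of the three query clades appear in the reference, while all three are compatible with it, so the non-conflict proportion exceeds the agreement proportion.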
Figure 3. A violin plot showing the distribution of gene occupancy across the Araneae dataset by Fernández et al. (2018).
A large proportion of low occupancy genes may cause issues for Scoutknife resolution.
Figure 4. A violin plot showing the distribution of Marczewski-Steinhaus values between Scoutknife Consensus trees and the GeneSortR Most Informative 250 Genes Tree at both a 0.7 strict consensus and 0.5 majority consensus.
Note the long tail on the Majority Marczewski-Steinhaus violin, representing Simulation 20.
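
For orientation, the Marczewski-Steinhaus distance between two sets is the size of their symmetric difference divided by the size of their union. The sketch below applies it to the clade sets of two rooted trees, which is an assumption about how the values in Figure 4 were obtained rather than a description of the authors' pipeline; the function name and toy trees are hypothetical.

```python
def marczewski_steinhaus(a, b):
    """Marczewski-Steinhaus distance: |A symmetric-difference B| / |A union B|."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return len(a ^ b) / len(union)


# Toy clade sets for two rooted trees on taxa A to E.
tree_1 = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
tree_2 = {frozenset("AB"), frozenset("DE"), frozenset("CDE")}
distance = marczewski_steinhaus(tree_1, tree_2)
```

The value ranges from 0 (identical clade sets) to 1 (no shared clades); here the trees share two of four distinct clades, giving 0.5.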
