Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Mar 29;40(4):btae154.
doi: 10.1093/bioinformatics/btae154.

MIKE: an ultrafast, assembly-, and alignment-free approach for phylogenetic tree construction

Affiliations

MIKE: an ultrafast, assembly-, and alignment-free approach for phylogenetic tree construction

Fang Wang et al. Bioinformatics. .

Abstract

Motivation: Constructing a phylogenetic tree requires calculating the evolutionary distance between samples or species via large-scale resequencing data, a process that is both time-consuming and computationally demanding. Striking the right balance between accuracy and efficiency is a significant challenge.

Results: To address this, we introduce a new algorithm, MIKE (MinHash-based k-mer algorithm). This algorithm is designed for the swift calculation of the Jaccard coefficient directly from raw sequencing reads and enables the construction of phylogenetic trees based on the resultant Jaccard coefficient. Simulation results highlight the superior speed of MIKE compared to existing state-of-the-art methods. We used MIKE to reconstruct a phylogenetic tree, incorporating 238 yeast, 303 Zea, 141 Ficus, 67 Oryza, and 43 Saccharum spontaneum samples. MIKE demonstrated accurate performance across varying evolutionary scales, reproductive modes, and ploidy levels, proving itself as a powerful tool for phylogenetic tree construction.

Availability and implementation: MIKE is publicly available on Github at https://github.com/Argonum-Clever2/mike.git.

PubMed Disclaimer

Conflict of interest statement

None declared.

Figures

Figure 1.
Figure 1.
The process of simulating data and overview of MIKE algorithm. (a) The process of simulating data. Four sets of monoploid datasets were simulated, including haploid, autotetraploid, allotetraploid, and polyploid. Beginning with the same ancestral chromosomes, some regions are designated as conservative (no mutations) and others as non-conservative (allowing mutations). Polyploids can generate new genomes through processes such as whole-genome duplication (WGD) and hybridization. The polyploid datasets were simulated with six ploidy levels: diploid, tetraploid, hexaploid, octoploid, dodecaploid, and hexadecaploid. Each dataset is designed to undergo mutations in each generation, resulting in the generation of n different offspring in the process. (b) The overview of MIKE algorithm. First, the sequencing reads are divided into k-mers, with k set to 21 by default. A mapping is defined to represent each character using 2 bits, where A, C, G, and T correspond to 00, 01, 10, and 11, respectively. Each k-mer is split into two parts, defined as a prefix and a suffix, kpre and ksuf. Subsequently, k-mers with the same kpre are grouped together. Within each group, a random shuffled permutation π with a numerical range of [1, max(ksuf)] is applied. All ksuf values for each group are marked with either 0 or 1 to create one-hot vectors, where a value is marked as 1 if the ksuf occurs and 0 if it does not. These vectors are then multiplied by the permutation π and the smallest non-zero value hπ(c) is selected as the representative feature value for that group, known as the minhash fingerprint. This minhash fingerprint can effectively represent the original sequencing data
Figure 2.
Figure 2.
Computational efficiencies. (a) The process of simulating data. (b) Four sets of monoploid datasets were simulated, including haploid, autotetraploid, allotetraploid, and polyploid
Figure 3.
Figure 3.
The phylogenetic tree of Ficus. (a, b) The phylogenetic tree constructed using MIKE for 141 samples of the genus Ficus. (c) The phylogenetic tree for 22 species selected from the subgenus Sycomorus of the genus Ficus, constructed using both MIKE and CallSNPs through BIONJ methods
Figure 4.
Figure 4.
The phylogenetic tree constructed using MIKE for 303 samples of the genus Zea
Figure 5.
Figure 5.
The phylogenetic tree constructed using MIKE for 42 samples of S.spontaneum

References

    1. Alexandrov N, Tai S, Wang W. et al. SNP-seek database of SNPs derived from 3000 rice genomes. Nucleic Acids Res 2015;43:D1023–7. - PMC - PubMed
    1. Batley J, Edwards D.. SNP applications in plants. In Oraguzie NC, Rikkerink EHA, Gardiner SE, De Silva HN (eds) Association Mapping in Plants, pp. 95–102. Berlin: Springer, 2007.
    1. Berlin K, Koren S, Chin C-S. et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol 2015;33:623–30. - PubMed
    1. Bos DH, Posada D.. Using models of nucleotide evolution to build phylogenetic trees. Dev Comp Immunol 2005;29:211–27. - PubMed
    1. Buhler J. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics 2001;17:419–28. - PubMed

Publication types