. 2024 Mar 29;40(4):btae154.

doi: 10.1093/bioinformatics/btae154.

MIKE: an ultrafast, assembly-, and alignment-free approach for phylogenetic tree construction

Fang Wang^{1

2}, Yibin Wang², Xiaofei Zeng³, Shengcheng Zhang², Jiaxin Yu², Dongxi Li¹, Xingtan Zhang²

Affiliations

¹ College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, Shanxi 030024, China.
² National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China.
³ Department of Human Cell Biology and Genetics, Joint Laboratory of Guangdong-Hong Kong Universities for Vascular Homeostasis and Diseases, School of Medicine, Southern University of Science and Technology, Shenzhen, Guangdong 508055, China.

PMID: 38547397
PMCID: PMC10990684
DOI: 10.1093/bioinformatics/btae154

MIKE: an ultrafast, assembly-, and alignment-free approach for phylogenetic tree construction

Fang Wang et al. Bioinformatics. 2024.

. 2024 Mar 29;40(4):btae154.

doi: 10.1093/bioinformatics/btae154.

Authors

Fang Wang^{1

2}, Yibin Wang², Xiaofei Zeng³, Shengcheng Zhang², Jiaxin Yu², Dongxi Li¹, Xingtan Zhang²

Affiliations

¹ College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, Shanxi 030024, China.
² National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China.
³ Department of Human Cell Biology and Genetics, Joint Laboratory of Guangdong-Hong Kong Universities for Vascular Homeostasis and Diseases, School of Medicine, Southern University of Science and Technology, Shenzhen, Guangdong 508055, China.

PMID: 38547397
PMCID: PMC10990684
DOI: 10.1093/bioinformatics/btae154

Abstract

Motivation: Constructing a phylogenetic tree requires calculating the evolutionary distance between samples or species via large-scale resequencing data, a process that is both time-consuming and computationally demanding. Striking the right balance between accuracy and efficiency is a significant challenge.

Results: To address this, we introduce a new algorithm, MIKE (MinHash-based k-mer algorithm). This algorithm is designed for the swift calculation of the Jaccard coefficient directly from raw sequencing reads and enables the construction of phylogenetic trees based on the resultant Jaccard coefficient. Simulation results highlight the superior speed of MIKE compared to existing state-of-the-art methods. We used MIKE to reconstruct a phylogenetic tree, incorporating 238 yeast, 303 Zea, 141 Ficus, 67 Oryza, and 43 Saccharum spontaneum samples. MIKE demonstrated accurate performance across varying evolutionary scales, reproductive modes, and ploidy levels, proving itself as a powerful tool for phylogenetic tree construction.

Availability and implementation: MIKE is publicly available on Github at https://github.com/Argonum-Clever2/mike.git.

PubMed Disclaimer

Conflict of interest statement

None declared.

Figures

**Figure 1.**
The process of simulating data and overview of MIKE algorithm. (a) The process of simulating data. Four sets of monoploid datasets were simulated, including haploid, autotetraploid, allotetraploid, and polyploid. Beginning with the same ancestral chromosomes, some regions are designated as conservative (no mutations) and others as non-conservative (allowing mutations). Polyploids can generate new genomes through processes such as whole-genome duplication (WGD) and hybridization. The polyploid datasets were simulated with six ploidy levels: diploid, tetraploid, hexaploid, octoploid, dodecaploid, and hexadecaploid. Each dataset is designed to undergo mutations in each generation, resulting in the generation of n different offspring in the process. (b) The overview of MIKE algorithm. First, the sequencing reads are divided into k-mers, with k set to 21 by default. A mapping is defined to represent each character using 2 bits, where A, C, G, and T correspond to 00, 01, 10, and 11, respectively. Each k-mer is split into two parts, defined as a prefix and a suffix, $k_{pre}$ and $k_{suf}$ . Subsequently, k-mers with the same $k_{pre}$ are grouped together. Within each group, a random shuffled permutation $π$ with a numerical range of $[1, \max (k_{suf})]$ is applied. All $k_{suf}$ values for each group are marked with either 0 or 1 to create one-hot vectors, where a value is marked as 1 if the $k_{suf}$ occurs and 0 if it does not. These vectors are then multiplied by the permutation $π$ and the smallest non-zero value $h_{π} (c)$ is selected as the representative feature value for that group, known as the minhash fingerprint. This minhash fingerprint can effectively represent the original sequencing data

**Figure 2.**
Computational efficiencies. (a) The process of simulating data. (b) Four sets of monoploid datasets were simulated, including haploid, autotetraploid, allotetraploid, and polyploid

**Figure 3.**
The phylogenetic tree of Ficus. (a, b) The phylogenetic tree constructed using MIKE for 141 samples of the genus *Ficus*. (c) The phylogenetic tree for 22 species selected from the subgenus *Sycomorus* of the genus *Ficus*, constructed using both MIKE and CallSNPs through BIONJ methods

**Figure 4.**
The phylogenetic tree constructed using MIKE for 303 samples of the genus *Zea*

**Figure 5.**
The phylogenetic tree constructed using MIKE for 42 samples of *S.spontaneum*

See this image and copyright information in PMC

References

1. Alexandrov N, Tai S, Wang W. et al. SNP-seek database of SNPs derived from 3000 rice genomes. Nucleic Acids Res 2015;43:D1023–7. - PMC - PubMed
1. Batley J, Edwards D.. SNP applications in plants. In Oraguzie NC, Rikkerink EHA, Gardiner SE, De Silva HN (eds) Association Mapping in Plants, pp. 95–102. Berlin: Springer, 2007.
1. Berlin K, Koren S, Chin C-S. et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol 2015;33:623–30. - PubMed
1. Bos DH, Posada D.. Using models of nucleotide evolution to build phylogenetic trees. Dev Comp Immunol 2005;29:211–27. - PubMed
1. Buhler J. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics 2001;17:419–28. - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

MIKE: an ultrafast, assembly-, and alignment-free approach for phylogenetic tree construction

Affiliations

MIKE: an ultrafast, assembly-, and alignment-free approach for phylogenetic tree construction

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Miscellaneous