PhyloTune: An efficient method to accelerate phylogenetic updates using a pretrained DNA language model

Danruo Deng^#¹, Wuqin Xu^#², Bian Wu³, Hans Peter Comes⁴, Yu Feng⁵, Pan Li⁶, Jinfang Zheng⁷, Guangyong Chen⁸, Pheng-Ann Heng¹

Affiliations

¹ Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China.
² Zhejiang Lab, Kechuang Avenue, Hangzhou, China. xuwuqin@zhejianglab.com.
³ Zhejiang Lab, Kechuang Avenue, Hangzhou, China.
⁴ Department of Environment & Biodiversity, Salzburg University, Salzburg, Austria.
⁵ Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu, Sichuan, China.
⁶ Systematic & Evolutionary Botany and Biodiversity Group, State Key Laboratory for Vegetation Structure, Function and Construction, College of Life Sciences, Zhejiang University, Hangzhou, China. panli@zju.edu.cn.
⁷ Zhejiang Lab, Kechuang Avenue, Hangzhou, China. zhengjinfang1220@gmail.com.
⁸ Hangzhou Institute of Medicine Chinese Academy of Sciences, Hangzhou, China. chenguangyong@him.cas.cn.

^# Contributed equally.

PMID: 40715068
PMCID: PMC12297363
DOI: 10.1038/s41467-025-61684-3

Danruo Deng et al. Nat Commun. 2025 Jul 26;16(1):6905. doi: 10.1038/s41467-025-61684-3.

Abstract

Understanding the phylogenetic relationships among species is crucial for comprehending major evolutionary transitions. As the volume of sequence data keeps growing, constructing reliable phylogenetic trees efficiently is becoming more challenging for current analytical methods. In this study, we introduce a new solution that accelerates the integration of novel taxa into an existing phylogenetic tree using a pretrained DNA language model. Our approach identifies the taxonomic unit of a newly collected sequence using existing taxonomic classification systems and updates the corresponding subtree. Specifically, we leverage a pretrained BERT network to obtain high-dimensional sequence representations, which are used not only to determine the subtree to be updated, but also to identify potentially valuable regions for subtree construction. We demonstrate the effectiveness of our method, named PhyloTune, through experiments on simulated datasets, as well as on our curated Plant dataset (focusing on Embryophyta) and microbial dataset (focusing on the genus *Bordetella*). Our findings provide evidence that phylogenetic trees can be constructed by automatically selecting the most informative regions of sequences, without manual selection of molecular markers. This finding can guide further research into the functional aspects of different regions of DNA sequences, enriching our understanding of biology.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

**Fig. 1. Overview of the tree update process and PhyloTune methodology.**
a Compared to the standard pipeline, PhyloTune introduces an innovative framework tailored to constrain updates within a specified subtree. By precisely identifying potentially informative regions within the subtree sequences, PhyloTune reduces the number and length of input sequences for alignment (e.g., MAFFT) and tree inference tools (e.g., RAxML), thereby improving tree update efficiency. b Overview of PhyloTune. For a given phylogenetic tree requiring updates, hierarchical linear probes (HLPs) are specifically designed to align with its taxonomic hierarchy. These probes are fine-tuned on a pre-trained DNA model to accurately classify query sequences at the smallest taxonomic unit within the specified tree while extracting high-attention regions from all sequences within the corresponding clade. c The PhyloTune model architecture tailored for the Plant dataset. It integrates a Transformer-based BERT module inherited from DNABERT and incorporates HLPs covering four taxonomic ranks: class, order, family, and genus.
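
The idea in panel b of classifying a query at the smallest taxonomic unit can be illustrated with a minimal sketch. It assumes each rank-level probe returns a predicted label plus a novelty score, and that a single threshold separates known from unknown taxa; the function name, the score scale, and the threshold value are hypothetical and not taken from the paper.

```python
# Ranks covered by the hierarchical linear probes for the Plant dataset (Fig. 1c).
RANKS = ["class", "order", "family", "genus"]


def smallest_taxonomic_unit(predictions, threshold=0.5):
    """Walk down the ranks and stop at the first one flagged as novel.

    predictions: dict mapping rank -> (predicted_label, novelty_score).
    threshold:   hypothetical cut-off; scores above it mean "unknown taxon".
    Returns (rank, label) of the deepest known rank, or None if even the
    top rank looks novel.
    """
    unit = None
    for rank in RANKS:
        label, novelty = predictions[rank]
        if novelty > threshold:
            break  # everything below this rank is treated as unknown
        unit = (rank, label)
    return unit


# Hypothetical scores for a query whose order is known but whose family is not.
query = {
    "class": ("Magnoliopsida", 0.05),
    "order": ("Fabales", 0.10),
    "family": ("Fabaceae", 0.90),
    "genus": ("Glycine", 0.95),
}
print(smallest_taxonomic_unit(query))
```

In this sketch the returned rank and label name the subtree that would be rebuilt, so a query resolved only to order triggers an order-level update.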

**Fig. 2. Performance comparison of phylogenetic tree updating methods.**
a Schematic overview of tree reconstruction strategies. The original tree is built from sequences simulated on the ground-truth tree (gt) with one sequence removed as the new sequence. Updates are performed using: all sequences (complete tree), full-length sequences of a target subtree (full-length tree), or high-attention regions of the subtree (high-attention region tree). Using the example of the addition of species 4, the updated parts of the three trees using RAxML are highlighted in blue. b, c Robinson-Foulds (RF) distance between each updated tree and gt, and the corresponding construction time. Each box plot (n = 5 independent experiments) shows the median, interquartile range (25th to 75th percentile), and whiskers to minima/maxima within 1.5 times IQR. d Example of updating the phylogenetic tree using PhyloTune. The original tree consisted of 677 species of 20 orders from Embryophyta. The tree was built using RAxML, with organisms colored based on order. The scale represents the normalized fraction of total branch length. The rugged bars at the outer circle represent the normalized length of input DNA sequences. (i) Update of out-of-distribution (OOD) sequences: the three newly added sequences belong to the order Fabales, but do not belong to any families or genera in the original tree, so only the subtree of Fabales is updated. (ii) Update of in-distribution (ID) sequences: the two newly added sequences belong to the genus *Primulina*, so the subtree of *Primulina* is updated. e Time comparison for the example tree. Blue and orange curves show subtree reconstruction times using full-length sequences and high-attention regions (one-third of the full length), respectively. The red dotted line indicates the time needed to update the tree using all sequences (about 20.1 h). Source data are provided as a Source Data file.
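
For readers unfamiliar with the RF distance used in panels b and c, the following is a minimal sketch of the standard unrooted definition: the number of non-trivial leaf bipartitions found in one tree but not the other. The nested-tuple tree encoding and the function names are assumptions made for illustration, and this page does not state which RF implementation the authors used.

```python
def leaf_sets(tree, out):
    """Return the leaves under `tree`, appending every internal clade to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Non-trivial splits of the leaf set, each stored as the side
    that does not contain a fixed reference leaf."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)
    splits = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        # Keep only splits with at least two leaves on both sides.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
    return splits


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance between two trees on the same leaf set."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Two hypothetical five-taxon trees that disagree on the placement of B and C.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2))
```

A distance of zero means the two topologies are identical, and larger values mean more conflicting splits.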

**Fig. 3. PhyloTune’s performance in identifying the smallest taxonomic unit.**
a PhyloTune’s performance in identifying the smallest taxonomic unit on simulated datasets with varying training sequences. Top: Taxonomic classification metrics for known taxa. Bottom: Novelty detection metrics for unknown taxa. Line charts show the mean ± 95% confidence interval (CI, computed from SEM, n = 30 independent experiments). b Comparison of taxonomic classification between PhyloTune and MMseqs2 (using training data as a reference). Line charts show the mean ± 95% CI (n = 10 independent experiments). c Comparative analysis of novelty detection scores for PhyloTune and baseline, using in-distribution (ID) and out-of-distribution (OOD) test sequences from the Plant dataset (n = 15000 sequences). Source data are provided as a Source Data file.
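
The "mean ± 95% CI computed from SEM" summary in panels a and b can be written out as a small sketch. It assumes the common normal approximation, in which the half-width is 1.96 times the standard error of the mean; whether the authors used this factor or a t-distribution quantile is not stated here, and the function name is hypothetical.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, half_width) of a 95% CI from the standard error of the mean.

    Assumes the normal approximation (factor 1.96); needs at least two values.
    """
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, 1.96 * sem


# Hypothetical metric values from five repeated runs.
mean, half = mean_ci95([0.91, 0.93, 0.90, 0.94, 0.92])
print(f"{mean:.3f} ± {half:.3f}")
```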

**Fig. 4. PhyloTune’s performance in phylogenetic tree reconstruction.**
a Difference in Robinson-Foulds (RF) distance between high-attention regions and full-length (RF(high, full)) versus low-attention regions and full-length (RF(low, full)) on the simulated datasets, across sequence counts and attention region lengths (1/K of the full length). Each box plot displays the median, interquartile range (25th to 75th percentile), and whiskers to minima/maxima within 1.5 times IQR (n = 10 independent experiments). b Difference in RF(high, full) versus RF(low, full) for order subtrees shown in Fig. 2d. c Phylogenetic trees for the angiosperm order Rosales constructed using high-attention (left) and low-attention (right) regions extracted by PhyloTune (same dataset as Fig. 2d). Source data are provided as a Source Data file.

**Fig. 5. Visual analyses of PhyloTune based on nine molecular markers of the Plant test dataset.**
a Comparison of the average AUROC for four taxonomic ranks of different markers in novelty detection (n = 15000 sequences for each rank). b–d Comparison of the average macro precision, macro recall and macro F1-score for four taxonomic ranks of different markers in taxonomic classification (n = 14700, 14400, 13200, 10000 sequences for class, order, family and genus). e Pearson correlation coefficient between average attention, heterozygosity, the fixation index (F_ST), absolute divergence (D_XY), and substitution rate based on nine molecular markers. f Attention heatmap of the chloroplast marker *mat*K using PhyloTune (n = 1000 sequences). The red box highlights the attention peak region for the majority of sequences, with an example of the corresponding DNA sequences displayed above it. g Average attention, heterozygosity, substitution rate, F_ST, and D_XY curves of *mat*K. The blue shaded area denotes the peak region of attention. Source data are provided as a Source Data file.

**Fig. 6. Phylogenetic trees constructed from high-attention regions show greater similarity to full-length sequence trees compared to those built from low-attention regions on the *Bordetella* dataset.**
a Difference in Robinson-Foulds (RF) distance between trees constructed from high-attention regions and full-length sequences (RF(high, full)) versus those constructed from low-attention regions and full-length sequences (RF(low, full)). High- and low-attention regions are defined by dividing the sequence from the *Bordetella* dataset into K = 12 segments and selecting the top M (1, 2, 3, and 4) regions with the highest and lowest attention scores, respectively. b Comparison of phylogenetic tree topologies of clade1 constructed from high- and low-attention regions at different M values. Source data are provided as a Source Data file.
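
The region selection described in panel a (split a sequence into K segments, then keep the M segments with the highest or lowest attention) can be sketched as follows. Scoring each segment by its mean per-position attention is an assumption, as are the function and parameter names; the paper may aggregate attention differently.

```python
def select_attention_regions(attention, k, m, highest=True):
    """Split positions into k contiguous segments and pick m of them.

    attention: per-position attention scores (length >= k).
    highest:   True for high-attention regions, False for low-attention ones.
    Returns the chosen (start, end) index pairs in sequence order.
    """
    n = len(attention)
    bounds = [(i * n // k, (i + 1) * n // k) for i in range(k)]
    # Assumed aggregation: mean attention within each segment.
    scores = [sum(attention[s:e]) / (e - s) for s, e in bounds]
    ranked = sorted(range(k), key=lambda i: scores[i], reverse=highest)
    return [bounds[i] for i in sorted(ranked[:m])]


# Hypothetical attention profile over 12 positions, split into K = 4 segments.
profile = [1, 1, 1, 5, 5, 5, 2, 2, 2, 0, 0, 0]
print(select_attention_regions(profile, k=4, m=2))
print(select_attention_regions(profile, k=4, m=1, highest=False))
```

The returned index pairs can then be used to slice every sequence in the clade before alignment and tree inference.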

**Fig. 7. Overview of the Plant dataset.**
a Taxonomic hierarchy and sample type. The Plant dataset, built on the Embryophyta, organizes samples across four taxonomic ranks: class, order, family, and genus. Samples are categorized into five types based on their known taxonomic resolution: ① known genus, ② unknown genus but known family, ③ unknown genus and family but known order, ④ unknown genus, family, and order but known class, and ⑤ unknown taxa in all taxonomic ranks. b Splitting of training, validation and test sets. For each known genus, 50 samples and 100 samples are randomly drawn for the validation and test sets, respectively, and the remaining samples form the training set. For each unknown genus, 100 samples are randomly drawn to form the other part of the test set. c Distribution of the Plant training set for each of the four taxonomic ranks, all of which exhibit significant class imbalance. Source data are provided as a Source Data file.
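
The per-genus split in panel b can be sketched as below for a single known genus: 50 validation samples, 100 test samples, and the remainder for training. The function name, the fixed seed, and the error raised for genera with too few samples are assumptions added for illustration.

```python
import random


def split_known_genus(samples, n_val=50, n_test=100, seed=0):
    """Randomly split one known genus into (train, val, test) lists."""
    if len(samples) <= n_val + n_test:
        # Assumed behaviour: refuse genera that would leave no training data.
        raise ValueError("not enough samples to leave a training set")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Hypothetical genus with 400 sequence identifiers.
ids = [f"seq_{i}" for i in range(400)]
train, val, test = split_known_genus(ids)
print(len(train), len(val), len(test))
```

Samples from unknown genera would bypass this function, since the caption indicates they contribute only to the test set.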
