Cross-species modeling of plant genomes at single-nucleotide resolution using a pretrained DNA language model
- PMID: 40489624
- PMCID: PMC12184517
- DOI: 10.1073/pnas.2421738122
Abstract
Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pretrained on large-scale biological sequences can capture evolutionary conservation and, when fine-tuned on limited labeled data, can offer better cross-species prediction than supervised models. We introduce PlantCaduceus, a plant DNA LM that learns evolutionary conservation patterns in 16 angiosperm genomes by modeling both DNA strands simultaneously. When fine-tuned on a small set of labeled Arabidopsis data for tasks such as predicting translation initiation/termination sites and splice donor/acceptor sites, PlantCaduceus demonstrated remarkable transferability to maize, which diverged from Arabidopsis 160 Mya. The model outperformed the best existing DNA LM by 1.45-fold in maize splice donor prediction and 7.23-fold in maize translation initiation site prediction. In variant effect prediction, PlantCaduceus showed performance comparable to state-of-the-art protein LMs. Mutations predicted to be deleterious by PlantCaduceus showed threefold lower average minor allele frequencies than those identified by multiple sequence alignment-based methods. Additionally, PlantCaduceus successfully identified well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.
Keywords: angiosperm; deep learning; deleterious mutation; gene annotation; language model.
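The abstract notes that the model handles both DNA strands simultaneously. As a minimal illustration of what strand symmetry means, the sketch below computes a reverse complement and averages a score over the two strands. This is not the PlantCaduceus architecture or API: `strand_symmetric_score` and `toy_score` are hypothetical names introduced here, and averaging after the fact is only one simple way to obtain a strand-independent score.

```python
# Complement table for the four DNA bases plus N (unknown base).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of `seq`, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_symmetric_score(seq: str, score_fn) -> float:
    """Average a per-sequence score over both strands.

    `score_fn` is a hypothetical stand-in for any model that maps a
    DNA string to a number; it is not a real PlantCaduceus call.
    """
    return 0.5 * (score_fn(seq) + score_fn(reverse_complement(seq)))


def toy_score(seq: str) -> float:
    """Toy, strand-dependent score (assumption): fraction of A bases."""
    return seq.upper().count("A") / len(seq)


seq = "ATGGCA"
rc = reverse_complement(seq)  # "TGCCAT"

# The raw toy score differs between strands, but the averaged score
# is the same whichever strand is supplied.
forward = strand_symmetric_score(seq, toy_score)
reverse = strand_symmetric_score(rc, toy_score)
```

A model that is symmetric by construction would not need this wrapper; the sketch only shows the property that a sequence and its reverse complement receive the same score.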
Conflict of interest statement
Competing interests statement: M.C.R. (co-author) assisted in organizing a yield prediction contest in which Shiu participated. Both were co-authors in a community-wide publication summarizing the contest results. They have never met and had no direct collaboration beyond these publicly coordinated activities. The other authors declare no competing interests.
Update of
- Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model. bioRxiv [Preprint]. 2024 Aug 22:2024.06.04.596709. doi: 10.1101/2024.06.04.596709. Update in: Proc Natl Acad Sci U S A. 2025 Jun 17;122(24):e2421738122. doi: 10.1073/pnas.2421738122. PMID: 38895432.
Comment in
- Decoding nature's grammar with DNA language models. Proc Natl Acad Sci U S A. 2025 Jul 22;122(29):e2512889122. doi: 10.1073/pnas.2512889122. Epub 2025 Jul 14. PMID: 40658864.