Proc Natl Acad Sci U S A. 2025 Jun 17;122(24):e2421738122.
doi: 10.1073/pnas.2421738122. Epub 2025 Jun 9.

Cross-species modeling of plant genomes at single-nucleotide resolution using a pretrained DNA language model

Jingjing Zhai et al. Proc Natl Acad Sci U S A.

Abstract

Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pretrained on large-scale biological sequences can capture evolutionary conservation and, when fine-tuned on limited labeled data, offer better cross-species prediction than supervised models. We introduce PlantCaduceus, a plant DNA LM that learns evolutionary conservation patterns in 16 angiosperm genomes by modeling both DNA strands simultaneously. When fine-tuned on a small set of labeled Arabidopsis data for tasks such as predicting translation initiation/termination sites and splice donor/acceptor sites, PlantCaduceus demonstrated remarkable transferability to maize, which diverged from Arabidopsis 160 Mya. The model outperformed the best existing DNA LM by 1.45-fold in maize splice donor prediction and 7.23-fold in maize translation initiation site prediction. In variant effect prediction, PlantCaduceus showed performance comparable to state-of-the-art protein LMs. Mutations predicted to be deleterious by PlantCaduceus showed threefold lower average minor allele frequencies than those identified by multiple sequence alignment-based methods. Additionally, PlantCaduceus successfully identified well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.

Keywords: angiosperm; deep learning; deleterious mutation; gene annotation; language model.

Conflict of interest statement

Competing interests statement: M.C.R. (co-author) assisted in organizing a yield prediction contest in which Shiu participated. Both were co-authors in a community-wide publication summarizing the contest results. They have never met and had no direct collaboration beyond these publicly coordinated activities. The other authors declare no competing interests.

Comment in

  • Decoding nature's grammar with DNA language models.
    Morrell PL, Pakhomov SV. Proc Natl Acad Sci U S A. 2025 Jul 22;122(29):e2512889122. doi: 10.1073/pnas.2512889122. Epub 2025 Jul 14. PMID: 40658864.
