Proc Natl Acad Sci U S A. 2025 Jun 17;122(24):e2421738122.
doi: 10.1073/pnas.2421738122. Epub 2025 Jun 9.

Cross-species modeling of plant genomes at single-nucleotide resolution using a pretrained DNA language model

Jingjing Zhai et al. Proc Natl Acad Sci U S A.

Abstract

Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pretrained on large-scale biological sequences can capture evolutionary conservation and, when fine-tuned on limited labeled data, offer better cross-species prediction than supervised models. We introduce PlantCaduceus, a plant DNA LM that learns evolutionary conservation patterns in 16 angiosperm genomes by modeling both DNA strands simultaneously. When fine-tuned on a small set of labeled Arabidopsis data for tasks such as predicting translation initiation/termination sites and splice donor/acceptor sites, PlantCaduceus demonstrated remarkable transferability to maize, which diverged from Arabidopsis 160 Mya. The model outperformed the best existing DNA language model by 1.45-fold in maize splice donor prediction and 7.23-fold in maize translation initiation site prediction. In variant effect prediction, PlantCaduceus showed performance comparable to state-of-the-art protein LMs. Mutations predicted to be deleterious by PlantCaduceus showed threefold lower average minor allele frequencies compared to those identified by multiple sequence alignment-based methods. Additionally, PlantCaduceus successfully identified well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.

Keywords: angiosperm; deep learning; deleterious mutation; gene annotation; language model.

Conflict of interest statement

Competing interests statement: M.C.R. (co-author) assisted in organizing a yield prediction contest in which Shiu participated. Both were co-authors on a community-wide publication summarizing the contest results. They have never met and had no direct collaboration beyond these publicly coordinated activities. The other authors declare no competing interests.

Figures

Fig. 1.
Overview of PlantCaduceus. (A) Phylogenetic tree of the 16 angiosperm species used for pretraining the PlantCaduceus model. (B) The input for PlantCaduceus consists of 512-bp DNA sequences with 15% of positions randomly masked. The pretraining objective is cross-entropy loss on the masked positions. The sequences are processed through the bidirectional Caduceus architecture, which is based on the Mamba sequence operator, a recently proposed structured state space model (SSM). Caduceus also contains a reverse-complement (RC) equivariance inductive bias. The largest model uses 32 Caduceus blocks. (C) UMAP visualization of embeddings from PlantCaduceus (32 layers) averaged over nonoverlapping 100-bp windows along the sorghum genome without intergenic regions. (D) The same UMAP visualization as in (C) but with intergenic regions.
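The masked-position objective in panel (B) can be illustrated with a minimal sketch. It is not the authors' implementation: the mask symbol `N`, the choice to mask an exact count of positions rather than sampling each position independently, and the uniform stand-in for model predictions are all assumptions made for illustration. The `reverse_complement` helper only shows the strand relationship that an RC-equivariant model is built to respect.

```python
import math
import random

NUCLEOTIDES = "ACGT"
MASK = "N"  # hypothetical mask symbol, chosen for illustration only


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Mask a fixed fraction of positions; return (masked sequence, positions)."""
    rng = random.Random(seed)
    n_mask = round(len(seq) * mask_rate)
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    masked = list(seq)
    for p in positions:
        masked[p] = MASK
    return "".join(masked), positions


def masked_cross_entropy(seq, positions, probs):
    """Mean negative log-probability of the true base at each masked position.

    probs[i] maps each base to its predicted probability at positions[i].
    """
    losses = [-math.log(probs[i][seq[p]]) for i, p in enumerate(positions)]
    return sum(losses) / len(losses)


def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# A random 512-bp sequence stands in for a genomic window.
rng = random.Random(1)
seq = "".join(rng.choice(NUCLEOTIDES) for _ in range(512))
masked, positions = mask_sequence(seq)

# An uninformative "model" that predicts every base with probability 0.25.
uniform = [{b: 0.25 for b in NUCLEOTIDES} for _ in positions]
loss = masked_cross_entropy(seq, positions, uniform)
print(len(positions), round(loss, 4))
```

With uniform predictions the loss equals ln 4, which is the value a trained model has to beat on the masked positions.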
Fig. 2.
Modeling translation and transcription through fine-tuning PlantCaduceus. (A) Classification strategy using PlantCaduceus embeddings: The weights of the pretrained PlantCaduceus model are kept frozen during fine-tuning. The last hidden state of PlantCaduceus is then used as features for training XGBoost and linear classifiers. (B) Phylogenetic tree of species used for training, validation, and testing during the fine-tuning of PlantCaduceus. (C–F) Bar plots displaying the PRAUC scores for six species across four tasks: TIS (C), TTS (D), splice donor (E), and splice acceptor (F). The gene structures on the left illustrate how positive and negative samples are obtained for each classification task. Blue bars represent the PlantCaduceus model with 32 layers. Gray bars denote three DNA language models: NT-v2, AgroNT, and GPN. Light gray bars represent a traditional supervised model, a hybrid of CNN and LSTM. The gray dashed line in each panel indicates the baseline for each dataset, corresponding to the negative sample ratio.
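A short sketch may help in reading the PRAUC bars against the dashed baseline. It assumes that the baseline is the precision an uninformative classifier would reach, which equals the fraction of positive samples and is therefore set by how many negatives are sampled per positive; the caption does not spell this out. PRAUC is approximated here by average precision, and the labels and scores are made-up toy values, not data from the paper.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each positive.

    Ties in score are kept in input order (Python's sort is stable).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


def positive_fraction(labels):
    """Expected precision of a classifier that ranks samples at random."""
    return sum(labels) / len(labels)


# Toy dataset: 2 positives and 8 negatives (illustrative values only).
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
perfect = [0.9, 0.1, 0.2, 0.1, 0.8, 0.3, 0.2, 0.1, 0.0, 0.1]
imperfect = [0.9, 0.1, 0.85, 0.1, 0.8, 0.3, 0.2, 0.1, 0.0, 0.1]

ap_perfect = average_precision(labels, perfect)
ap_imperfect = average_precision(labels, imperfect)
baseline = positive_fraction(labels)
print(ap_perfect, round(ap_imperfect, 4), baseline)
```

Because the baseline depends on class balance, PRAUC values are only comparable across panels relative to each panel's own dashed line.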
Fig. 3.
Evolutionary constraint prediction. (A) Illustration of the evolutionary conservation data curation. (B) ROC and (C) PR curves of different models in sorghum. (D) ROC and (E) PR curves of transferring different models trained in sorghum to unseen maize data.
Fig. 4.
Deleterious mutation identification in maize. (A) The zero-shot strategy of PlantCaduceus for identifying deleterious mutations. (B) The zero-shot score distribution of different types of variants generated by in silico mutagenesis on maize chromosome 8. (C) The zero-shot score distribution of 9.4 million SNPs in the maize HapMap3 population. (D) The MAF of putative deleterious mutations prioritized by different models in maize.
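The caption does not define the zero-shot score, so the sketch below rests on an assumption: that the score is the log-likelihood ratio between the alternate and reference alleles at a masked position, a common choice for DNA language models. Under that assumption, more negative scores mark substitutions the model finds less plausible. The probabilities are invented for illustration and the function names are hypothetical.

```python
import math


def zero_shot_score(probs, ref, alt):
    """Assumed score: log p(alt) - log p(ref) at the masked position."""
    return math.log(probs[alt]) - math.log(probs[ref])


def in_silico_mutagenesis(probs, ref):
    """Score all three possible substitutions at one position."""
    return {alt: zero_shot_score(probs, ref, alt) for alt in "ACGT" if alt != ref}


# Made-up model output for a position where the reference base "A" is
# strongly preferred, as might happen at a conserved site.
example_probs = {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02}
scores = in_silico_mutagenesis(example_probs, ref="A")
most_deleterious = min(scores, key=scores.get)
print(most_deleterious, round(scores[most_deleterious], 3))
```

Scoring needs only the model's predicted base probabilities, with no labeled variants, which is what makes the strategy zero-shot.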
Fig. 5.
The causal mutation in the Su1 locus. (A) Manhattan plot of the sweet corn trait in the region from 43.0 to 46.0 Mb on chromosome 4. (B) The zero-shot scores of SNPs in the 43.0 to 46.0 Mb region of chromosome 4, the same region as in (A). (C) Scatter plot of zero-shot scores from PlantCaduceus vs. −log10(P) values from the GWAS result. The horizontal dashed line indicates the GWAS significance threshold (Bonferroni threshold: 0.05/N; N = 2,072,522), and the vertical dashed line marks the top 0.1% of zero-shot scores. (D) Zoomed-in view of the causal variant region and the Su1 gene structure.
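The position of the horizontal dashed line in panel (C) follows directly from the numbers in the caption. The sketch below works through that arithmetic; the function names are hypothetical, and only the significance level 0.05 and N = 2,072,522 come from the caption.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


N_SNPS = 2_072_522  # number of tests, from the caption
threshold = bonferroni_threshold(0.05, N_SNPS)
neg_log10_threshold = -math.log10(threshold)


def is_significant(p_value, cutoff=threshold):
    """True if a SNP's GWAS P-value passes the Bonferroni cutoff."""
    return p_value < cutoff


print(f"{threshold:.4e}", round(neg_log10_threshold, 2))
```

The cutoff is about 2.41e-8, so the dashed line sits near 7.62 on the −log10(P) axis.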

Comment in

  • Decoding nature's grammar with DNA language models.
    Morrell PL, Pakhomov SV. Proc Natl Acad Sci U S A. 2025 Jul 22;122(29):e2512889122. doi: 10.1073/pnas.2512889122. Epub 2025 Jul 14. PMID: 40658864.
