[Preprint]. 2024 Aug 22:2024.06.04.596709.
doi: 10.1101/2024.06.04.596709.

Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model

Jingjing Zhai et al. bioRxiv.

Abstract

Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and, when fine-tuned on limited labeled data, offer better cross-species prediction than supervised models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks (predicting translation initiation sites, translation termination sites, splice donor sites, and splice acceptor sites) demonstrated high transferability to maize, which diverged from Arabidopsis 160 million years ago, outperforming the best existing DNA LM by 1.45- to 7.23-fold. PlantCaduceus is competitive with state-of-the-art protein LMs in identifying deleterious mutations and is threefold better than PhyloP. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig 1. Overview of PlantCaduceus.
(A) Phylogenetic tree of 16 Angiosperm species used for pre-training the PlantCaduceus model. (B) The input for PlantCaduceus consists of 512-bp DNA sequences with 15% of positions randomly masked. The pre-training objective is cross-entropy loss on the masked positions. The sequences are processed through the bi-directional Caduceus architecture, which is based on the Mamba sequence operator—a recently proposed structured state space model. Caduceus also contains a reverse complement equivariance inductive bias. (C) UMAP visualization of embeddings from PlantCaduceus (32 layers) averaged over non-overlapping 100-bp windows along the sorghum genome without intergenic regions. (D) The same UMAP visualization as in (C) but with intergenic regions.
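The masking step and loss described in (B) can be sketched in a few lines. This is a minimal illustration, not the authors' training code: the names `mask_sequence` and `masked_cross_entropy`, the use of `N` as a mask symbol, and the choice to mask exactly 15% of positions (rather than masking each position independently) are all assumptions, and a uniform predictor stands in for the model.

```python
import math
import random

ALPHABET = "ACGT"
MASK = "N"  # hypothetical mask symbol; the real tokenizer may differ


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Replace a random subset of positions with MASK.

    Assumption: exactly round(len(seq) * mask_rate) positions are masked.
    Returns the masked sequence and the sorted masked positions.
    """
    rng = rng or random.Random(0)
    n_mask = round(len(seq) * mask_rate)
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    masked = list(seq)
    for p in positions:
        masked[p] = MASK
    return "".join(masked), positions


def masked_cross_entropy(probs, seq, positions):
    """Mean negative log-probability of the true base at masked positions.

    probs[i] maps each base to its predicted probability at position i.
    """
    total = -sum(math.log(probs[p][seq[p]]) for p in positions)
    return total / len(positions)


# A random 512-bp sequence stands in for a genomic window.
rng = random.Random(42)
seq = "".join(rng.choice(ALPHABET) for _ in range(512))
masked, positions = mask_sequence(seq, rng=rng)

# Stand-in "model": a uniform distribution over the four bases everywhere.
uniform = [{base: 0.25 for base in ALPHABET} for _ in seq]
loss = masked_cross_entropy(uniform, seq, positions)
print(len(positions), round(loss, 4))
```

With a uniform predictor the loss equals ln 4, which is the value a trained model would need to beat.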
Fig 2. Modeling translation and transcription through fine-tuning PlantCaduceus.
(A) Fine-tuning strategy for PlantCaduceus: The weights of the pre-trained PlantCaduceus model are kept frozen during fine-tuning. The last hidden state of PlantCaduceus is then used as features for the XGBoost model. (B) Phylogenetic tree of species used for training, validation, and testing during the fine-tuning of PlantCaduceus. (C-F) Bar plots displaying the PRAUC scores for six species across four tasks: TIS (C), TTS (D), splice donor (E), and splice acceptor (F). The gene structures on the left illustrate how positive and negative samples are obtained for each classification task. Blue bars represent the PlantCaduceus model with 32 layers. Gray bars denote three DNA language models: NT-v2, AgroNT, and GPN. Light gray bars represent a traditional supervised model, a hybrid of CNN and LSTM. The gray dashed line in each panel indicates the baseline for each dataset, corresponding to the negative sample ratio.
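The PRAUC metric in panels C-F summarizes a precision-recall curve in one number. The caption does not say how it was computed; the sketch below uses average precision, a common estimator of the area under the PR curve, on made-up labels and scores. The function name `average_precision` is illustrative, and tied scores are not handled specially.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision values at each true positive,
    taken in order of decreasing score. Ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# Made-up example: 1 = true site (e.g. a real TIS), 0 = negative sample.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
perfect = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(round(ap, 4), perfect)
```

For a random ranking, average precision is expected to be close to the fraction of positive samples, which is presumably what the dashed per-dataset baseline reflects.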
Fig 3. Evolutionary constraint prediction.
(A) Illustration of the evolutionary conservation data curation. (B) Receiver operating characteristic (ROC) and (C) precision-recall (PR) curves of different models in sorghum. (D) ROC and (E) PR curves of transferring different models trained in sorghum to unseen maize data.
Fig 4. Deleterious mutation identification in maize.
(A) The zero-shot strategy of PlantCaduceus for identifying deleterious mutations. (B) The zero-shot score distribution of different types of variants generated by in silico mutagenesis on maize chromosome 8. (C) The zero-shot score distribution of 9.4M SNPs in the maize Hapmap3 population. (D) The minor allele frequency (MAF) of putative deleterious mutations prioritized by different models in maize.
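The caption does not define the zero-shot score. One common formulation for masked DNA language models, assumed here, is the log-ratio of the probabilities the model assigns to the alternate and reference alleles at a masked site. The sketch below uses made-up probabilities; `zero_shot_score` and `in_silico_mutagenesis` are hypothetical names, not functions from the PlantCaduceus code.

```python
import math


def zero_shot_score(probs, ref, alt):
    """Log-likelihood ratio of alt vs. ref at one masked position.

    probs maps each base to the model's predicted probability there.
    Negative scores mean the model finds the alternate allele less
    likely than the reference (assumed formulation).
    """
    return math.log(probs[alt]) - math.log(probs[ref])


def in_silico_mutagenesis(probs, ref):
    """Score every possible substitution away from the reference base."""
    return {alt: zero_shot_score(probs, ref, alt) for alt in probs if alt != ref}


# Made-up model output for one masked site whose reference base is A.
site_probs = {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02}
score = zero_shot_score(site_probs, ref="A", alt="T")
all_scores = in_silico_mutagenesis(site_probs, ref="A")
print(round(score, 4), sorted(all_scores))
```

Under this formulation, a strongly negative score marks a substitution at a site where the model is confident about the reference base, which is the kind of variant a deleteriousness ranking would prioritize.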
Fig 5. The causal mutation in the Su1 locus.
(A) Manhattan plot of the sweet corn trait in the region from 43.0 to 46.0 Mb on chromosome 4. (B) The zero-shot scores of SNPs from 43.0 to 46.0 Mb on chromosome 4, the same region as in (A). (C) Scatter plot of zero-shot scores from PlantCaduceus versus −log10(P) values from the GWAS results. The horizontal dashed line indicates the GWAS significance threshold (Bonferroni threshold: 0.05/N; N=2,072,522), and the vertical dashed line marks the top 0.1% of zero-shot scores. (D) Zoomed-in view of the causal variant region and the Su1 gene structure.
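The Bonferroni threshold quoted in (C) is simple arithmetic, worked through below on the −log10(P) scale used by the plot. The constants come from the caption; the variable and function names and the example p-values are illustrative only.

```python
import math

ALPHA = 0.05          # family-wise significance level from the caption
N_TESTS = 2_072_522   # number of tests (N) from the caption

p_threshold = ALPHA / N_TESTS
log_threshold = -math.log10(p_threshold)


def is_significant(p_value):
    """True when a p-value passes the Bonferroni-corrected threshold."""
    return p_value < p_threshold


print(f"{p_threshold:.3e}", round(log_threshold, 2))

# Made-up p-values, only to show how the threshold is applied.
example_p_values = [1e-9, 1e-6]
flags = [is_significant(p) for p in example_p_values]
```

A SNP therefore sits above the horizontal dashed line only if its p-value is below roughly 2.4e-8, that is, if −log10(P) exceeds about 7.62.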
