Plant Commun. 2025 Mar 10;6(3):101223.
doi: 10.1016/j.xplc.2024.101223. Epub 2024 Dec 16.

Cropformer: An interpretable deep learning framework for crop genomic prediction

Hao Wang et al. Plant Commun.

Abstract

Machine learning and deep learning are extensively employed in genomic selection (GS) to expedite the identification of superior genotypes and accelerate breeding cycles. However, a significant challenge with current data-driven deep learning models in GS lies in their low robustness and poor interpretability. To address these challenges, we developed Cropformer, a deep learning framework for predicting crop phenotypes and exploring downstream tasks. This framework combines convolutional neural networks with multiple self-attention mechanisms to improve accuracy. The ability of Cropformer to predict complex phenotypic traits was extensively evaluated on more than 20 traits across five major crops: maize, rice, wheat, foxtail millet, and tomato. Evaluation results show that Cropformer outperforms other GS methods in both precision and robustness, achieving up to a 7.5% improvement in prediction accuracy compared to the runner-up model. Additionally, Cropformer enhances the analysis and mining of genes associated with traits. We identified numerous single nucleotide polymorphisms (SNPs) with potential effects on maize phenotypic traits and revealed key genetic variations underlying these differences. Cropformer represents a significant advancement in predictive performance and gene identification, providing a powerful general tool for improving genomic design in crop breeding. Cropformer is freely accessible at https://cgris.net/cropformer.

Keywords: deep learning; genomic selection; multiple self-attention mechanisms; phenotypic prediction.

Figures

Figure 1
Workflow of the proposed Cropformer framework. (A) Collection of genotype information for five crop species. The genotype information was converted to a one-hot encoding and fed into the neural network for trait prediction. (B) The Cropformer model consists mainly of a CNN layer and a multi-head self-attention layer. The CNN layer captures the local signals of SNPs, while multi-head self-attention directs the model's focus toward important SNPs. (C) From left to right: results of haplotype analysis, attention weight visualization, feature importance assessment (SHapley Additive exPlanations [SHAP]-based explanation of machine learning model outputs), and clustering analysis.
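The one-hot conversion mentioned in panel (A) can be illustrated with a minimal sketch. The genotype coding used here (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate) and the function name `one_hot_encode` are assumptions made for illustration; the caption does not state the exact encoding used by Cropformer.

```python
from typing import List

# Assumed genotype codes (hypothetical, not taken from the paper):
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate.
N_STATES = 3


def one_hot_encode(genotypes: List[int], n_states: int = N_STATES) -> List[List[int]]:
    """Return one row per SNP, with a single 1 at the index of its genotype code."""
    encoded = []
    for g in genotypes:
        if not 0 <= g < n_states:
            raise ValueError(f"unexpected genotype code: {g}")
        row = [0] * n_states
        row[g] = 1
        encoded.append(row)
    return encoded


# Made-up genotype vector for one accession across four SNPs.
sample = [0, 2, 1, 0]
encoded_sample = one_hot_encode(sample)
```

Each accession then becomes a matrix with one row per SNP, which is a shape a convolutional layer could plausibly consume.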
Figure 2
Predictive performance of the Cropformer model on maize data (Train and Test datasets, regression task). (A) Phenotypic distributions of ear weight (EW), plant height (PH), and days to tasseling (DTT) in the maize training and test datasets. (B) Comparison of the predictive performance of different models on DTT, PH, and EW in maize across the training (nested cross-validation) and test datasets. These models include Cropformer, CropGBM, DNNGP, XGBoost, SVR, MLP, rrBLUP, and DEM. Model performance was assessed using the Pearson correlation coefficient.
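The Pearson correlation coefficient used to assess these models can be computed as in the sketch below. The function name `pearson_r` and the example values are illustrative assumptions, not code or data from the paper.

```python
import math
from typing import Sequence


def pearson_r(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation between observed and predicted phenotype values."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_o * var_p)


# Made-up days-to-tasseling values, for illustration only.
observed_dtt = [60.0, 62.0, 65.0, 70.0]
predicted_dtt = [61.0, 63.0, 66.0, 71.0]
r = pearson_r(observed_dtt, predicted_dtt)
```

Because the coefficient measures linear association rather than absolute error, a model whose predictions are offset by a constant, as in this toy example, still scores close to 1.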
Figure 3
Predictive performance of the Cropformer model on the test datasets of wheat, foxtail millet, rice, and tomato (continuous traits, regression tasks). (A) Comparison of predictive performance of different algorithms for thousand-kernel weight (TKW), grain width (GW), grain hardness (GH), grain protein (GP), and grain length (GL) on the wheat dataset. (B) Comparison of predictive performance of different algorithms for straw weight data from Anyang, Beijing, Changzhi, Dingxi, and Urumqi on the foxtail millet dataset. (C) Predictive performance of different algorithms for Culm_length, Days_to_heading_2018H, Grain_length_width_ratio, Plant_height_2018HN, and Thousand_grain_weight on the rice dataset. (D) Comparison of modeling performance of different algorithms for the Sopim_BGV006775_12T001232 trait on the tomato dataset based on genomic variation information including single nucleotide polymorphisms (SNPs), insertions and deletions (InDels), gene expression (GE), structural variations (SV), and the fusion of these four types of information.
Figure 4
Classification performance of the Cropformer model on the maize dataset (10 000 SNPs, classification task). (A) UMAP visualization of all SNPs and the 10 000 SNPs extracted from the MIC. From left to right, SNPs are categorized into three and two classes, respectively. (B) Comparison of the accuracy of different models on the maize training (nested cross-validation) and test datasets. (C) Comprehensive evaluation of the Cropformer model’s predictive performance on the maize test dataset using five metrics: Accuracy, Precision, Recall, F1_score, and Area under the curve (AUC). (D) Comparison of different models for classification of early flowering time (first 25% DTT), moderate flowering time (25%–75% DTT), and late flowering time (last 25% DTT) using 10 000 SNPs. The numbers in brackets represent AUC values.
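The three flowering-time classes in panel (D) are defined by DTT quartiles. A minimal sketch of such a labeling step follows; the function name `flowering_classes`, the choice of the "inclusive" quantile method, and the handling of values that fall exactly on a cut point are all assumptions, since the caption does not specify them.

```python
import statistics
from typing import List, Sequence


def flowering_classes(dtt: Sequence[float]) -> List[str]:
    """Label each accession as early, moderate, or late by DTT quartile."""
    # quantiles(n=4) returns the three cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(dtt, n=4, method="inclusive")
    labels = []
    for value in dtt:
        if value <= q1:        # assumed: boundary values count as early
            labels.append("early")
        elif value >= q3:      # assumed: boundary values count as late
            labels.append("late")
        else:
            labels.append("moderate")
    return labels


# Made-up days-to-tasseling values, for illustration only.
dtt_values = [55, 58, 60, 62, 64, 66, 68, 70, 75]
labels = flowering_classes(dtt_values)
```

With real data, ties at the cut points mean the early and late groups will not always contain exactly 25% of the accessions.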
Figure 5
Inferring the contribution of SNPs to GS by Cropformer in regression tasks. (A) Mapping of attention weights to SNPs for the maize DTT trait (Regression). The x axis represents the SNP index position; the y axis represents attention weights. Only SNPs with attention weights greater than 1 are shown. (B) Comparison of traits among haplotypes. Shown are DTT comparisons among accessions harboring different haplotypes of Zm00001d008941 and Zm00001d011956. (C) Haplotype network for Zm00001d008941. Circles represent haplotypes, which are linked to their most similar relatives. Short lines indicate the diversity among linked haplotypes. (D) Gene structure and haplotypes of Zm00001d008941 in maize. The consensus genotype for each haplotype is marked in gray, light blue, and dark blue to represent the reference genotype, heterozygous mutation, and homozygous mutation, respectively. The purple bar graph shows the feature importance analysis based on XGBoost (Regression). (E) Haplotype network for Zm00001d011956. (F) Gene structure and haplotypes of Zm00001d011956 in maize. The purple bar graph shows the feature importance analysis based on XGBoost (Regression).
Figure 6
Cropformer web servers.

