Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Mar 31;16(4):424.
doi: 10.3390/genes16040424.

LMFE: A Novel Method for Predicting Plant LncRNA Based on Multi-Feature Fusion and Ensemble Learning

Affiliations

LMFE: A Novel Method for Predicting Plant LncRNA Based on Multi-Feature Fusion and Ensemble Learning

Hongwei Zhang et al. Genes (Basel). .

Abstract

Background/Objectives: Long non-coding RNAs (lncRNAs) play a crucial regulatory role in plant trait expression and disease management, making their accurate prediction a key research focus for guiding biological experiments. While extensive studies have been conducted on animals and humans, plant lncRNA research remains relatively limited due to various challenges, such as data scarcity and genomic complexity. This study aims to bridge this gap by developing an effective computational method for predicting plant lncRNAs, specifically by classifying transcribed RNA sequences as lncRNAs or mRNAs using multi-feature analysis. Methods: We propose the lncRNA multi-feature-fusion ensemble learning (LMFE) approach, a novel method that integrates 100-dimensional features from RNA biological properties-based, sequence-based, and structure-based features, employing the XGBoost ensemble learning algorithm for prediction. To address unbalanced datasets, we implemented the synthetic minority oversampling technique (SMOTE). LMFE was validated across benchmark datasets, cross-species datasets, unbalanced datasets, and independent datasets. Results: LMFE achieved an accuracy of 99.42%, an F1score of 0.99, and an MCC of 0.98 on the benchmark dataset, with robust cross-species performance (accuracy ranging from 89.30% to 99.81%). On unbalanced datasets, LMFE attained an average accuracy of 99.41%, representing a 12.29% improvement over traditional methods without SMOTE (average ACC of 87.12%). Compared to state-of-the-art methods, such as CPC2 and PLEKv2, LMFE consistently outperformed them across multiple metrics on independent datasets (with an accuracy ranging from 97.33% to 99.21%), with redundant features having minimal impact on performance. Conclusions: LMFE provides a highly accurate and generalizable solution for plant lncRNA prediction, outperforming existing methods through multi-feature fusion and ensemble learning while demonstrating robustness to redundant features. Despite its effectiveness, variations in performance across species highlight the necessity for future improvements in managing diverse plant genomes. This method represents a valuable tool for advancing plant lncRNA research and guiding biological experiments.

Keywords: LMFE; cross-species; ensemble learning; multi-feature fusion; plant lncRNA prediction.

PubMed Disclaimer

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1
Figure 1
The overall framework of the LMFE consists of three steps. The first step involves the data preparation stage, where we obtained the lncRNA data as positive samples and the mRNA sequences as negative samples. In the second step, we focus on sequence representation and feature extraction, comprehensively capturing sequence features by considering the biological properties of RNA, sequence-based features, and structure-based features. Finally, we constructed an extreme gradient boosting (XGBoost) [23] based on ensemble learning to predict lncRNAs.
Figure 2
Figure 2
The comparison results between XGBoost and other methods. (A) The comparison of the ROC curves and AUC values of XGBoost and other methods. (B) The ACC compared with other mainstream methods; it can be observed that XGBoost achieves good performance, while GBDT, BG, and SVM are slightly better than other methods. (C) The time consumed by different classifiers on the training set. From the figure, it can be seen that SVM takes the longest time, while XGBoost’s advantage lies in its efficient performance.
Figure 3
Figure 3
Illustrates the performance of the model under different feature fusions. Analysis of the evaluation metrics reveals that some metrics, such as ACC, SN, and SP, showed higher values for F1 compared to F2, F3, and F4. F2 yielded higher values than F3 and F4. These results indicate that sequence-based features have a significant impact on model performance. This could be attributed to studies that have observed the lack of secondary structure conservation in lncRNAs of certain species [37], suggesting that secondary structure may not be as important for predicting lncRNAs as previously believed. These findings are consistent with the results of this study.
Figure 4
Figure 4
The ranking of the top 20 most important features. The descriptions of the feature abbreviations are as follows: ORFs_Coverage (ORF coverage), ORFs_Count (ORF count), ORFs_Length (ORF length), SeqLength (sequence length), y_axis (y asix), Num_unpaired_bases (the number of unpaired bases), Num_Base_pairs (the number of base pairs), and x_axis, (x axis). The figure highlights that ORF-related features dominate, with ORF coverage being the most significant, followed by ORF count and ORF length, reflecting their critical role in distinguishing lncRNAs from mRNAs due to the former’s lower ORF presence. Tri-nucleotide compositions, such as UAG and UUG, also rank highly, indicating their relevance in capturing sequence-level differences, particularly since stop codons, such as UAG, are more common in mRNAs. Sequence length and nucleotide compositions (e.g., A, C, GA) contribute moderately, while structural features (e.g., num_unpaired_bases, Num_Base_pairs) and Z-curve components (y_axis, x_axis) have lower importance (~0.03–0.04), suggesting that sequence-based features are more discriminatory than structural ones for lncRNA prediction in plants.
Figure 5
Figure 5
Illustrates that, with the reintroduction of redundant features, the ACC and precision metrics show an overall upward trend, increasing from 96.30% and 95.45% to 99.41% and 99.32%, respectively, with improvement rates of 3.11% and 3.87%. However, slight fluctuations were observed; for instance, adding “num_unpaired_bases” increased the ACC to 96.44%, while precision slightly decreased to 95.38%, indicating potential fluctuations due to its direct correlation with RNA structural stability. The most significant improvement occurred after the addition of “Dorfs_length”, with the ACC increasing from 95.58% to 98.43%, reflecting the crucial role of ORF-related features in distinguishing lncRNAs from mRNAs. However, with the addition of feature “C”, slight fluctuations in ACC and precision were noted, decreasing from 98.84% and 98.88% to 98.62% and 98.40%, respectively. This indicates that these features may introduce noise or overfitting to certain samples, possibly due to their high correlation with existing features, such as nucleotide composition. As the reintroduction process neared its conclusion, with the introduction of features, such as “Dgcc” and “Duaa”, both ACC and precision were restored. The ACC ultimately stabilized at 99.32% to 99.40%, with precision stabilizing at around 99.32% to 99.41%, nearing the performance of all features, which had 99.32% ACC and 99.41% precision. This analysis indicates that, although some redundant features (such as “Dnum_au_pairs”) temporarily compromise precision, the overall trend supports their inclusion in the complete feature set, as they collectively enhance LMFE’s ability to capture subtle patterns in RNA sequences, especially when balanced with biologically significant features, such as ORF coverage. These findings emphasize the robustness of XGBoost in processing relevant features and suggest that careful feature selection can alleviate transient performance degradation. We will further explore this consideration in future work by integrating advanced feature selection techniques.
Figure 6
Figure 6
(A) The performance metrics of precision and recall for LMFE trained on the A. thaliana dataset and evaluated on other species. The LMFE demonstrates excellent performance on the A. thaliana dataset, with precision and recall values approaching 99.72% and 99.81%, respectively. This indicates a strong adaptability to the characteristics of the species. The verification results for other species reveal that the precision remains relatively stable, with a slight decrease observed in S. lycopersicum. The highest precision is recorded at 99.31% for P. trichocarpa, while the lowest precision is 88.23% for S. mollendorffii. In terms of recall, all species maintain values above 98%, with the highest recall at 99.68% for G. sulphuraria and the lowest recall at 95.78% for S. mollendorffii. Overall, LMFE exhibits a consistent trend in precision and recall across different species, with only minor fluctuations, suggesting its robustness and potential for broad applicability in various biological contexts. (B) The ROC curves for LMFE trained on the A. thaliana dataset and assessed across various species. The ROC curve serves as a graphical representation of performance, plotting the true positive rate against the false positive rate at various threshold settings. The ROC curve for A. thaliana is nearly perfect, signifying that LMFE excels at distinguishing positive samples with minimal false positives. The AUC value of 1.00 indicates that LMFE can correctly identify the majority of positive samples. Other species, such as V. radiata and Z. mays, also demonstrate strong performances, with AUC values of 1.00 and 0.99, respectively. This suggests that LMFE maintains a high level of accuracy for these species. However, the ROC curve for S. moellendorffii is comparatively lower, with an AUC value of 0.97. While still indicating good performance, this suggests that LMFE’s performance on this species is slightly less robust than on others, potentially due to differences in data characteristics. Overall, LMFE exhibits excellent training results on the A. thaliana dataset and demonstrates strong performance across different species.
Figure 7
Figure 7
The confusion matrix for LMFE’s performance on G. max dataset after applying SMOTE, illustrating its ability to classify lncRNA and mRNA sequences. After SMOTE, the dataset was balanced to include 4000 true lncRNA samples and 4000 true mRNA samples. The matrix shows that out of 4000 true lncRNA samples, 3976 were correctly predicted as lncRNA (true positives), while 24 were misclassified as mRNA (false negatives). Conversely, out of 4000 true mRNA samples, 3947 were correctly predicted as mRNA (true negatives), but 53 were misclassified as lncRNA (false positives). The high values along the diagonal (3976 and 3947) and the low off-diagonal values (24 and 53) indicate a low error rate, demonstrating that LMFE accurately distinguishes between lncRNA and mRNA in most cases, with strong recognition ability for both categories despite the unbalanced dataset after applying SMOTE. Experimental results for other species are shown in Figures S10–S12 in the Supplementary Materials.
Figure 8
Figure 8
(A) Demonstrates that in the V. angularis, the AUC value of LMFE is 1.00, indicating excellent performance in distinguishing positive and negative samples, achieving perfect true positive rates at nearly all thresholds. CPC2, LGC, and PlncRNA-Hdeep also show perfect capabilities. In contrast, CNCI obtained a slightly lower AUC value of 0.87. (B) Confirms that LMFE demonstrated superior performance across all metrics, achieving a score of 0.99, reflecting extremely high accuracy and comprehensiveness. LGC and CPC2 closely followed, with a precision of 0.99 and recall and F1scores of 0.93 and 0.96, respectively, indicating a good balance. PlncRNA-Hdeep achieved a precision of 0.96, recall of 0.93, and F1score of 0.95, showcasing its effectiveness. PLEKv2 achieved a precision of 0.91, recall of 0.83, and F1score of 0.87; its low recall suggests that its ability to identify positive samples requires optimization. Conversely, CNCI exhibited the poorest performance, with a precision of 0.87, recall of only 0.75, and F1score of 0.80, indicating significant deficiencies in identifying positive samples.
Figure 8
Figure 8
(A) Demonstrates that in the V. angularis, the AUC value of LMFE is 1.00, indicating excellent performance in distinguishing positive and negative samples, achieving perfect true positive rates at nearly all thresholds. CPC2, LGC, and PlncRNA-Hdeep also show perfect capabilities. In contrast, CNCI obtained a slightly lower AUC value of 0.87. (B) Confirms that LMFE demonstrated superior performance across all metrics, achieving a score of 0.99, reflecting extremely high accuracy and comprehensiveness. LGC and CPC2 closely followed, with a precision of 0.99 and recall and F1scores of 0.93 and 0.96, respectively, indicating a good balance. PlncRNA-Hdeep achieved a precision of 0.96, recall of 0.93, and F1score of 0.95, showcasing its effectiveness. PLEKv2 achieved a precision of 0.91, recall of 0.83, and F1score of 0.87; its low recall suggests that its ability to identify positive samples requires optimization. Conversely, CNCI exhibited the poorest performance, with a precision of 0.87, recall of only 0.75, and F1score of 0.80, indicating significant deficiencies in identifying positive samples.

Similar articles

References

    1. Gauthier J., Vincent A.T., Charette S.J., Derome N. A brief history of bioinformatics. Brief. Bioinform. 2019;20:1981–1996. - PubMed
    1. Liu S., Li X., Xie Q., Zhang S., Liang X., Li S., Zhang P. Identification of a lncRNA/circRNA-miRNA-mRNA network in Nasopharyngeal Carcinoma by deep sequencing and bioinformatics analysis. J. Cancer. 2024;15:1916. - PMC - PubMed
    1. Hubé F., Francastel C. Coding and non-coding RNAs, the frontier has never been so blurred. Front. Genet. 2018;9:369172 - PMC - PubMed
    1. Xu D., Yuan W., Fan C., Liu B., Lu M.Z., Zhang J. Opportunities and Challenges of Predictive Approaches for the Non-coding RNA in Plants. Front. Plant Sci. 2022;13:890663. - PMC - PubMed
    1. Shi K., Liu T., Fu H., Li W., Zheng X. Genome-wide analysis of lncRNA stability in human. PLoS Comput. Biol. 2021;17:e1008918. - PMC - PubMed

LinkOut - more resources