Genes (Basel). 2025 Mar 31;16(4):424.

doi: 10.3390/genes16040424.

LMFE: A Novel Method for Predicting Plant LncRNA Based on Multi-Feature Fusion and Ensemble Learning

Hongwei Zhang¹, Yan Shi², Yapeng Wang¹, Xu Yang¹, Kefeng Li¹, Sio-Kei Im¹, Yu Han³

Affiliations

¹ Faculty of Applied Sciences, Macao Polytechnic University, Macau SAR 999074, China.
² State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China.
³ Faculty of Civil Engineering, Southwest Forestry University, Kunming 650224, China.

PMID: 40282384
PMCID: PMC12026654
DOI: 10.3390/genes16040424

Hongwei Zhang et al. Genes (Basel). 2025.

Abstract

Background/Objectives: Long non-coding RNAs (lncRNAs) play a crucial regulatory role in plant trait expression and disease management, making their accurate prediction a key research focus for guiding biological experiments. While lncRNAs have been studied extensively in animals and humans, plant lncRNA research remains relatively limited because of challenges such as data scarcity and genomic complexity. This study aims to bridge this gap by developing an effective computational method for predicting plant lncRNAs, specifically by classifying transcribed RNA sequences as lncRNAs or mRNAs using multi-feature analysis.

Methods: We propose the lncRNA multi-feature-fusion ensemble learning (LMFE) approach, a novel method that fuses RNA biological property-based, sequence-based, and structure-based features into a 100-dimensional feature vector and uses the XGBoost ensemble learning algorithm for prediction. To address unbalanced datasets, we implemented the synthetic minority oversampling technique (SMOTE). LMFE was validated on benchmark datasets, cross-species datasets, unbalanced datasets, and independent datasets.

Results: LMFE achieved an accuracy of 99.42%, an F1_score of 0.99, and an MCC of 0.98 on the benchmark dataset, with robust cross-species performance (accuracy ranging from 89.30% to 99.81%). On unbalanced datasets, LMFE attained an average accuracy of 99.41%, a 12.29 percentage-point improvement over traditional methods without SMOTE (average ACC of 87.12%). On independent datasets, LMFE consistently outperformed state-of-the-art methods, such as CPC2 and PLEKv2, across multiple metrics (accuracy ranging from 97.33% to 99.21%), and redundant features had minimal impact on its performance.

Conclusions: LMFE provides a highly accurate and generalizable solution for plant lncRNA prediction, outperforming existing methods through multi-feature fusion and ensemble learning while demonstrating robustness to redundant features. Despite its effectiveness, variations in performance across species highlight the need for future improvements in handling diverse plant genomes. This method represents a valuable tool for advancing plant lncRNA research and guiding biological experiments.
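The oversampling idea behind SMOTE, mentioned in the Methods, can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not say which SMOTE implementation or parameters were used. The function name `smote_like_oversample`, the neighbour count `k`, and the toy feature vectors are all assumptions made for illustration; only the core step (interpolating between a minority sample and one of its nearest minority neighbours) follows the published technique.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the chosen one.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sq_dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy 2-D feature vectors standing in for the minority class (made-up values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies, which is why the technique balances class counts without simply duplicating samples.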

Keywords: LMFE; cross-species; ensemble learning; multi-feature fusion; plant lncRNA prediction.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

**Figure 1**
The overall framework of LMFE consists of three steps. The first step is data preparation, in which lncRNA sequences were obtained as positive samples and mRNA sequences as negative samples. The second step is sequence representation and feature extraction, which captures sequence characteristics comprehensively by combining the biological properties of RNA, sequence-based features, and structure-based features. In the final step, an extreme gradient boosting (XGBoost) [23] classifier, an ensemble learning method, is constructed to predict lncRNAs.
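To make the feature-extraction step more concrete, the following sketch computes a few sequence-derived features of the kind named in this record: sequence length, ORF length and coverage, and tri-nucleotide frequencies. The ORF definition used here (longest AUG-to-stop stretch in the three forward frames, stop codon included) and the helper names are assumptions for illustration; the paper's exact definitions are not given on this page.

```python
from itertools import product

STOP_CODONS = {"UAA", "UAG", "UGA"}

def normalise(seq):
    """Upper-case the sequence and write it in the RNA alphabet."""
    return seq.upper().replace("T", "U")

def longest_orf_length(seq):
    """Length in nt of the longest AUG...stop ORF over the three forward
    frames, counting the stop codon (an assumed definition)."""
    seq = normalise(seq)
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "AUG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best

def kmer_frequencies(seq, k=3):
    """Relative frequency of every k-mer over the alphabet A, C, G, U."""
    seq = normalise(seq)
    counts = {"".join(p): 0 for p in product("ACGU", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases
            counts[kmer] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}

def sequence_features(seq):
    """Fuse length, ORF, and tri-nucleotide features into one dictionary."""
    length = len(seq)
    orf_len = longest_orf_length(seq)
    features = {
        "SeqLength": length,
        "ORFs_Length": orf_len,
        "ORFs_Coverage": orf_len / length if length else 0.0,
    }
    features.update(kmer_frequencies(seq, k=3))
    return features

# Made-up toy sequence with one ORF (AUG AAA CCC UAG) in the third frame.
example = sequence_features("GGAUGAAACCCUAGGG")
```

In a full pipeline, dictionaries like `example` would be concatenated with structure-based features before being passed to the classifier.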

**Figure 2**
Comparison between XGBoost and other methods. (A) ROC curves and AUC values of XGBoost and the other methods. (B) ACC compared with other mainstream methods; XGBoost achieves good performance, while GBDT, BG, and SVM are slightly better than the remaining methods. (C) Time consumed by the different classifiers on the training set. SVM takes the longest time, while XGBoost’s advantage lies in its efficiency.

**Figure 3**
Performance of the model under different feature fusions (here F1 to F4 denote feature combinations, not the F1_score). For several evaluation metrics, such as ACC, SN, and SP, F1 showed higher values than F2, F3, and F4, and F2 yielded higher values than F3 and F4. These results indicate that sequence-based features have a significant impact on model performance. A possible explanation is that lncRNAs of certain species have been observed to lack secondary structure conservation [37], suggesting that secondary structure may not be as important for predicting lncRNAs as previously believed. That observation is consistent with the results of this study.

**Figure 4**
The ranking of the top 20 most important features. The descriptions of the feature abbreviations are as follows: ORFs_Coverage (ORF coverage), ORFs_Count (ORF count), ORFs_Length (ORF length), SeqLength (sequence length), y_axis (y axis), Num_unpaired_bases (the number of unpaired bases), Num_Base_pairs (the number of base pairs), and x_axis (x axis). The figure highlights that ORF-related features dominate, with ORF coverage being the most significant, followed by ORF count and ORF length, reflecting their critical role in distinguishing lncRNAs from mRNAs due to the former’s lower ORF presence. Tri-nucleotide compositions, such as UAG and UUG, also rank highly, indicating their relevance in capturing sequence-level differences, particularly since stop codons, such as UAG, are more common in mRNAs. Sequence length and nucleotide compositions (e.g., A, C, GA) contribute moderately, while structural features (e.g., num_unpaired_bases, Num_Base_pairs) and Z-curve components (y_axis, x_axis) have lower importance (~0.03–0.04), suggesting that sequence-based features are more discriminatory than structural ones for lncRNA prediction in plants.
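For readers unfamiliar with the Z-curve components listed above, the sketch below computes the standard Z-curve coordinates from base counts: x contrasts purines with pyrimidines, y contrasts amino with keto bases, and z contrasts weakly with strongly hydrogen-bonded bases. Normalising by sequence length and the function name are assumptions; this page does not state which Z-curve variant LMFE uses.

```python
def z_curve_components(seq):
    """Return length-normalised Z-curve components (x, y, z) of a sequence."""
    seq = seq.upper().replace("T", "U")
    a, c, g, u = (seq.count(base) for base in "ACGU")
    n = a + c + g + u
    if n == 0:
        return (0.0, 0.0, 0.0)
    x = ((a + g) - (c + u)) / n  # purine vs. pyrimidine
    y = ((a + c) - (g + u)) / n  # amino vs. keto
    z = ((a + u) - (g + c)) / n  # weak vs. strong hydrogen bonding
    return (x, y, z)

# Made-up toy sequence: 2 A, 2 G, 1 C, 1 U.
x_axis, y_axis, z_axis = z_curve_components("AAGGCU")
```

For the toy sequence, purines outnumber pyrimidines (four to two), so `x_axis` is positive, while the other two contrasts are balanced and come out as zero.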

**Figure 5**
Shows that, as redundant features are reintroduced, ACC and precision follow an overall upward trend, increasing from 96.30% and 95.45% to 99.41% and 99.32%, respectively (gains of 3.11 and 3.87 percentage points). However, slight fluctuations were observed; for instance, adding “num_unpaired_bases” increased the ACC to 96.44%, while precision slightly decreased to 95.38%, a fluctuation possibly related to this feature’s direct correlation with RNA structural stability. The most significant improvement occurred after the addition of “Dorfs_length”, with the ACC increasing from 95.58% to 98.43%, reflecting the crucial role of ORF-related features in distinguishing lncRNAs from mRNAs. With the addition of feature “C”, ACC and precision dipped slightly, from 98.84% and 98.88% to 98.62% and 98.40%, respectively. This indicates that such features may introduce noise or cause overfitting to certain samples, possibly because they are highly correlated with existing features, such as nucleotide composition. As the reintroduction process neared its conclusion, with the introduction of features such as “Dgcc” and “Duaa”, both ACC and precision recovered. The ACC ultimately stabilized at 99.32% to 99.40%, with precision stabilizing at around 99.32% to 99.41%, nearing the performance of all features, which had 99.32% ACC and 99.41% precision. This analysis indicates that, although some redundant features (such as “Dnum_au_pairs”) temporarily compromise precision, the overall trend supports their inclusion in the complete feature set, as they collectively enhance LMFE’s ability to capture subtle patterns in RNA sequences, especially when balanced with biologically significant features, such as ORF coverage. These findings emphasize the robustness of XGBoost in processing relevant features and suggest that careful feature selection can alleviate transient performance degradation. We will explore this further in future work by integrating advanced feature selection techniques.

**Figure 6**
(A) Precision and recall for LMFE trained on the *A. thaliana* dataset and evaluated on other species. LMFE demonstrates excellent performance on the *A. thaliana* dataset, with precision and recall values approaching 99.72% and 99.81%, respectively, indicating strong adaptation to the characteristics of this species. The validation results for other species show that precision remains relatively stable, with a slight decrease observed in *S. lycopersicum*. The highest precision is 99.31% for *P. trichocarpa*, while the lowest is 88.23% for *S. moellendorffii*. Recall remains high across species, with the highest recall at 99.68% for *G. sulphuraria* and the lowest at 95.78% for *S. moellendorffii*. Overall, LMFE exhibits a consistent trend in precision and recall across different species, with only minor fluctuations, suggesting its robustness and potential for broad applicability in various biological contexts. (B) ROC curves for LMFE trained on the *A. thaliana* dataset and assessed across various species. The ROC curve plots the true positive rate against the false positive rate at various threshold settings. The ROC curve for *A. thaliana* is nearly perfect, signifying that LMFE distinguishes positive samples with minimal false positives; the reported AUC of 1.00 indicates near-perfect separation of positive and negative samples. Other species, such as *V. radiata* and *Z. mays*, also show strong performance, with AUC values of 1.00 and 0.99, respectively. The ROC curve for *S. moellendorffii* is comparatively lower, with an AUC value of 0.97. While this still indicates good performance, it suggests that LMFE is slightly less robust on this species than on others, potentially due to differences in data characteristics. Overall, LMFE exhibits excellent training results on the *A. thaliana* dataset and strong performance across different species.
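The relationship between the ROC curve and the AUC described in panel (B) can be sketched as follows: sweep the decision threshold from high to low, record the (false positive rate, true positive rate) pairs, and integrate with the trapezoidal rule. The labels and scores are made-up toy values, and the function names are hypothetical; nothing here reproduces the paper's data.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both classes")
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: 1 = lncRNA (positive), 0 = mRNA (negative); scores are made up.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

In this toy case one positive is ranked below one negative, so eight of the nine positive-negative pairs are ordered correctly and the area is 8/9; an AUC of exactly 1.0 would require every positive to score above every negative.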

**Figure 7**
The confusion matrix for LMFE’s performance on the *G. max* dataset after applying SMOTE, illustrating its ability to classify lncRNA and mRNA sequences. After SMOTE, the dataset was balanced to include 4000 true lncRNA samples and 4000 true mRNA samples. The matrix shows that out of 4000 true lncRNA samples, 3976 were correctly predicted as lncRNA (true positives), while 24 were misclassified as mRNA (false negatives). Conversely, out of 4000 true mRNA samples, 3947 were correctly predicted as mRNA (true negatives), but 53 were misclassified as lncRNA (false positives). The high values along the diagonal (3976 and 3947) and the low off-diagonal values (24 and 53) indicate a low error rate, demonstrating that LMFE accurately distinguishes between lncRNA and mRNA in most cases, with strong recognition ability for both categories even though the original dataset was unbalanced before SMOTE was applied. Experimental results for other species are shown in Figures S10–S12 in the Supplementary Materials.
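The counts in this caption are enough to recompute the usual binary classification metrics. The sketch below applies the standard definitions of accuracy, precision, recall, F1, and MCC to the four stated counts; the function name and the dictionary layout are illustrative choices, not part of the paper.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Standard metrics from confusion-matrix counts (lncRNA = positive)."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }

# Counts taken from the Figure 7 caption.
metrics = binary_metrics(tp=3976, fn=24, tn=3947, fp=53)
```

With these counts, accuracy is 7923/8000, or about 99.04%, and recall is 3976/4000, or 99.4%.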

**Figure 8**
(A) On the *V. angularis* dataset, the AUC value of LMFE is 1.00, indicating excellent performance in distinguishing positive and negative samples, with near-perfect true positive rates at nearly all thresholds. CPC2, LGC, and PlncRNA-Hdeep show similarly strong discrimination. In contrast, CNCI obtained a lower AUC value of 0.87. (B) LMFE demonstrated superior performance across all metrics, achieving a score of 0.99, reflecting extremely high accuracy and comprehensiveness. LGC and CPC2 followed closely, with a precision of 0.99 and recall and F1_scores of 0.93 and 0.96, respectively, indicating a good balance. PlncRNA-Hdeep achieved a precision of 0.96, recall of 0.93, and F1_score of 0.95, showcasing its effectiveness. PLEKv2 achieved a precision of 0.91, recall of 0.83, and F1_score of 0.87; its low recall suggests that its ability to identify positive samples requires optimization. CNCI exhibited the poorest performance, with a precision of 0.87, recall of only 0.75, and F1_score of 0.80, indicating significant deficiencies in identifying positive samples.

References

1. Gauthier J., Vincent A.T., Charette S.J., Derome N. A brief history of bioinformatics. Brief. Bioinform. 2019;20:1981–1996.
2. Liu S., Li X., Xie Q., Zhang S., Liang X., Li S., Zhang P. Identification of a lncRNA/circRNA-miRNA-mRNA network in Nasopharyngeal Carcinoma by deep sequencing and bioinformatics analysis. J. Cancer. 2024;15:1916.
3. Hubé F., Francastel C. Coding and non-coding RNAs, the frontier has never been so blurred. Front. Genet. 2018;9:369172.
4. Xu D., Yuan W., Fan C., Liu B., Lu M.Z., Zhang J. Opportunities and Challenges of Predictive Approaches for the Non-coding RNA in Plants. Front. Plant Sci. 2022;13:890663.
5. Shi K., Liu T., Fu H., Li W., Zheng X. Genome-wide analysis of lncRNA stability in human. PLoS Comput. Biol. 2021;17:e1008918.
