Brief Funct Genomics. 2025 Jan 15;24:elae048.
doi: 10.1093/bfgp/elae048.

STLBRF: an improved random forest algorithm based on standardized-threshold for feature screening of gene expression data

Huini Feng et al. Brief Funct Genomics.

Abstract

When the traditional random forest (RF) algorithm is used to select features in biostatistical data, large amounts of noisy data and the many parameters involved can distort the importance scores of the selected features, making feature selection difficult to control. Preserving the accuracy of the results in the presence of noisy data is therefore a challenge for the traditional RF algorithm, while simply removing the noisy data can introduce significant bias. In this study, we develop a new algorithm, standardized threshold, and loops based random forest (STLBRF), and apply it to feature gene selection in gene expression data. Built on the traditional RF algorithm, it combines backward elimination and K-fold cross-validation to construct a cyclic system and sets a standardized threshold, the error increment. The algorithm overcomes the shortcomings of existing gene selection methods. We compare ridge regression, lasso regression, elastic net regression, the traditional RF algorithm, and our improved RF algorithm in a quantitative analysis on three real gene expression datasets. To ensure the reliability of the results, we validate the effectiveness of the genes selected by each method using a random forest classifier. The results indicate that, compared with the other methods, the STLBRF algorithm achieves not only higher effectiveness in feature gene selection but also better control over the number of selected genes. Our method offers reliable technical support for feature expression analysis and research on biomarker selection.

Keywords: biomarker; feature gene selection; improved random forest algorithm; noise data; standardized threshold.
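
The abstract names the ingredients of the method (backward elimination, K-fold cross-validation, a loop, and an error-increment threshold) but not the exact update rule. The following is a minimal sketch of how such a loop could be wired together, not the authors' implementation. It assumes that each round drops a fixed fraction of the least important features and that the loop stops once the cross-validated error rises by more than a fixed amount over the lowest error seen so far. The names `k_fold_indices`, `cross_validated`, `backward_select`, `drop_fraction`, and `max_error_increment` are hypothetical, and `toy_fit_and_score` is a hand-written stand-in for fitting a random forest on one fold and reading off its test error and feature importances.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Score = Tuple[float, Dict[str, float]]  # (error, importance per feature)
FoldScorer = Callable[[List[str], List[int], List[int]], Score]


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle sample indices and return k (train, test) index pairs."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits


def cross_validated(fit_and_score: FoldScorer, n_samples: int, k: int = 5) -> Callable[[List[str]], Score]:
    """Wrap a per-fold scorer so it returns error and importances averaged over k folds."""
    splits = k_fold_indices(n_samples, k)

    def evaluate(features: List[str]) -> Score:
        total_error = 0.0
        total_importance = {f: 0.0 for f in features}
        for train, test in splits:
            error, importance = fit_and_score(features, train, test)
            total_error += error
            for f in features:
                total_importance[f] += importance[f]
        return total_error / k, {f: v / k for f, v in total_importance.items()}

    return evaluate


def backward_select(
    features: Sequence[str],
    evaluate: Callable[[List[str]], Score],
    drop_fraction: float = 0.2,         # assumed: share of features dropped per round
    max_error_increment: float = 0.02,  # assumed: tolerated rise over the best error
) -> Tuple[List[str], List[Tuple[int, float]]]:
    """Drop low-importance features until the error increment exceeds the threshold."""
    current = list(features)
    error, importance = evaluate(current)
    best_error = error
    history = [(len(current), error)]
    while len(current) > 1:
        n_drop = max(1, int(len(current) * drop_fraction))
        ranked = sorted(current, key=lambda f: importance[f], reverse=True)
        candidate = ranked[: max(1, len(current) - n_drop)]
        cand_error, cand_importance = evaluate(candidate)
        if cand_error - best_error > max_error_increment:
            break  # removing these features costs too much accuracy; keep `current`
        current, importance = candidate, cand_importance
        best_error = min(best_error, cand_error)
        history.append((len(current), cand_error))
    return current, history


def toy_fit_and_score(features: List[str], train: List[int], test: List[int]) -> Score:
    """Stand-in for a random forest: g0 and g1 are informative, the rest are noise."""
    informative = {"g0": 1.0, "g1": 0.8}
    noise = [f for f in features if f not in informative]
    base = 0.1 if all(g in features for g in informative) else 0.4
    importance = {f: informative.get(f, 0.1 - 0.001 * int(f[1:])) for f in features}
    return base + 0.01 * len(noise), importance


genes = [f"g{i}" for i in range(10)]
selected, history = backward_select(genes, cross_validated(toy_fit_and_score, n_samples=20))
print(selected)
print(history)
```

In real use, the toy scorer would be replaced by a function that trains a random forest on the training indices and returns its error on the test indices together with its per-gene importances. The `history` list records the number of retained genes against the cross-validated error in each round, which is the kind of relationship the accuracy figures report.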

Figures

Figure 1
Flowchart of feature filtering in the STLBRF algorithm.
Figure 2
Classification accuracy of the five algorithms on the tuberculosis dataset, showing how each algorithm's accuracy varies with the number of significant genes screened.
Figure 3
Classification accuracy of the five algorithms on the gastric cancer microarray dataset, showing how each algorithm's accuracy varies with the number of significant genes screened.
Figure 4
Classification accuracy of the five algorithms on the gastric cancer RNA-seq dataset, showing how each algorithm's accuracy varies with the number of significant genes screened.
