Front Genet. 2024 Jun 27;15:1409755.
doi: 10.3389/fgene.2024.1409755. eCollection 2024.

Development and evaluation of a chronic kidney disease risk prediction model using random forest

Krish Mendapara. Front Genet. 2024.

Abstract

This research aims to advance the detection of chronic kidney disease (CKD) through a novel gene-based predictive model that leverages recent advances in gene sequencing. We sourced and merged gene expression profiles of CKD-affected renal tissues from the Gene Expression Omnibus (GEO) database and split them into training and validation sets in a 7:3 ratio. The training set included 141 CKD and 33 non-CKD specimens, and the validation set included 60 CKD and 14 non-CKD specimens. The disease risk prediction model was constructed on the training set, and the validation set was used to confirm its ability to identify CKD. Model development began with identifying differentially expressed genes (DEGs) between the two groups. Using Lasso and random forest (RF) methods, we isolated six genes (DUSP1, GADD45B, IFI44L, IFI30, ATF3, and LYZ) that are critical in differentiating CKD from non-CKD tissues. We tuned the RF model's mtry parameter through 10-fold cross-validation repeated five times. The model performed well, with an average AUC of 0.979 across the folds and an accuracy of 91.18%. On the validation set, it reached an accuracy of 94.59% and an AUC of 0.990. External validation on dataset GSE180394 yielded an AUC of 0.913, an accuracy of 89.83%, and a sensitivity of 0.889, supporting the model's reliability. In summary, the study identified critical genetic biomarkers and developed a novel disease risk prediction model for CKD. This model can serve as a tool for CKD risk assessment and contribute to CKD identification.
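The 7:3 split described above can be illustrated with a minimal sketch. It is not the author's code: the function name `stratified_split`, the seed, and the rounding rule are assumptions made for illustration, and only the class totals (201 CKD and 47 non-CKD specimens) are taken from the counts in the abstract. Under those assumptions, splitting each class separately reproduces the reported set sizes.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices into training and validation sets per class.

    Hypothetical helper: shuffles each class separately so both sets keep
    roughly the same class ratio as the full dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, validation = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Assumption: round to the nearest whole sample within each class.
        n_train = round(train_fraction * len(indices))
        train.extend(indices[:n_train])
        validation.extend(indices[n_train:])
    return sorted(train), sorted(validation)


# Class totals implied by the abstract: 141 + 60 CKD, 33 + 14 non-CKD.
labels = ["CKD"] * 201 + ["non-CKD"] * 47
train_idx, val_idx = stratified_split(labels)

classes = ("CKD", "non-CKD")
train_counts = {c: sum(labels[i] == c for i in train_idx) for c in classes}
val_counts = {c: sum(labels[i] == c for i in val_idx) for c in classes}
print(train_counts, val_counts)
```

With these inputs the per-class counts come out as 141/33 for training and 60/14 for validation, matching the abstract. The specific samples assigned to each set depend on the seed.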

Keywords: CKD; biomarkers; chronic kidney disease; computational genomics and proteomics; differentially expressed genes (DEGs); disease risk prediction algorithm; random forest.

Conflict of interest statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
This study’s process is outlined as follows. In the first step, datasets GSE35488, GSE32591, GSE66494, and GSE47184 were merged into a single comprehensive dataset. The second step involved dividing this extensive dataset into training and validation sets using stratified random sampling, adhering to a 7:3 ratio. The third step focused on the training dataset, where differential expression analysis was performed, Lasso regression and RF-RFE (Random Forest - Recursive Feature Elimination) were executed, and the feature importance score of RF was used to pinpoint essential genes. In the fourth step, these key genes were integrated into a random forest prediction model. The fifth step entailed evaluating the model’s effectiveness through 5-fold cross-validation on the training set. Additionally, the model’s robustness was tested using the validation set and an external validation dataset (GSE180394), with performance measured in terms of the area under the curve (AUC), accuracy, and sensitivity.
FIGURE 2
Differentially expressed genes. (A) In the volcano plot, 35 genes are highlighted for their significant differential expression, with green dots indicating upregulated genes, black dots for genes with no notable differences, and red dots for downregulated genes. (B) The heat map illustrates the expression patterns of these 35 genes, clearly showing trends of both upregulation and downregulation.
FIGURE 3
Enrichment analysis. (A) Bar plot of biological processes from GO enrichment analysis. (B) Bar plot of KEGG enrichment analysis results. (C) Bar plot of DO enrichment analysis results.
FIGURE 4
Feature selection. (A) Lasso regression curve depicting the 35 DEGs. (B) Options for the λ parameter in the 10-fold cross-validation. (C) RMSE values for the 10-fold cross-validation of the RF-RFE-selected signature gene combination. (D) Importance scores of genes in the random forest model. Development of the random forest model.
FIGURE 5
The ROC curve results were confirmed by a 5-fold cross-validation.
FIGURE 6
The performance of the random forest model was evaluated across the training (A), validation (B), and external validation (C) datasets, utilizing ROC curves and analyzing their respective AUC values.
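
The three reported metrics (AUC, accuracy, and sensitivity) can be computed from predicted probabilities as in the sketch below. This is an illustration only: the labels and scores are made-up toy values, not data from the study, and the function names and the 0.5 decision threshold are assumptions. AUC is computed here as the fraction of CKD/non-CKD pairs in which the CKD sample receives the higher score, which is equivalent to the area under the ROC curve.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


def accuracy_and_sensitivity(y_true, scores, threshold=0.5):
    """Accuracy and sensitivity (true-positive rate) at an assumed threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)
    return accuracy, sensitivity


# Toy, made-up example: 1 = CKD, 0 = non-CKD; scores are predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
scores = [0.95, 0.80, 0.70, 0.40, 0.35, 0.60, 0.90, 0.10]

auc = roc_auc(y_true, scores)
accuracy, sensitivity = accuracy_and_sensitivity(y_true, scores)
print(f"AUC={auc:.3f} accuracy={accuracy:.2f} sensitivity={sensitivity:.2f}")
```

The abstract does not give a confusion matrix, so the reported values cannot be recomputed here. As a plausibility check only, 70 correct calls out of the 74 validation specimens would give 94.59% accuracy, which is consistent with the reported figure.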
