Front Immunol. 2022 Nov 1:13:1025688.
doi: 10.3389/fimmu.2022.1025688. eCollection 2022.

Establishment and analysis of a disease risk prediction model for the systemic lupus erythematosus with random forest

Huajian Chen et al. Front Immunol.

Abstract

Systemic lupus erythematosus (SLE) is a latent, insidious autoimmune disease. Building on recent advances in gene sequencing, this study aimed to develop a gene-based predictive model for identifying SLE at the genetic level. First, gene expression datasets of SLE whole blood samples were collected from the Gene Expression Omnibus (GEO) database. The merged datasets were divided into training and validation datasets at a ratio of 7:3: the training dataset contained 334 SLE samples and 71 healthy samples, and the validation dataset contained 143 SLE samples and 30 healthy samples. The training dataset was used to build the disease risk prediction model, and the validation dataset was used to verify the model's ability to identify SLE. We first analyzed differentially expressed genes (DEGs) and then used Lasso and random forest (RF) to screen out six key genes (OAS3, USP18, RTP4, SPATS2L, IFI27 and OAS1) that are essential for distinguishing SLE from healthy samples. We incorporated the six key genes into the RF model, performed five iterations of 10-fold cross-validation, and selected the RF model with the optimal mtry. The mean area under the curve (AUC) and the mean accuracy of the models were both over 0.95. On the validation dataset, the model had an AUC of 0.948. On an external validation dataset (GSE99967), the model achieved an AUC of 0.810, an accuracy of 0.836, and a sensitivity of 0.921. On a second external validation dataset (GSE185047), which consists entirely of SLE patients, the model reached a sensitivity of up to 0.954. The final RF model evaluated on high-throughput sequencing data had a mean AUC over 0.9, again showing good results.
In conclusion, we identified key genetic biomarkers and successfully developed a novel disease risk prediction model for SLE that can serve as a new risk prediction aid and contribute to the identification of SLE.
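
As a quick consistency check on the 7:3 split reported in the abstract, the following minimal Python sketch recomputes the per-class training and validation sizes from the class totals. The function name `stratified_counts` and the nearest-integer rounding rule are assumptions made for illustration; the abstract does not state how the authors rounded class sizes.

```python
# Class totals implied by the abstract: training count + validation count.
totals = {"SLE": 334 + 143, "healthy": 71 + 30}

def stratified_counts(totals, train_fraction=0.7):
    """Return per-class (train, validation) sizes for a stratified split.

    Hypothetical helper: each class is split separately so that both
    subsets keep roughly the same SLE-to-healthy ratio.
    """
    split = {}
    for label, n in totals.items():
        n_train = round(n * train_fraction)  # rounding rule is an assumption
        split[label] = (n_train, n - n_train)
    return split

split = stratified_counts(totals)
print(split)  # {'SLE': (334, 143), 'healthy': (71, 30)}
```

Under this assumption the sketch reproduces the reported counts (334/143 SLE and 71/30 healthy), which is consistent with a per-class 7:3 partition.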

Keywords: GEO; Lasso; disease risk prediction model; random forest; systemic lupus erythematosus.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
The flow chart of this study. Step 1: We merged the GSE50635, GSE61635, GSE138458 and GSE154851 datasets into one large dataset. Step 2: We partitioned the merged dataset by stratified random sampling, with a training-to-validation ratio of 7:3. Step 3: We performed differential expression analysis, Lasso regression, RF-RFE and RF feature importance scoring on the training dataset to screen key genes. Step 4: A random forest prediction model was constructed from the key genes. Step 5: We used 10-fold cross-validation to check the robustness of the model on the training dataset, and validated the model on the validation dataset and two external validation datasets to obtain the AUC, accuracy, and sensitivity. Step 6: For high-throughput sequencing data (GSE72509, GSE110685 and GSE112087), we directly incorporated the key genes and used 10-fold cross-validation to show that the random forest prediction model is equally robust in the context of high-throughput sequencing.
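
The repeated 10-fold cross-validation in Step 5 can be sketched as follows. This is a minimal standard-library illustration, not the authors' pipeline: the generator name `repeated_stratified_folds`, the round-robin fold assignment, and the seed are assumptions, and the label vector only mimics the training dataset's class sizes (334 SLE, 71 healthy).

```python
import random

def repeated_stratified_folds(labels, k=10, repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated stratified k-fold CV.

    Hypothetical helper: within each repeat, the indices of every class are
    shuffled and dealt round-robin into k folds, so each fold keeps roughly
    the overall class ratio.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        folds = [[] for _ in range(k)]
        for cls in sorted(set(labels)):
            idx = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)
            for j, i in enumerate(idx):
                folds[j % k].append(i)
        for f in range(k):
            test = sorted(folds[f])
            train = sorted(i for g in range(k) if g != f for i in folds[g])
            yield r, f, train, test

# Placeholder labels with the training dataset's class sizes (1 = SLE, 0 = healthy).
labels = [1] * 334 + [0] * 71
splits = list(repeated_stratified_folds(labels))
print(len(splits))  # 50 train/test splits: 5 repeats x 10 folds
```

In such a scheme a model would be fitted on `train` and scored on `test` for each of the 50 splits, and the per-split AUC and accuracy would then be averaged.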
Figure 2
Differential genes. (A) Volcano plot showing 22 significantly differentially expressed genes; red dots indicate up-regulated genes, black dots indicate genes without significant differential expression, and green dots indicate down-regulated genes. (B) Heat map of the 22 differential genes, which show upregulation trends.
Figure 3
Enrichment analysis. (A) Ring diagram of biological processes from GO enrichment analysis. (B) Ring diagram of KEGG enrichment analysis. (C) Ring diagram of DO enrichment analysis. (D) Pathway-related and immune-related GSEA.
Figure 4
Feature selection. (A) The Lasso regression curve of the 22 DEGs. (B) Selection of the tuning parameter (λ) by 10-fold cross-validation. (C) The 10-fold cross-validated RMSE of the signature gene combinations in RF-RFE. (D) Gene importance scores from the random forest.
Figure 5
ROC curve results verified by 10-fold cross-validation.
Figure 6
The ROC curves and their respective AUC values were used to evaluate the performance of the random forest model on the training (A), validation (B) and external validation (C) datasets.
Figure 7
The ROC curve results were verified by 10-fold cross-validation under high-throughput conditions.
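
The AUC, accuracy, and sensitivity reported for the ROC analyses can be computed from predicted scores as in the minimal sketch below. The function names, the 0.5 decision threshold, and the toy labels and scores are illustrative assumptions and are not data from the study.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_and_sensitivity(labels, scores, threshold=0.5):
    """Classify as SLE (1) when score >= threshold, then score the predictions."""
    preds = [int(s >= threshold) for s in scores]
    true_pos = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels), true_pos / sum(labels)

# Toy example (made up): 1 = SLE, 0 = healthy.
labels = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.4, 0.1]

model_auc = auc(labels, scores)
acc, sens = accuracy_and_sensitivity(labels, scores)
print(model_auc, sens)  # 0.875 0.75
```

Note that when a dataset contains only SLE patients, as described for GSE185047, there are no negatives, so AUC is undefined and sensitivity is the natural metric to report.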
