Sci Rep. 2019 Jul 3;9(1):9617.
doi: 10.1038/s41598-019-45989-0.

Machine learning approaches to predict lupus disease activity from gene expression data


Brian Kegerreis et al. Sci Rep. 2019.

Abstract

The integration of gene expression data to predict systemic lupus erythematosus (SLE) disease activity is a significant challenge because of the high degree of heterogeneity among patients and study cohorts, especially those collected on different microarray platforms. Here we deployed machine learning approaches to integrate gene expression data from three SLE data sets and used it to classify patients as having active or inactive disease as characterized by standard clinical composite outcome measures. Both raw whole blood gene expression data and informative gene modules generated by Weighted Gene Co-expression Network Analysis from purified leukocyte populations were employed with various classification algorithms. Classifiers were evaluated by 10-fold cross-validation across three combined data sets or by training and testing in independent data sets, the latter of which amplified the effects of technical variation. A random forest classifier achieved a peak classification accuracy of 83 percent under 10-fold cross-validation, but its performance could be severely affected by technical variation among data sets. The use of gene modules rather than raw gene expression was more robust, achieving classification accuracies of approximately 70 percent regardless of how the training and testing sets were formed. Fine-tuning the algorithms and parameter sets may generate sufficient accuracy to be informative as a standalone estimate of disease activity.
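The two evaluation schemes contrasted in the abstract can be sketched as index-splitting routines. This is a minimal illustration, not the authors' pipeline: the function names, the random seed, and the per-data-set sample counts in the usage lines are hypothetical, and the cross-data-set scheme assumes the "train on one data set, test on the others" reading of the Figure 6 caption.

```python
import random
from collections import defaultdict


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Samples from all data sets are pooled and shuffled, so every fold
    mixes patients from every study.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, test


def cross_dataset_splits(dataset_labels):
    """Yield (train_dataset, train_idx, test_idx) triples.

    Each data set is used once for training; all samples from the
    remaining data sets form the test set, so technical variation between
    studies is not averaged away as it is in pooled k-fold splitting.
    """
    by_dataset = defaultdict(list)
    for i, name in enumerate(dataset_labels):
        by_dataset[name].append(i)
    for train_name in sorted(by_dataset):
        train = list(by_dataset[train_name])
        test = sorted(i for name, ids in by_dataset.items()
                      if name != train_name for i in ids)
        yield train_name, train, test


# Hypothetical sample counts per GEO accession, for illustration only.
labels = ["GSE39088"] * 4 + ["GSE45291"] * 5 + ["GSE49454"] * 3
pooled = list(k_fold_splits(len(labels), k=10))
independent = list(cross_dataset_splits(labels))
```

In the pooled scheme a classifier sees examples from every platform during training, which is one plausible reason the abstract reports higher accuracy there than in the independent scheme.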

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Heatmaps of −log10(overlap p values) from RRHO. Strongest overlaps near the center of each plot indicate weak agreement among the most significantly upregulated and downregulated genes from each data set. Strong agreement between data sets should form a diagonal from the bottom-left corner to the top-right corner.
Figure 2
Clustering all three studies on three consistent DE genes. DNAJC13, IRF4, and RPL22 were consistently differentially expressed in each study yet fail to fully separate active from inactive patients. Orange bars denote active patients; black bars denote inactive patients. Blue, yellow, and red bars denote patients from GSE39088, GSE45291, and GSE49454, respectively.
Figure 3
Cellular gene modules provide the basis for machine learning predictions of SLE activity. GSVA was performed on three SLE WB datasets using 25 WGCNA modules made from purified SLE cells that correlate with, or have a published relationship to, SLEDAI (see Table 2). Orange: active patient; black: inactive patient. LDG: low-density granulocyte; PC: plasma cell.
Figure 4
Individual WGCNA modules are ineffective at separating active and inactive SLE subjects. GSVA enrichment scores for (a) CD4_Floralwhite and (b) CD4_Orangered4 in SLE WB are unable to fully separate active patients from inactive patients. Asterisks denote significant differences by Welch’s t-test. Error bars indicate mean ± standard deviation.
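For readers unfamiliar with the test named in this caption, Welch's t-test compares two group means without assuming equal variances. The sketch below computes only the test statistic and the Welch-Satterthwaite degrees of freedom using the standard library; the function name and the toy scores are hypothetical and are not taken from the paper. A p-value additionally needs the t-distribution CDF, which SciPy provides through `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Toy enrichment scores for two groups (hypothetical values).
inactive = [1.0, 2.0, 3.0, 4.0, 5.0]
active = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(inactive, active)
```

A significant difference in group means, as marked by the asterisks, is compatible with heavily overlapping distributions, which is why a single module can differ significantly yet still fail to separate the two groups.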
Figure 5
Performance of machine learning classifiers across three independent data sets. Classifiers were trained on the data sets listed across the top and evaluated in the data sets listed across the bottom. Data sets are listed by their GEO accession numbers. Expression (black): gene expression data. WGCNA (blue): module enrichment scores.
Figure 6
Area under the ROC curve of machine learning classifiers across three independent data sets. Classifiers were trained on the data sets listed across the top and tested in the other two data sets. Data sets are listed by their GEO accession numbers. Expression (black): gene expression data. WGCNA (blue): module enrichment scores.
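The area under the ROC curve equals the probability that a randomly chosen active patient receives a higher classifier score than a randomly chosen inactive patient, with ties counted as one half. The sketch below computes it directly from that definition; the function name, label coding (1 = active, 0 = inactive), and scores are assumptions made for illustration and do not come from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from pairwise comparisons.

    labels: iterable of 1 (positive, e.g. active) or 0 (negative).
    scores: classifier scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count positive/negative pairs ranked correctly; ties score 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for four patients.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])  # 3 of 4 pairs correct: 0.75
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of this size; a rank-based implementation would scale better on large data.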
Figure 7
Random forest classifier reveals variable importance of genes and modules. (a) Variable importance of top 25 individual genes as determined by mean decrease in Gini impurity. (b) Variable importance of cell modules. (c) As many modules shared genes, modules were deduplicated to determine the effects on the random forest classifier. The relative importance of the full modules and deduplicated modules was strongly correlated (Spearman’s rho = 0.69, p = 1.94E-4). LDG: low-density granulocyte; PC: plasma cell.
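The comparison of full and deduplicated module importance relies on Spearman's rho, which is the Pearson correlation of the rank-transformed values. A minimal sketch follows, using average ranks for ties; the function names and the importance values are hypothetical, and the sketch does not reproduce the reported rho of 0.69.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical importance scores for five modules, before and after
# removing genes shared between modules.
full = [1.0, 2.0, 3.0, 4.0, 5.0]
deduplicated = [5.0, 6.0, 7.0, 8.0, 7.0]
rho = spearman_rho(full, deduplicated)
```

Because only ranks enter the calculation, rho measures whether the ordering of modules by importance is preserved after deduplication, not whether the importance values themselves are similar in magnitude.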
