Int J Methods Psychiatr Res. 2015 Jun;24(2):156-69.
doi: 10.1002/mpr.1463. Epub 2015 May 21.

Prediction of remission in obsessive compulsive disorder using a novel machine learning strategy

Kathleen D Askland et al. Int J Methods Psychiatr Res. 2015 Jun.

Abstract

The study objective was to apply machine learning methodologies to identify predictors of remission in a longitudinal sample of 296 adults with a primary diagnosis of obsessive compulsive disorder (OCD). Random Forests is an ensemble machine learning algorithm that has been successfully applied to large-scale data analysis across a wide range of biomedical disciplines, though rarely in psychiatric research or to longitudinal data. When provided with 795 raw and composite scores primarily from baseline measures, Random Forest regression prediction explained 50.8% (5000-run average, 95% bootstrap confidence interval [CI]: 50.3-51.3%) of the variance in proportion of time spent remitted. Machine performance improved when only the most predictive 24 items were used in a reduced analysis. Consistently high-ranked predictors of longitudinal remission included Yale-Brown Obsessive Compulsive Scale (Y-BOCS) items, NEO items and subscale scores, Y-BOCS symptom checklist cleaning/washing compulsion score, and several self-report items from social adjustment scales. Random Forest classification was able to distinguish participants according to binary remission outcomes with an error rate of 24.6% (95% bootstrap CI: 22.9-26.2%). Our results suggest that clinically useful prediction of remission may not require an extensive battery of measures. Rather, a small set of assessment items may efficiently distinguish high- and lower-risk patients and inform clinical decision-making.
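The abstract reports a multi-run average with a 95% bootstrap confidence interval but does not say how the interval was computed. The sketch below shows one common approach, a percentile bootstrap over per-run scores, purely as an illustration. The function name `bootstrap_mean_ci`, its parameters, and the synthetic `run_scores` are assumptions introduced here; they are not the study's code or data.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(values)
    boot_means = []
    for _ in range(n_boot):
        # Resample the runs with replacement and record the resampled mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        boot_means.append(statistics.fmean(resample))
    boot_means.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the bootstrap means.
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(values), lower, upper


# Synthetic stand-in for per-run "% variance explained" values.
# These numbers are made up for illustration and are NOT the study's results.
data_rng = random.Random(42)
run_scores = [data_rng.gauss(50.0, 2.0) for _ in range(200)]

mean, lower, upper = bootstrap_mean_ci(run_scores)
print(f"mean={mean:.1f}%, 95% bootstrap CI: {lower:.1f}-{upper:.1f}%")
```

With many runs the interval around the average becomes narrow, which would be consistent with the tight intervals quoted in the abstract, although the authors' exact resampling scheme may differ.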

Keywords: obsessive compulsive disorder; risk factors; statistics.

Figures

Figure 1
Multidimensional scaling (MDS) plot: predicting Percent Time Remitted. Full (p = 795) Feature Set (points colored by binary outcome, Ever Remit). Sample MDS plot derived from a single random forest (RF) run under full feature analysis predicting the continuous outcome, Percent Time Remitted. For visualization purposes, the points (each of which corresponds to a single subject) are colored according to the binary outcome, Ever Remit.
Figure 2
Multidimensional scaling (MDS) plot: predicting Percent Time Remitted. Points colored by Neuroticism Subscale Score (NEO) and Degree of Interference due to Compulsions (Y‐BOCS). Sample MDS plot derived from a single random forest (RF) run under full feature analysis predicting the continuous outcome, Percent Time Remitted. This plot contains the same points as Figure 1. However, in this plot, the points are colored according to the subject's scores on two high‐ranked predictor items: a binary partition of the neuroticism subscale score (“lower neuroticism” corresponds to a score ≤ 50; “higher neuroticism” corresponds to a score > 50), and a binary partition of Y‐BOCS item #7, Interference due to compulsive behaviors (“Mild interference” corresponds to a score ≤ 1; “Mod‐Severe interference” corresponds to a score > 1).
Figure 3
“Representative Tree”: predicting Percent Time Remitted using 24 best predictors. This representative tree models the continuous outcome, Percent Time Remitted, and the 24 high‐priority features and was extracted from a single random forest (RF) run (ntree = 5000) using the R “reprtree” (Dasgupta, 2014) package. This package implements the concept of representative trees from ensembles of tree‐based machines on the basis of several tree distance metrics (Banerjee et al., 2012). Each node contains the variable selected for splitting at that node and the value on which it was split represented by a mathematical condition. The cases split to the left daughter node are those for which the condition was met; those in the right node are those for which the condition was not met. The numeric values displayed at each terminal node are the mean values of the outcome variable for the subjects residing in that terminal node.
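The reading rule in the Figure 3 caption (cases that meet a node's condition go to the left daughter, the rest go right, and each terminal node holds the mean outcome of its subjects) can be sketched as a small traversal. This is a minimal illustration and not the "reprtree" implementation; the `Node` class, the feature names `feature_a` and `feature_b`, the thresholds, the "≤" form of the condition, and the terminal values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """One node of a regression tree (hypothetical structure)."""
    feature: Optional[str] = None      # variable split on at this node
    threshold: Optional[float] = None  # split value; condition is feature <= threshold
    left: Optional["Node"] = None      # daughter for cases where the condition is met
    right: Optional["Node"] = None     # daughter for cases where it is not met
    value: Optional[float] = None      # terminal node only: mean outcome of its subjects


def predict(node: Node, subject: Dict[str, float]) -> float:
    """Follow the split conditions until a terminal node is reached."""
    while node.value is None:
        condition_met = subject[node.feature] <= node.threshold
        node = node.left if condition_met else node.right
    return node.value


# Made-up tree; the splits and terminal means are placeholders, not study results.
tree = Node(
    feature="feature_a", threshold=50.0,
    left=Node(
        feature="feature_b", threshold=1.0,
        left=Node(value=0.6),
        right=Node(value=0.3),
    ),
    right=Node(value=0.1),
)

print(predict(tree, {"feature_a": 40.0, "feature_b": 0.0}))
```

A subject's predicted outcome from a single tree is simply the value at the terminal node it lands in; a random forest averages such predictions over all of its trees.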

References

    1. Arnold S.E., Xie S.X., Leung Y.Y., Wang L.S., Kling M.A., Han X., Kim E.J., Wolk D.A., Bennett D.A., Chen‐Plotkin A., Grossman M., Hu W., Lee V.M., Mackin R.S., Trojanowski J.Q., Wilson R.S., Shaw L.M. (2012) Plasma biomarkers of depressive symptoms in older adults. Translational Psychiatry, 2(1), e65. DOI: 10.1038/tp.2011.63
    2. Banerjee M., Ding Y., Noone A.M. (2012) Identifying representative trees from ensembles. Statistics in Medicine, 31(15), 1601–1616. DOI: 10.1002/sim.4492
    3. Biau G. (2012) Analysis of a Random Forests model. Journal of Machine Learning Research, 13(1), 1063–1095.
    4. Biau G., Devroye L., Lugosi G. (2008) Consistency of Random Forests and other averaging classifiers. Journal of Machine Learning Research, 9, 2015–2033.
    5. Biener L., Abrams D.B. (1991) The contemplation ladder: Validation of a measure of readiness to consider smoking cessation. Health Psychology, 10(5), 360–365.