Predictive structured-unstructured interactions in EHR models: A case study of suicide prediction

Ilkin Bayramli^{1,2}, Victor Castro^{3,4}, Yuval Barak-Corren^{1}, Emily M Madsen^{5,6}, Matthew K Nock^{4,7,8}, Jordan W Smoller^#^{5,6,9}, Ben Y Reis^#^{10,11}

Affiliations

¹ Predictive Medicine Group, Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA.
² Harvard University, Cambridge, MA, USA.
³ Mass General Brigham Research Information Science and Computing, Boston, MA, USA.
⁴ Department of Psychiatry, Massachusetts General Hospital, Boston, MA, USA.
⁵ Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
⁶ Center for Precision Psychiatry, Department of Psychiatry, Massachusetts General Hospital, Boston, MA, USA.
⁷ Department of Psychology, Harvard University, Cambridge, MA, USA.
⁸ Mental Health Research Program, Franciscan Children's, Brighton, MA, USA.
⁹ Harvard Medical School, Boston, MA, USA.
¹⁰ Predictive Medicine Group, Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA. ben_reis@harvard.edu.
¹¹ Harvard Medical School, Boston, MA, USA. ben_reis@harvard.edu.

^# Contributed equally.

PMID: 35087182
PMCID: PMC8795240
DOI: 10.1038/s41746-022-00558-0

NPJ Digit Med. 2022 Jan 27;5(1):15. doi: 10.1038/s41746-022-00558-0.

Abstract

Clinical risk prediction models powered by electronic health records (EHRs) are becoming increasingly widespread in clinical practice. With suicide-related mortality rates rising in recent years, it is becoming increasingly urgent to understand, predict, and prevent suicidal behavior. Here, we compare the predictive value of structured and unstructured EHR data for predicting suicide risk. We find that Naive Bayes Classifier (NBC) and Random Forest (RF) models trained on structured EHR data perform better than those based on unstructured EHR data. An NBC model trained on both structured and unstructured data yields similar performance (AUC = 0.743) to an NBC model trained on structured data alone (0.742, p = 0.668), while an RF model trained on both data types yields significantly better results (AUC = 0.903) than an RF model trained on structured data alone (0.887, p < 0.001), likely due to the RF model's ability to capture interactions between the two data types. To investigate these interactions, we propose and implement a general framework for identifying specific structured-unstructured feature pairs whose interactions differ between case and non-case cohorts, and thus have the potential to improve predictive performance and increase understanding of clinical risk. We find that such feature pairs tend to capture heterogeneous pairs of general concepts, rather than homogeneous pairs of specific concepts. These findings and this framework can be used to improve current and future EHR-based clinical modeling efforts.
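The abstract attributes the RF model's gain on combined data to its ability to capture interactions between the two data types. The following minimal sketch illustrates that general point on a deliberately exaggerated toy cohort. It is not the authors' pipeline: the counts, the two binary flags, and the "joint table" model (a stand-in for any interaction-aware learner such as a random forest) are all hypothetical, and scores are evaluated on the same rows used to fit them.

```python
from itertools import product

# Hypothetical toy cohort, not data from the paper.
# Key: (structured flag A, unstructured flag B) -> (cases, non-cases).
# Risk is elevated only when exactly one flag is present, so all of the
# signal lives in the A-B interaction and none in either marginal.
COUNTS = {
    (0, 0): (10, 90),
    (0, 1): (40, 60),
    (1, 0): (40, 60),
    (1, 1): (10, 90),
}

def expand(counts):
    """Turn cell counts into one (a, b, outcome) row per patient."""
    rows = []
    for (a, b), (n_case, n_non) in counts.items():
        rows += [(a, b, 1)] * n_case + [(a, b, 0)] * n_non
    return rows

def auc(scores, labels):
    """Rank-based AUC: chance that a random case outscores a random
    non-case, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def naive_bayes_table(rows):
    """P(case | a, b) under the naive Bayes independence assumption
    (no smoothing, which is safe here because no count is zero)."""
    by_class = {y: [r for r in rows if r[2] == y] for y in (0, 1)}
    prior = {y: len(by_class[y]) / len(rows) for y in (0, 1)}

    def cond(idx, val, y):
        return sum(r[idx] == val for r in by_class[y]) / len(by_class[y])

    table = {}
    for a, b in product((0, 1), repeat=2):
        joint = {y: prior[y] * cond(0, a, y) * cond(1, b, y) for y in (0, 1)}
        table[(a, b)] = joint[1] / (joint[0] + joint[1])
    return table

def joint_table(rows):
    """P(case | a, b) estimated separately for each (a, b) cell, so the
    interaction between the two flags is modeled directly."""
    table = {}
    for a, b in product((0, 1), repeat=2):
        cell = [r[2] for r in rows if (r[0], r[1]) == (a, b)]
        table[(a, b)] = sum(cell) / len(cell)
    return table

rows = expand(COUNTS)
labels = [y for _, _, y in rows]
nb = naive_bayes_table(rows)
jt = joint_table(rows)
auc_nb = auc([nb[(a, b)] for a, b, _ in rows], labels)
auc_joint = auc([jt[(a, b)] for a, b, _ in rows], labels)
print(f"naive Bayes AUC: {auc_nb:.3f}, joint-table AUC: {auc_joint:.3f}")
```

On this toy cohort the naive Bayes model assigns every patient the same score (AUC 0.5), while the interaction-aware model reaches an AUC of 0.7. Real EHR features carry marginal signal as well, so the gap reported in the abstract is far smaller than in this constructed extreme.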

Conflict of interest statement

Dr. Smoller reported serving as an unpaid member of the Bipolar/Depression Research Community Advisory Panel of 23andMe and a member of the Leon Levy Foundation Neuroscience Advisory Board, and receiving an honorarium for an internal seminar at Biogen Inc. Dr. Nock receives textbook royalties from Macmillan and Pearson publishers and has been a paid consultant in the past year for Microsoft and for a legal case regarding a death by suicide. He is an unpaid scientific advisor for TalkLife and Empatica. The remaining authors declare no competing interests.

Figures

**Fig. 1. Information overlap in EHR data.**
Electronic health records contain both structured and unstructured data. These two types of data contain both unique and overlapping information.

**Fig. 2. Data and modeling workflow.**
The diagram describes the filtering and processing steps taken to arrive at the final datasets used for training and testing different models described in this paper. STR—Structured Data; NLP—Unstructured data processed by Natural Language Processing; NBC—Naïve Bayesian Classifier; BRFC—Balanced Random Forest Classifier.

**Fig. 3. Distribution of time between penultimate hospital visit and first suicide attempt, in days.**
As the distribution was highly skewed, the x-axis was capped at 100 days for clarity. A few patients had several years between their last recorded visit and suicide attempt.

**Fig. 4. Performance of NBC and BRFC models, by type of data used.**
BRFC models perform considerably better than NBC models in terms of AUC across all three datasets. Combining structured and unstructured data yields better performance than using structured data alone, which itself performs better than using unstructured data only.

**Fig. 5. Interaction heterogeneity versus joint suicide risk.**
A comparison of joint suicide attempt risk and interaction heterogeneity. Each data point corresponds to a structured-unstructured feature pair AB. The x-axis shows the joint suicide risk of features A and B, defined as the log of the ratio of the expected joint occurrences of AB in the case vs. non-case cohorts. The y-axis shows the interaction heterogeneity, a measure of how much the interaction between A and B differs between the case and non-case cohorts. The plot shows that feature pairs with similar joint suicide attempt risk can have very different interaction heterogeneity.
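The two axes of Fig. 5 can be sketched in code under stated assumptions. The caption does not give exact formulas, so this sketch assumes that the "expected joint occurrence" of AB in a cohort is the product of the marginal rates of A and B, that the interaction within a cohort is the log ratio of the observed to the expected joint rate, and that heterogeneity is the absolute difference of that quantity between cohorts. The function names and all counts are hypothetical, and the paper's actual definitions may differ.

```python
import math

def rates(n_a, n_b, n_ab, n_total):
    """Marginal and joint occurrence rates of features A and B in a cohort."""
    return n_a / n_total, n_b / n_total, n_ab / n_total

def interaction(cohort):
    """Assumed interaction measure: log(observed joint / expected joint),
    where the expectation treats A and B as independent.
    Zero counts would need smoothing; this sketch omits it."""
    p_a, p_b, p_ab = rates(*cohort)
    return math.log(p_ab / (p_a * p_b))

def joint_risk(case, non_case):
    """Log ratio of expected joint occurrence rates, case vs. non-case
    (assumed reading of the x-axis of Fig. 5)."""
    pa1, pb1, _ = rates(*case)
    pa0, pb0, _ = rates(*non_case)
    return math.log((pa1 * pb1) / (pa0 * pb0))

def interaction_heterogeneity(case, non_case):
    """Assumed reading of the y-axis: how far the A-B interaction in the
    case cohort departs from the one in the non-case cohort."""
    return abs(interaction(case) - interaction(non_case))

# Hypothetical cohorts as (n_A, n_B, n_AB, n_total); not data from the paper.
pair1_case, pair1_non = (200, 100, 20, 1000), (100, 50, 5, 1000)
pair2_case, pair2_non = (200, 100, 60, 1000), (100, 50, 5, 1000)

for name, case, non in [("pair 1", pair1_case, pair1_non),
                        ("pair 2", pair2_case, pair2_non)]:
    print(name,
          f"joint risk = {joint_risk(case, non):.3f},",
          f"heterogeneity = {interaction_heterogeneity(case, non):.3f}")
```

Both hypothetical pairs share the same marginal rates and therefore the same joint risk, but only the second co-occurs more often than independence predicts in the case cohort, so only it has non-zero heterogeneity. This mirrors the caption's observation that pairs with similar joint risk can differ widely in interaction heterogeneity.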
