NPJ Digit Med. 2022 Jan 27;5(1):15.
doi: 10.1038/s41746-022-00558-0.

Predictive structured-unstructured interactions in EHR models: A case study of suicide prediction

Ilkin Bayramli et al. NPJ Digit Med.

Abstract

Clinical risk prediction models powered by electronic health records (EHRs) are becoming increasingly widespread in clinical practice. With suicide-related mortality rates rising in recent years, it is becoming increasingly urgent to understand, predict, and prevent suicidal behavior. Here, we compare the predictive value of structured and unstructured EHR data for predicting suicide risk. We find that Naive Bayes Classifier (NBC) and Random Forest (RF) models trained on structured EHR data perform better than those based on unstructured EHR data. An NBC model trained on both structured and unstructured data yields similar performance (AUC = 0.743) to an NBC model trained on structured data alone (0.742, p = 0.668), while an RF model trained on both data types yields significantly better results (AUC = 0.903) than an RF model trained on structured data alone (0.887, p < 0.001), likely due to the RF model's ability to capture interactions between the two data types. To investigate these interactions, we propose and implement a general framework for identifying specific structured-unstructured feature pairs whose interactions differ between case and non-case cohorts, and thus have the potential to improve predictive performance and increase understanding of clinical risk. We find that such feature pairs tend to capture heterogeneous pairs of general concepts, rather than homogeneous pairs of specific concepts. These findings and this framework can be used to improve current and future EHR-based clinical modeling efforts.
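The abstract attributes the RF model's gain to its ability to capture interactions between the two data types. The sketch below illustrates that idea only; it does not reproduce the paper's models or data. The counts are invented toy values, `nb_score` is a simplified Naive Bayes style score that treats the two features as independent, and `joint_score` (a per-cell case rate) is a hypothetical stand-in for any interaction-aware model such as a random forest. AUC is computed in-sample on the same toy table.

```python
from itertools import product
from math import log

# Hypothetical toy counts, NOT taken from the paper.
# Key: (structured feature A, unstructured feature B); value: (cases, non-cases).
COUNTS = {
    (1, 1): (30, 10),
    (1, 0): (2, 38),
    (0, 1): (2, 38),
    (0, 0): (16, 64),
}
N_CASE = sum(c for c, _ in COUNTS.values())
N_NON = sum(n for _, n in COUNTS.values())


def marginal_lr(index, value):
    """Likelihood ratio P(x | case) / P(x | non-case) for one feature value."""
    case = sum(c for k, (c, _) in COUNTS.items() if k[index] == value)
    non = sum(n for k, (_, n) in COUNTS.items() if k[index] == value)
    return (case / N_CASE) / (non / N_NON)


def nb_score(key):
    """Naive Bayes style score: each feature contributes independently."""
    return sum(log(marginal_lr(i, v)) for i, v in enumerate(key))


def joint_score(key):
    """Interaction-aware score: observed case rate in each (A, B) cell."""
    cases, non_cases = COUNTS[key]
    return cases / (cases + non_cases)


def auc(score):
    """Probability that a random case outranks a random non-case (ties = 0.5)."""
    total = 0.0
    for (k1, (cases, _)), (k2, (_, non_cases)) in product(COUNTS.items(), repeat=2):
        s1, s2 = score(k1), score(k2)
        total += cases * non_cases * (1.0 if s1 > s2 else 0.5 if s1 == s2 else 0.0)
    return total / (N_CASE * N_NON)


print(f"independent-features AUC: {auc(nb_score):.3f}")
print(f"joint-cell AUC:           {auc(joint_score):.3f}")
```

In this toy table the independent-features score reaches an AUC of about 0.70 and the joint-cell score about 0.83. The gap arises because patients with exactly one of the two features have lower risk than patients with neither, an ordering that a score built from per-feature terms cannot represent.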

Conflict of interest statement

Dr. Smoller reported serving as an unpaid member of the Bipolar/Depression Research Community Advisory Panel of 23andMe and a member of the Leon Levy Foundation Neuroscience Advisory Board, and receiving an honorarium for an internal seminar at Biogen Inc. Dr. Nock receives textbook royalties from Macmillan and Pearson publishers and has been a paid consultant in the past year for Microsoft and for a legal case regarding a death by suicide. He is an unpaid scientific advisor for TalkLife and Empatica. The remaining authors declare no competing interests.

Figures

Fig. 1. Information overlap in EHR data.
Electronic health records contain both structured and unstructured data. These two types of data contain both unique and overlapping information.
Fig. 2. Data and modeling workflow.
The diagram describes the filtering and processing steps taken to arrive at the final datasets used for training and testing different models described in this paper. STR—Structured Data; NLP—Unstructured data processed by Natural Language Processing; NBC—Naïve Bayesian Classifier; BRFC—Balanced Random Forest Classifier.
Fig. 3. Distribution of time between penultimate hospital visit and first suicide attempt, in days.
As the distribution was highly skewed, the x-axis was capped at 100 days for clarity. A few patients had several years between their last recorded visit and suicide attempt.
Fig. 4. Performance of NBC and BRFC models, by type of data used.
BRFC models perform considerably better than NBC models in terms of AUC across all three datasets. Combining structured and unstructured data yields better performance than using structured data alone, which itself performs better than using unstructured data only.
Fig. 5. Interaction heterogeneity versus joint suicide risk.
A comparison of joint suicide attempt risk and interaction heterogeneity. Each data point corresponds to a structured-unstructured feature pair AB. The x-axis shows the joint suicide attempt risk of features A and B, defined as the log of the ratio of the expected joint occurrences of AB in the case vs. non-case cohorts. The y-axis shows the interaction heterogeneity, a measure of how much the interaction between A and B differs between case and non-case cohorts. The plot shows that feature pairs with similar joint suicide attempt risk can have very different interaction heterogeneity.
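The two axes of Fig. 5 can be sketched on toy data as follows. This is an illustration under stated assumptions, not the paper's implementation: "expected joint occurrences" is read here as the per-patient co-occurrence rate of A and B in a cohort, and, because this record does not give the formula for interaction heterogeneity, the absolute difference between the cohorts' smoothed log odds ratios is used as a hypothetical stand-in. All counts, dictionary keys, and function names are invented for the example.

```python
from math import log


def joint_rate(table):
    """Fraction of patients in a cohort who have both A and B (assumed reading)."""
    return table["ab"] / sum(table.values())


def joint_risk(case, non_case):
    """Log ratio of joint AB occurrence in the case vs. non-case cohort."""
    return log(joint_rate(case) / joint_rate(non_case))


def log_odds_ratio(table, smooth=0.5):
    """Log odds ratio between A and B, with 0.5 added to each cell for stability."""
    ab = table["ab"] + smooth
    a_only = table["a_only"] + smooth
    b_only = table["b_only"] + smooth
    neither = table["neither"] + smooth
    return log((ab * neither) / (a_only * b_only))


def interaction_heterogeneity(case, non_case):
    """Hypothetical stand-in measure: gap between the cohorts' A-B associations."""
    return abs(log_odds_ratio(case) - log_odds_ratio(non_case))


# Two invented feature pairs with the same joint risk, log(0.20 / 0.05) = log(4).
pair_1 = (
    {"ab": 20, "a_only": 20, "b_only": 20, "neither": 40},      # case cohort
    {"ab": 50, "a_only": 150, "b_only": 150, "neither": 650},   # non-case cohort
)
pair_2 = (
    {"ab": 20, "a_only": 5, "b_only": 5, "neither": 70},        # case cohort
    {"ab": 50, "a_only": 200, "b_only": 200, "neither": 550},   # non-case cohort
)

for name, (case, non_case) in {"pair 1": pair_1, "pair 2": pair_2}.items():
    print(name, round(joint_risk(case, non_case), 3),
          round(interaction_heterogeneity(case, non_case), 3))
```

Both toy pairs share the same x-axis value, yet the second pair has a much larger heterogeneity because A and B are strongly associated among cases and weakly (negatively) associated among non-cases. This mirrors the caption's point that similar joint risk does not imply similar interaction heterogeneity.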