Meta-Analysis
J Am Med Inform Assoc. 2022 Jun 14;29(7):1292-1302.
doi: 10.1093/jamia/ocac058.

Use of unstructured text in prognostic clinical prediction models: a systematic review

Tom M Seinen et al. J Am Med Inform Assoc.

Abstract

Objective: This systematic review aims to assess how information from unstructured text is used to develop and validate clinical prognostic prediction models. We summarize the prediction problems and methodological landscape and determine whether using text data in addition to more commonly used structured data improves the prediction performance.

Materials and methods: We searched Embase, MEDLINE, Web of Science, and Google Scholar to identify studies, published from January 2005 to March 2021, that developed prognostic prediction models using information extracted from unstructured text in a data-driven manner. Data items were extracted and analyzed, and a meta-analysis of model performance was carried out to assess the added value of text to structured-data models.

Results: We identified 126 studies that described 145 clinical prediction problems. Combining text and structured data improved model performance, compared with using only text or only structured data. In these studies, a wide variety of dense and sparse numeric text representations were combined with both deep learning and more traditional machine learning methods. External validation, public availability, and attention to the explainability of the developed models were limited.
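As an illustration of what a combined-data model takes as input, the following is a minimal sketch of one sparse text representation (a bag-of-words count vector) concatenated with structured features. The function names, the whitespace tokenizer, and the example notes are assumptions made for illustration; they are not taken from any of the reviewed studies.

```python
from collections import Counter


def build_vocabulary(notes):
    """Map each distinct lowercase token to a column index (sorted order)."""
    tokens = sorted({tok for note in notes for tok in note.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(note, vocab):
    """Sparse-style count vector: one count per vocabulary token."""
    counts = Counter(note.lower().split())
    # dicts preserve insertion order, so iteration follows the column index
    return [counts.get(tok, 0) for tok in vocab]


def combine(structured_row, note, vocab):
    """Concatenate structured features with the text count vector."""
    return list(structured_row) + bag_of_words(note, vocab)


# Hypothetical training notes and one patient (age, sex flag) for illustration.
notes = ["chest pain", "no chest pain reported"]
vocab = build_vocabulary(notes)
text_only = bag_of_words("chest pain chest", vocab)
combined = combine([67, 1], "no chest pain reported", vocab)
```

Here `text_only` would feed a text-based model and `combined` a combined-data model; a dense representation such as an embedding would replace `bag_of_words` while the concatenation step stays the same.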

Conclusion: In most studies, using unstructured text in addition to structured data was found to be beneficial for the development of prognostic prediction models. Text data are a source of valuable information for prediction model development and should not be neglected. We suggest a future focus on explainability and external validation of the developed models, promoting robust and trustworthy prediction models in clinical practice.

Keywords: clinical prediction model; electronic health records; machine learning; natural language processing; prognostic prediction.

Figures

Figure 1.
Visualization of the prognostic prediction problem. The objective is to predict which patients from a target population will experience an outcome event within a prediction horizon, using predictors only measured in an observation window before the time of prediction. Predictors can be extracted from both the structured data and text data.
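The setup in this caption can be sketched in code as follows. This is a minimal illustration under stated assumptions: the `Event` type, the function name, the inclusive window boundaries, and the example dates are all hypothetical and not specified by the review.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Event:
    day: date
    source: str  # "structured" or "text" (hypothetical labels)
    value: str


def split_patient_timeline(events, outcome_days, t0, window_days, horizon_days):
    """Return (predictors, label) for one patient at prediction time t0."""
    window_start = t0 - timedelta(days=window_days)
    horizon_end = t0 + timedelta(days=horizon_days)
    # Predictors: only events measured in the observation window before t0.
    predictors = [e for e in events if window_start <= e.day <= t0]
    # Label: 1 if an outcome event falls within the prediction horizon after t0.
    label = int(any(t0 < d <= horizon_end for d in outcome_days))
    return predictors, label


# Hypothetical patient: 30-day observation window, 30-day prediction horizon.
t0 = date(2020, 1, 1)
events = [
    Event(date(2019, 6, 1), "structured", "dx:I50"),              # too old
    Event(date(2019, 12, 20), "text", "note: shortness of breath"),
    Event(date(2020, 1, 5), "text", "discharge summary"),         # after t0
]
predictors, label = split_patient_timeline(
    events, outcome_days=[date(2020, 1, 20)], t0=t0,
    window_days=30, horizon_days=30,
)
```

Excluding anything recorded after `t0` from the predictors is the point of the sketch: a note written during the horizon would leak outcome information into the model.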
Figure 2.
Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow diagram with the search and screening results of the systematic review.
Figure 3.
Sankey diagram of the different categories of target populations and clinical outcomes, and clinical outcomes and prediction horizons, ordered by size. The number in parentheses indicates the number of prediction problems with these categories and the width of the connection between 2 categories represents the number of prediction problems with this combination of categories.
Figure 4.
(A) Boxplots of the number of observations (left) and outcome cases (right) of 145 prediction problems. (B) Boxplot of the ratios between the number of observations and outcome cases. In both (A) and (B), the mean is indicated by the diamond and the points represent the underlying data.
Figure 5.
(A) The use of different text representations (TR) and machine learning (ML) methods in text-based or combined-data prediction models over time. No eligible studies in 2013. (D)NN are all feedforward and deep neural network-based methods. (B) The combinations of text representations (left) and machine-learning methods (right) in text-based or combined-data prediction models. The number in parentheses indicates the number of prediction problems with these categories and the width of the connection between 2 categories represents the number of prediction problems with this combination of categories. Both (A) and (B) share the same legend: the colors of the nodes indicate the types of text representations and machine learning methods.
Figure 6.
(A) Area under the receiver operating characteristic curve (AUC) difference distribution boxplots of the combined and structured-data models (ΔAUC Combined−Structured), the text and structured-data models (ΔAUC Text−Structured), and the combined and text-based models (ΔAUC Combined−Text). (B) Text and structured-data model AUC difference (ΔAUC Text−Structured) boxplots for 4 different clinical settings. In both (A) and (B), the means are indicated by a diamond, the points represent the underlying data, sample sizes are shown on top, and the dotted line indicates the AUC difference of zero. ns: not significant; *P < .05, ****P < .001.
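The three AUC differences named in this caption can be computed from per-problem reported AUCs roughly as follows. This is a sketch only: the function name, the dictionary layout, and the AUC values are invented placeholders for illustration, not results from the review, and the statistical testing behind the significance markers is not reproduced here.

```python
def auc_differences(reported):
    """Collect paired AUC differences per prediction problem.

    `reported` is a list of dicts with any of the keys
    'text', 'structured', 'combined' (hypothetical layout).
    A difference is only computed when both models were reported.
    """
    pairs = {
        "combined-structured": ("combined", "structured"),
        "text-structured": ("text", "structured"),
        "combined-text": ("combined", "text"),
    }
    diffs = {name: [] for name in pairs}
    for problem in reported:
        for name, (a, b) in pairs.items():
            if a in problem and b in problem:
                diffs[name].append(round(problem[a] - problem[b], 4))
    return diffs


# Placeholder AUCs for three made-up prediction problems.
reported = [
    {"text": 0.78, "structured": 0.80, "combined": 0.85},
    {"text": 0.82, "structured": 0.79},
    {"structured": 0.70, "combined": 0.72},
]
diffs = auc_differences(reported)
```

Because not every study reports all three model types, each comparison ends up with its own sample size, which is why the figure shows sample sizes per boxplot.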

Publication types