The Revival of the Notes Field: Leveraging the Unstructured Content in Electronic Health Records

Michela Assale^{1,2}, Linda Greta Dui^{3,4}, Andrea Cina^{1,2}, Andrea Seveso^{2,4}, Federico Cabitza^{2,5}

Affiliations

¹ K-tree SRL, Pont-Saint-Martin, Italy.
² University of Milano-Bicocca, Milan, Italy.
³ Politecnico di Milano, Milan, Italy.
⁴ Link-Up Datareg, Cinisello Balsamo, Italy.
⁵ IRCCS Istituto Ortopedico Galeazzi, Milan, Italy.

PMID: 31058150
PMCID: PMC6478793
DOI: 10.3389/fmed.2019.00066

Michela Assale et al. Front Med (Lausanne). 2019 Apr 17;6:66. doi: 10.3389/fmed.2019.00066. eCollection 2019.

Abstract

Problem: Clinical practice requires the production of a great amount of notes, which is time- and resource-consuming. These notes contain relevant information, but their secondary use is almost impossible because of their unstructured nature. Researchers are trying to address this problem with both traditional and promising novel techniques. Application in real hospital settings does not yet seem possible, though, both because datasets are relatively small and dirty and because language-specific pre-trained models are lacking.

Aim: Our aim is to demonstrate the potential of the above techniques, but also to raise awareness of the still open challenges that the scientific communities of IT and medical practitioners must jointly address to realize the full potential of the unstructured content that is produced and digitized daily in hospital settings, both to improve its data quality and to leverage the insights from data-driven predictive models.

Methods: To this end, we present a narrative literature review of the most recent and relevant contributions on the application of Natural Language Processing techniques to the free-text content of electronic patient records. In particular, we focus on four selected application domains: data quality, information extraction, sentiment analysis and predictive models, and automated patient cohort selection. We then present a few empirical studies that we undertook at a major teaching hospital specializing in musculoskeletal diseases.

Results: We provide the reader with some simple and affordable pipelines, which demonstrate the feasibility of reaching literature performance levels with a single-institution, non-English dataset. In this way, we bridge the literature and real-world needs, taking a further step toward the revival of the notes field.

Keywords: clinical intelligence; data quality; information extraction; literature review; machine learning; natural language processing (NLP); sentiment analysis; text mining.

Figures

**Figure 3**
The image represents the ROC curves for the classification of four fields of the discharge letters of hip and knee. In blue, we see the results taken from an unbalanced training set, while in black we see the application of the same model trained on balanced train data. The numbers reported on the figures show the performances achieved by the two implemented models.
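The caption contrasts a model trained on unbalanced data with one trained on balanced data, but does not say how the balancing was done. One common option is random undersampling of the majority class; the sketch below illustrates that option only, under that assumption. The function name `undersample` and its parameters are hypothetical.

```python
import random

def undersample(records, labels, seed=0):
    """Randomly undersample every class down to the size of the rarest class.

    Assumption: this is one generic balancing strategy, not necessarily
    the one used for the figure.
    """
    rng = random.Random(seed)
    # Group the records by their class label.
    by_class = {}
    for record, label in zip(records, labels):
        by_class.setdefault(label, []).append(record)
    if not by_class:
        return []
    # Every class is reduced to the size of the smallest one.
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for record in rng.sample(items, target):
            balanced.append((record, label))
    # Shuffle so that classes are not grouped in the training order.
    rng.shuffle(balanced)
    return balanced
```

Undersampling discards majority-class examples, so with small datasets oversampling or class weighting may be preferable; the choice depends on how much data is available.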

**Figure 4**
The figure represents the most common steps to perform and evaluate sentiment analysis. After a pre-processing phase, in which documents are divided into words and normalized, Lexicon-based and Machine Learning-based approaches are described. In the first group of methods, each word may have a polarity, which can be modified by the surrounding words. In the second group, the whole sentence is assigned a sentiment on the basis of other sentences, adapted from a similar domain. The colored boxes highlight the experimental approach proposed in this article.
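The lexicon-based idea in this caption (each word carries a polarity that its surroundings can modify) can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the article: the lexicon, the negation list, and the function names are hypothetical, and the toy vocabulary is English for readability.

```python
import re

# Hypothetical toy lexicon; real lexicons are much larger and language-specific.
LEXICON = {
    "good": 1.0, "excellent": 2.0, "kind": 1.0,
    "bad": -1.0, "painful": -1.5, "slow": -1.0,
}
# Hypothetical list of words that flip the polarity of the next polar word.
NEGATIONS = {"not", "no", "never"}

def tokenize(text):
    """Pre-processing: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def lexicon_polarity(text):
    """Sum word polarities; a negation flips the sign of the next polar word."""
    score = 0.0
    negate = False
    for token in tokenize(text):
        if token in NEGATIONS:
            negate = True
            continue
        polarity = LEXICON.get(token, 0.0)
        if polarity != 0.0:
            score += -polarity if negate else polarity
            negate = False
    return score
```

For example, `lexicon_polarity("The staff was not kind")` is negative because the negation flips the polarity of "kind", whereas a plain word count would score the sentence as positive.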

**Figure 5**
The figure represents the process of creating a predictive model, mainly focusing on classification. Models can be selected on the basis of prior considerations about the value of the temporal order of information (stateful or stateless) and the need to reduce variance when data availability is scarce (ensemble methods). The extracted features are engineered and selected, and the model is then trained on a subset of the dataset. Performance is evaluated depending on the goal of the model (sensitivity vs. specificity) and the target class representation in the dataset (accuracy vs. AUROC or F1-measure). The colored boxes highlight the experimental approach proposed in this article.
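The caption's point that the metric should depend on the target class representation can be illustrated with a small worked example. The sketch below uses made-up labels (5 positives out of 100) to show how accuracy can look good while the F1-measure exposes a useless classifier; the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative, made-up data: 5 positive cases among 100.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100

print(accuracy(y_true, always_negative))  # 0.95
print(f1_score(y_true, always_negative))  # 0.0
```

A classifier that never predicts the positive class reaches 95% accuracy here yet has an F1-measure of zero, which is why accuracy alone is a poor guide when the target class is rare.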

**Figure 6**
In the pie chart **(A)**, we report the proportions of comments referring to satisfaction, divided between hospitality theme (orange) and nursing theme (magenta), with respect to those related to outcomes (green). In bar charts **(B)**, for each of the categories identified, the number of negative (red), neutral (yellow) or positive (blue) comments is highlighted.

**Figure 7**
**(A)** Trend of accuracy, specificity and sensitivity according to threshold variations of sentiment predicted for the model created by word counts. **(B)** Trend of accuracy, specificity and sensitivity according to threshold variations of sentiment predicted for the model created through training on tweets.
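Curves like those described in this caption come from sweeping a decision threshold over the predicted sentiment scores and recomputing the metrics at each value. The sketch below shows that computation on made-up scores and labels; it assumes that a score at or above the threshold counts as a positive prediction, and the function names are hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def metrics_at(scores, labels, threshold):
    """Return (accuracy, sensitivity, specificity) at a given threshold."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    accuracy = (tp + tn) / len(labels)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity

# Illustrative, made-up sentiment scores and true labels (1 = positive comment).
scores = [0.1, 0.4, 0.6, 0.9]
labels = [0, 0, 1, 1]

for threshold in (0.0, 0.5, 0.7):
    print(threshold, metrics_at(scores, labels, threshold))
```

Raising the threshold trades sensitivity for specificity, so the operating point is usually chosen according to which kind of error matters more for the application.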

**Figure 8**
The time course of the ODI score for patients with herniated disc is shown. This index refers to a disability, so 0 indicates the optimal condition, 100 indicates total disability. The black line superimposed on the boxplots refers to the average of the scores, with a 95% confidence interval.

**Figure 9**
**(A)** shows the sensitivity and **(B)** shows the AUROC of the models for predicting improvement in terms of ODI in patients with disc herniation. We report, in order: balanced Bayesian model, unbalanced Bayesian model, unbalanced Support Vector Machine, balanced Support Vector Machine, unbalanced Random Forest, balanced Random Forest.

**Figure 10**
The figure represents a schematization of cohort selection. After an information retrieval process, concepts are mapped to standard medical classifications and used to select the relevant EHR.
