Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2019 Apr 17:6:66.
doi: 10.3389/fmed.2019.00066. eCollection 2019.

The Revival of the Notes Field: Leveraging the Unstructured Content in Electronic Health Records

Affiliations

The Revival of the Notes Field: Leveraging the Unstructured Content in Electronic Health Records

Michela Assale et al. Front Med (Lausanne). .

Abstract

Problem: Clinical practice requires the production of a time- and resource-consuming great amount of notes. They contain relevant information, but their secondary use is almost impossible, due to their unstructured nature. Researchers are trying to address this problems, with traditional and promising novel techniques. Application in real hospital settings seems not to be possible yet, though, both because of relatively small and dirty dataset, and for the lack of language-specific pre-trained models. Aim: Our aim is to demonstrate the potential of the above techniques, but also raise awareness of the still open challenges that the scientific communities of IT and medical practitioners must jointly address to realize the full potential of unstructured content that is daily produced and digitized in hospital settings, both to improve its data quality and leverage the insights from data-driven predictive models. Methods: To this extent, we present a narrative literature review of the most recent and relevant contributions to leverage the application of Natural Language Processing techniques to the free-text content electronic patient records. In particular, we focused on four selected application domains, namely: data quality, information extraction, sentiment analysis and predictive models, and automated patient cohort selection. Then, we will present a few empirical studies that we undertook at a major teaching hospital specializing in musculoskeletal diseases. Results: We provide the reader with some simple and affordable pipelines, which demonstrate the feasibility of reaching literature performance levels with a single institution non-English dataset. In such a way, we bridged literature and real world needs, performing a step further toward the revival of notes fields.

Keywords: clinical intelligence; data quality; information extraction; literature review; machine learning; natural language processing (NLP); sentiment analysis; text mining.

PubMed Disclaimer

Figures

Figure 1
Figure 1
The figure shows the most relevant measures to assess (Errors prevention) and improve (Errors correction) the quality of a free written text. Errors prevention can be performed before errors are made (Auto-complete techniques), contextually (Spell-correction software) or with a simplification, to avoid redundant information insertion. Error correction can be based on string similarity, when it is a typing error, or sound similarity, when it is a spelling error. Disambiguation is used to find similar expression on a context-basis. The colored boxes highlight the experimental approach proposed in this article.
Figure 2
Figure 2
The figure shows the most relevant information extraction techniques available in literature. There is always a first step of pre-processing, where documents are tokenized and differently weighted. One approach is to extract known entities on a pattern-basis, with rule-based methods. These entities can be matched to an existing ontology or used to create one, or intended as a first step for a machine learning algorithm, which can be supervised (mainly classification) or unsupervised (clustering). The colored boxes highlight the experimental approach proposed in this article.
Figure 3
Figure 3
The image represents the ROC curves for the classification of four fields of the discharge letters of hip and knee. In blue, we see the results taken from an unbalanced training set, while in black we see the application of the same model trained on balanced train data. The numbers reported on the figures show the performances achieved by the two implemented models.
Figure 4
Figure 4
The figure represents the most common steps to perform and evaluate sentiment analysis. After a pre-processing phase, when documents are divided into words and normalized, Lexicon-based and Machine Learning-based approaches are described. In the first group of methods, each word may have a polarity, which can be modified by the surroundings. In the second group, the whole sentence is assigned with a sentiment, on the basis of other sentences, adapted from a similar domain The colored boxes highlight the experimental approach proposed in this article.
Figure 5
Figure 5
The figure represents the process of creation of a predictive model, mainly focusing on classification. Models can be selected on some prior considerations about the value of temporal order of information (stateful or stateless) and to reduce variance when data availability is scarce (ensemble methods). The extracted features are engineered and selected and, then, the model is trained on a subpart of the dataset. Performance are evaluated, depending on the goal of the model (sensitivity vs. specificity) and the target class representation in the dataset (accuracy vs. AUROC or F1-measure). The colored boxes highlight the experimental approach proposed in this article.
Figure 6
Figure 6
In the pie chart (A), we report the proportions of comments referring to satisfaction, divided between hospitality theme (orange) and nursing theme (magenta), with respect to those related to outcomes (green). In bar charts (B), for each of the categories identified, the number of negative (red), neutral (yellow) or positive (blue) comments is highlighted.
Figure 7
Figure 7
(A) Trend of accuracy, specificity and sensitivity according to threshold variations of sentiment predicted for the model created by word counts. (B) Trend of accuracy, specificity and sensitivity according to threshold variations of sentiment predicted for the model created through training on tweets.
Figure 8
Figure 8
The time course of the ODI score for patients with herniated disc is shown. This index refers to a disability, so 0 indicates the optimal condition, 100 indicates total disability. The black line superimposed on the boxplots refers to the average of the scores, with a 95% confidence interval.
Figure 9
Figure 9
(A) Shows the Sensitivity and (B) shows AUROC of the models for the prediction of the improvement in terms of ODI in patients of disc herniation. We report in order: Balanced Bayesian model, Unbalanced Bayesian model, Unbalanced Support Vector Machine, balanced Support Vector Machine, unbalanced Random Forest, balanced Random Forest.
Figure 10
Figure 10
The figure represents a schematization of cohort selection. After an information retrieval process, concepts are mapped to standard medical classifications and used to select the relevant EHR.

References

    1. Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. J Biomed Health Informat. (2018) 22:1589–604. 10.1109/JBHI.2017.2767063 - DOI - PMC - PubMed
    1. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inform Sci Syst. (2014) 2:3. 10.1186/2047-2501-2-3 - DOI - PMC - PubMed
    1. Murdoch TB, Detsky AS. The inevitable application of big data to health care. J Am Med Assoc. (2013) 309:1351–2. 10.1001/jama.2013.393 - DOI - PubMed
    1. Liao KP, Cai T, Savova GK, Murphy SN, Karlson EW, Ananthakrishnan AN, et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. Brit Med J. (2015) 350:h1885. 10.1136/bmj.h1885 - DOI - PMC - PubMed
    1. Fitzpatrick G. Integrated care and the working record. Health Inform J. (2004) 10:291–302. 10.1177/1460458204048507 - DOI

LinkOut - more resources