J Clin Med. 2020 Dec 30;10(1):103.
doi: 10.3390/jcm10010103.

Increasing the Density of Laboratory Measures for Machine Learning Applications

Vida Abedi et al. J Clin Med. 2020.

Abstract

Background: The imputation of missing values is a key step in Electronic Health Records (EHR) mining, as it can significantly affect the conclusions derived from downstream analysis in translational medicine. Laboratory values in EHR are not missing at random, yet imputation techniques tend to disregard this key distinction. Consequently, the development of an adaptive imputation strategy designed specifically for EHR is an important step in reducing data imbalance and enhancing the predictive power of modeling tools for healthcare applications.

Method: We analyzed the laboratory measures derived from Geisinger's EHR on patients in three distinct cohorts: patients tested for Clostridioides difficile (Cdiff) infection, patients with a diagnosis of inflammatory bowel disease (IBD), and patients with a diagnosis of hip or knee osteoarthritis (OA). We extracted laboratory measures identified by Logical Observation Identifiers Names and Codes (LOINC) and excluded those with 75% or more missingness. The comorbidities, primary or secondary diagnoses, and active problem lists were also extracted. The adaptive imputation strategy was designed based on a hybrid approach. The comorbidity patterns of patients were transformed into latent patterns and then clustered. Imputation was performed on each cluster of patients, for each cohort independently, to show the generalizability of the method. The results were compared with imputation applied to the complete dataset without incorporating the information from comorbidity patterns.
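
The pipeline described in the Method paragraph can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the latent transformation or the clustering algorithm, so a truncated SVD and a plain k-means are used here as stand-ins, and cluster-wise mean imputation is swapped in for MICE to keep the example dependency-free. All function names (`drop_sparse_labs`, `latent_patterns`, `kmeans`, `impute_by_cluster`) and the toy data are hypothetical.

```python
import numpy as np

def drop_sparse_labs(labs, max_missing=0.75):
    """Exclude lab columns with max_missing (75%) or more missing values."""
    missing_frac = np.isnan(labs).mean(axis=0)
    keep = missing_frac < max_missing
    return labs[:, keep], keep

def latent_patterns(comorbidity, n_components=2):
    """Assumed stand-in: project a patient-by-diagnosis matrix onto top singular vectors."""
    centered = comorbidity - comorbidity.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def kmeans(points, k, n_iter=50):
    """Assumed stand-in: Lloyd's k-means with deterministic farthest-point seeding."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def _col_means(block, fallback):
    """Column means over observed values; use fallback where nothing is observed."""
    counts = (~np.isnan(block)).sum(axis=0)
    sums = np.nansum(block, axis=0)
    return np.where(counts > 0, sums / np.maximum(counts, 1), fallback)

def impute_by_cluster(labs, labels):
    """Fill missing labs with the mean of the patient's cluster (simplified; not MICE)."""
    filled = labs.copy()
    global_means = _col_means(labs, np.zeros(labs.shape[1]))
    for j in np.unique(labels):
        rows = np.where(labels == j)[0]
        means = _col_means(labs[rows], global_means)
        block = filled[rows]
        mask = np.isnan(block)
        block[mask] = np.broadcast_to(means, block.shape)[mask]
        filled[rows] = block
    return filled

# Toy data (hypothetical): 6 patients, 4 diagnosis codes, 2 lab measures.
nan = np.nan
comorbidity = np.array([[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3, dtype=float)
labs = np.array([[1.0, nan], [1.2, nan], [nan, nan],
                 [10.0, nan], [nan, nan], [9.0, 5.0]])

kept_labs, keep = drop_sparse_labs(labs)            # second lab is 5/6 missing, so dropped
labels = kmeans(latent_patterns(comorbidity), k=2)  # two comorbidity clusters
filled = impute_by_cluster(kept_labs, labels)
```

In this toy case the missing value for a patient in the first comorbidity group is filled from that group's observed values only, rather than from the pooled mean of all patients, which is the core idea of imputing within clusters.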

Results: We analyzed a total of 67,445 patients (11,230 IBD patients, 10,000 OA patients, and 46,215 patients tested for C. difficile infection). We extracted 495 LOINC codes and 11,230 diagnosis codes for the IBD cohort, 8,160 diagnosis codes for the Cdiff cohort, and 2,042 diagnosis codes for the OA cohort based on the primary/secondary diagnosis and active problem list in the EHR. Overall, the greatest improvement from this strategy was observed when the laboratory measures had a higher level of missingness. The best root mean square error (RMSE) difference for each dataset was recorded as -35.5 for the Cdiff, -8.3 for the IBD, and -11.3 for the OA dataset.
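
To make the RMSE difference concrete, the sketch below computes it as the RMSE of imputation with comorbidity information minus the RMSE without it, so that a negative value indicates an improvement. The sign convention follows the Figure 4 caption; the evaluation setup (comparing imputed values against known values that were held out) and the function names `rmse` and `rmse_difference` are assumptions made for illustration.

```python
import numpy as np

def rmse(true_values, imputed_values):
    """Root mean square error between known values and their imputed estimates."""
    diff = np.asarray(true_values, dtype=float) - np.asarray(imputed_values, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def rmse_difference(true_values, imputed_with, imputed_without):
    """RMSE with comorbidity information minus RMSE without it (negative = better)."""
    return rmse(true_values, imputed_with) - rmse(true_values, imputed_without)

# Hypothetical held-out lab values and two sets of imputed estimates.
truth = [1, 2, 3, 4]
with_comorbidity = [1, 2, 3, 6]     # RMSE = 1.0
without_comorbidity = [1, 2, 3, 8]  # RMSE = 2.0
print(rmse_difference(truth, with_comorbidity, without_comorbidity))  # -1.0
```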

Conclusions: An adaptive imputation strategy designed specifically for EHR that uses complementary information from the clinical profile of the patient can be used to improve the imputation of missing laboratory values, especially when laboratory codes with high levels of missingness are included in the analysis.

Keywords: C. difficile infection; EHR; complex diseases; electronic health records; imputation; inflammatory bowel disease; laboratory measures; machine learning; medical informatics; osteoarthritis.

Conflict of interest statement

Authors J.B.-R. and R.H. were employed by BioTherapeutics, Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Figures

Figure A1
Distribution of the laboratory values normalized for all the LOINC included in this study.
Figure 1
The pattern of missingness for the three cohorts. A generalized additive model was used for smoothing. The gray area around the smoothing curve represents a 95% confidence interval. (A) The percentage of patients with one laboratory measurement versus the missingness percentage for the three datasets. (B) The average number of years between the first and last laboratory measurements (calculated for patients with two or more measurements) versus the missingness percentage for the three datasets. (C) The frequency of the laboratory measurements calculated for patients with two or more measurements versus the missingness percentage for the three datasets. Cdiff: Clostridioides difficile, IBD: inflammatory bowel disease, and OA: osteoarthritis.
Figure 2
Distribution of laboratory values normalized for Logical Observation Identifiers Names and Codes (LOINC) 2501-5 (iron-binding capacity) for the three datasets (Cdiff in red, IBD in green, and OA in blue). The “iron-binding capacity” is missing at 52% in the Cdiff dataset, 65% in the IBD dataset, and 64% in the OA dataset. The subpanels represent the three modeled distributions to calculate the upper and lower boundaries. The dashed lines represent the upper and lower outlier boundaries (based on Equation (1)).
Figure 3
Distribution of laboratory values normalized for LOINC 787-2 (mean corpuscular volume or MCV) for the three datasets (Cdiff in red, IBD in green, and OA in blue). The “MCV” is missing at 2% in the Cdiff dataset, 5% in the IBD dataset, and 4% in the OA dataset. The subpanels represent the three modeled distributions to calculate the upper and lower boundaries. The dashed lines represent the upper and lower outlier boundaries (based on Equation (1)).
Figure 4
Violin plots representing the root mean square error (RMSE) differences, comparing the performance of Multivariate Imputation by Chained Equations (MICE) with and without the comorbidity information. Two algorithms, predictive mean matching (pmm) and Random Forest (rf), were compared. A negative RMSE difference indicates a performance improvement when the comorbidity information is utilized.
