CPT Pharmacometrics Syst Pharmacol. 2023 Sep;12(9):1201-1212.
doi: 10.1002/psp4.12998. Epub 2023 Jun 15.

A systematic approach towards missing lab data in electronic health records: A case study in non-small cell lung cancer and multiple myeloma

Arjun Sondhi et al. CPT Pharmacometrics Syst Pharmacol. 2023 Sep.

Abstract

Real-world data derived from electronic health records often exhibit high levels of missingness in variables such as laboratory results, presenting a challenge for statistical analyses. We developed a systematic workflow for gathering evidence of different missingness mechanisms and performing subsequent statistical analyses. We quantify evidence for missing completely at random (MCAR) and missing at random (MAR) mechanisms using Hotelling's multivariate t-test and random forest classifiers, respectively. We further illustrate how to apply sensitivity analyses using the not at random fully conditional specification (NARFCS) procedure to examine changes in parameter estimates under missing not at random (MNAR) mechanisms. In simulation studies, we validated these diagnostics and compared analytic bias under different mechanisms. To demonstrate the application of this workflow, we applied it to two exemplary case studies, an advanced non-small cell lung cancer cohort and a multiple myeloma cohort, both derived from a real-world oncology database. Here, we found strong evidence against MCAR and some evidence of MAR, implying that imputation approaches that attempt to predict missing values by fitting a model to observed data may be suitable for use. Sensitivity analyses did not suggest meaningful changes in our analytic results under potential MNAR mechanisms; these results were also in line with results reported in clinical trials.

Conflict of interest statement

A.S., P.Y., C.J., M.S., and S.C. all report employment in Flatiron Health Inc., which is an independent subsidiary of the Roche Group, and stock ownership in Roche. J.W. reports employment at Hoffmann‐La Roche, and stock ownership in Roche. M.T. reports employment at Genentech, a Member of the Roche Group, and stock ownership in Roche.

Figures

FIGURE 1
M‐graph representations of the four missingness mechanisms considered, with lab_obs representing observed laboratory values subject to missingness. (a) MCAR: Missingness (M_lab) is independent of the true laboratory values (lab) and any other variables X. (b) MAR: Missingness (M_lab) depends on observed variables X but is independent of true laboratory values (lab). (c) MNAR‐unmeasured: Missingness (M_lab) depends on unobserved variables U but is independent of true laboratory values (lab). (d) MNAR‐value: Missingness (M_lab) depends on true laboratory values (lab). A directed edge represents a causal effect of a variable on another. MAR, missing at random; MCAR, missing completely at random; MNAR, missing not at random.
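The four mechanisms in this caption can be illustrated with a small simulation. This is a minimal sketch, not the authors' simulation design: the function name `simulate`, the coefficients, the 30% MCAR rate, and the choice to let both X and U influence the true laboratory value are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate(mechanism, n=5000, seed=0):
    """Simulate one cohort under a named missingness mechanism.

    Hypothetical setup: x is an observed covariate, u is an unobserved
    one, and the true lab value depends on both (an assumption).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        u = rng.gauss(0, 1)
        lab = 0.5 * x + 0.5 * u + rng.gauss(0, 1)
        if mechanism == "MCAR":
            p_missing = 0.3                         # ignores x, u, and lab
        elif mechanism == "MAR":
            p_missing = sigmoid(-1.0 + 1.5 * x)     # driven by observed x
        elif mechanism == "MNAR-unmeasured":
            p_missing = sigmoid(-1.0 + 1.5 * u)     # driven by unobserved u
        elif mechanism == "MNAR-value":
            p_missing = sigmoid(-1.0 + 1.5 * lab)   # driven by the lab value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        missing = rng.random() < p_missing
        rows.append({
            "x": x,
            "lab": lab,
            "missing": missing,
            "lab_obs": None if missing else lab,    # what an analyst would see
        })
    return rows


if __name__ == "__main__":
    for name in ("MCAR", "MAR", "MNAR-unmeasured", "MNAR-value"):
        rows = simulate(name)
        rate = sum(r["missing"] for r in rows) / len(rows)
        print(name, "missing rate:", round(rate, 3))
```

In this toy setup, only the MCAR cohort leaves the observed laboratory values as a random subsample of the true ones; in the other three, the observed values are a selected subset.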
FIGURE 2
Illustration of systematic workflow to diagnose potential missingness mechanisms. AUC, area under the receiver operating characteristic curve; H0, null hypothesis; MAR, missing at random; MCAR, missing completely at random; MNAR, missing not at random; NARFCS, not at random fully conditional specification; SMD, standardized mean difference.
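Two of the quantities named in this caption, the SMD and the AUC, are simple to compute by hand. The sketch below is illustrative only: the abstract states that the workflow uses Hotelling's multivariate test and random forest classifiers, whereas here a single covariate is compared between groups and any numeric score stands in for a classifier's predicted probability of missingness. The function names and the toy numbers are hypothetical.

```python
import math
from statistics import mean, variance


def smd(group_a, group_b):
    """Standardized mean difference using the pooled sample SD."""
    pooled_sd = math.sqrt((variance(group_a) + variance(group_b)) / 2)
    return (mean(group_a) - mean(group_b)) / pooled_sd


def auc(scores_missing, scores_observed):
    """Rank-based AUC: the probability that a randomly chosen patient with
    a missing lab has a higher score than one with an observed lab
    (ties count as one half)."""
    wins = 0.0
    for s_m in scores_missing:
        for s_o in scores_observed:
            if s_m > s_o:
                wins += 1.0
            elif s_m == s_o:
                wins += 0.5
    return wins / (len(scores_missing) * len(scores_observed))


if __name__ == "__main__":
    # Hypothetical covariate values for patients with and without a lab result.
    with_missing_lab = [2.0, 3.0, 4.0]
    with_observed_lab = [1.0, 2.0, 3.0]
    print("SMD:", smd(with_missing_lab, with_observed_lab))
    print("AUC:", auc(with_missing_lab, with_observed_lab))
```

An AUC close to 0.5 would mean the score does not separate patients with missing and observed values, which is what MCAR would predict; values well above 0.5 point towards missingness that depends on observed covariates.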
FIGURE 3
Top: Distributions of L1 bias (sum of absolute biases across all hazard ratios estimated) in simulation studies, by missingness mechanism. Middle: Distributions of absolute bias for each hazard ratio estimated (laboratory, treatment, and covariate) in simulation studies, by missingness mechanism. Bottom: Coverage probabilities of 95% confidence intervals for each hazard ratio estimated (laboratory, treatment, and covariate) in simulation studies, by missingness mechanism. MAR, missing at random; MCAR, missing completely at random; MNAR, missing not at random.
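The two summary metrics in this caption can be written out in a few lines. This is a sketch under assumptions: the caption defines L1 bias as the sum of absolute biases across the estimated hazard ratios, but it does not say whether bias is taken on the hazard ratio or log hazard ratio scale, so the code simply works on whatever scale the inputs use. All numbers and names below are hypothetical.

```python
from statistics import mean


def l1_bias(estimates, truth):
    """Sum of absolute biases across all parameters for one simulation replicate."""
    return sum(abs(estimates[name] - truth[name]) for name in truth)


def coverage(intervals, true_value):
    """Fraction of (lower, upper) confidence intervals that contain the true value."""
    return mean(1.0 if lower <= true_value <= upper else 0.0
                for lower, upper in intervals)


if __name__ == "__main__":
    # Hypothetical true hazard ratios and one replicate's estimates.
    truth = {"laboratory": 1.5, "treatment": 0.7, "covariate": 1.2}
    estimate = {"laboratory": 1.4, "treatment": 0.75, "covariate": 1.2}
    print("L1 bias:", l1_bias(estimate, truth))

    # Hypothetical 95% intervals for the laboratory hazard ratio
    # from three replicates.
    lab_intervals = [(1.2, 1.8), (1.6, 2.0), (1.0, 1.5)]
    print("coverage:", coverage(lab_intervals, truth["laboratory"]))
```

A well-calibrated 95% interval should reach coverage near 0.95 across many replicates; coverage well below that signals bias or understated uncertainty under the mechanism being simulated.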
FIGURE 4
AUC diagnostic results on real‐world cohorts. Covariates used in aNSCLC cohort: Age at index date, gender, index year, histology, group stage, smoking status, birth year, race/ethnicity, region, ECOG, age at diagnosis, age at advanced diagnosis, time from initial diagnosis to index, time from advanced diagnosis to index date, time from index date to end of follow‐up, censoring indicator, PD‐L1 status. Covariates used in MM cohort: Age at index date (1L treatment start), ECOG at index date, gender, ISS stage, index year (calendar year of 1L treatment start), practice type, race/ethnicity, region, time from diagnosis to index date, time from index date to end of follow‐up, censoring indicator, line of therapy. aNSCLC, advanced non‐small cell lung cancer; AUC, area under the receiver operating characteristic curve; ECOG, Eastern Cooperative Oncology Group; ISS, International Staging System; MM, multiple myeloma.
FIGURE 5
NARFCS sensitivity analysis results for real‐world analyses in NSCLC and MM. δ=0 indicates the HR estimated under multiple imputation without any sensitivity adjustment. δ>0 indicates a shift in the imputation model where missing laboratory values are more likely to be normal than those observed; δ<0 indicates a shift in the imputation model where missing laboratory values are more likely to be abnormal than those observed. The y‐axis displays the estimated HR with its 95% confidence interval. HR, hazard ratio; MM, multiple myeloma; NARFCS, not at random fully conditional specification; NSCLC, non‐small cell lung cancer.
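The role of δ described in this caption can be sketched as a shift on the logit scale of a binary imputation model. This is only an illustration of the delta-adjustment idea, assuming the laboratory value is coded as normal (1) versus abnormal (0) and imputed from a logistic model; it is not the full NARFCS procedure, which embeds such a shift inside chained-equation multiple imputation. The function name and the linear predictor values are hypothetical.

```python
import math
import random


def impute_normal_flags(linear_predictors, delta, seed=0):
    """Draw imputed normal (1) / abnormal (0) flags for missing lab values.

    linear_predictors: logit of P(normal) from a model fitted to observed
    data (hypothetical values here). delta shifts that logit for the
    missing values: delta > 0 makes "normal" more likely, delta < 0
    makes "abnormal" more likely, and delta = 0 is the unadjusted draw.
    """
    rng = random.Random(seed)
    flags = []
    for lp in linear_predictors:
        p_normal = 1.0 / (1.0 + math.exp(-(lp + delta)))
        flags.append(1 if rng.random() < p_normal else 0)
    return flags


if __name__ == "__main__":
    # 1000 hypothetical patients whose fitted probability of a normal lab is 0.5.
    lps = [0.0] * 1000
    for delta in (-2.0, 0.0, 2.0):
        share = sum(impute_normal_flags(lps, delta)) / len(lps)
        print("delta =", delta, "share imputed normal:", share)
```

In a full sensitivity analysis, the outcome model would be refitted on the imputed data for each δ, and the resulting hazard ratios compared, as the figure does.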
