Addressing Measurement Error in Random Forests Using Quantitative Bias Analysis

Tammy Jiang et al. Am J Epidemiol. 2021 Sep 1;190(9):1830-1840. doi: 10.1093/aje/kwab010.

Abstract

Although variables are often measured with error, the impact of measurement error on machine-learning predictions is seldom quantified. The purpose of this study was to assess the impact of measurement error on the performance of random-forest models and variable importance. First, we assessed the impact of misclassification (i.e., measurement error of categorical variables) of predictors on random-forest model performance (e.g., accuracy, sensitivity) and variable importance (mean decrease in accuracy) using data from the National Comorbidity Survey Replication (2001-2003). Second, we created simulated data sets in which we knew the true model performance and variable importance measures and could verify that quantitative bias analysis was recovering the truth in misclassified versions of the data sets. Our findings showed that measurement error in the data used to construct random forests can distort model performance and variable importance measures and that bias analysis can recover the correct results. This study highlights the utility of applying quantitative bias analysis in machine learning to quantify the impact of measurement error on study results.
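The misclassification scenarios studied here can be illustrated with a minimal sketch (not the authors' code; the predictor prevalence and variable names below are hypothetical) that induces nondifferential misclassification in a binary predictor at a given sensitivity and specificity:

```python
import numpy as np

rng = np.random.default_rng(0)

def misclassify(x, sensitivity, specificity, rng):
    """Induce nondifferential misclassification in a binary 0/1 array.

    Each true 1 is observed as 1 with probability `sensitivity`;
    each true 0 is observed as 0 with probability `specificity`.
    """
    observed = x.copy()
    ones = x == 1
    observed[ones] = rng.binomial(1, sensitivity, ones.sum())
    observed[~ones] = rng.binomial(1, 1 - specificity, (~ones).sum())
    return observed

true_x = rng.binomial(1, 0.3, 10_000)           # hypothetical binary predictor
noisy_x = misclassify(true_x, 0.45, 0.99, rng)  # sens/spec values as in Figure 4, panel B
```

Fitting a random forest to `noisy_x` rather than `true_x` is what distorts the performance and importance measures described in the abstract.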

Keywords: machine learning; measurement error; misclassification; noise; quantitative bias analysis; random forests.


Figures

Figure 1
Performance of random forests in predicting suicide attempts before and after adjustment for nondifferential predictor misclassification in the National Comorbidity Survey Replication, 2001–2003. The solid line (top line) represents the negative predictive value. The dashed line (second from top) represents the area under the receiver operating characteristic curve. The long-dashed line (third from top) represents sensitivity; it overlaps with the 2-dashed line representing accuracy. The dotted line (second from bottom) represents specificity. The dotted-dashed line (bottom line) represents the positive predictive value.
Figure 2
Variable importance of random forests in predicting suicide attempts in the original National Comorbidity Survey Replication data set, 2001–2003. The mean decrease in accuracy represents the reduction in the overall accuracy of the random-forest model when a predictor is permuted. “Depression” refers to major depressive disorder. PTSD, posttraumatic stress disorder.
Figure 3
Variable importance of random forests in predicting suicide attempts in the bias-adjusted data set adjusting for nondifferential misclassification of predictors, National Comorbidity Survey Replication, 2001–2003. The mean decrease in accuracy represents the reduction in the overall accuracy of the random-forest model when a predictor is permuted. “Depression” refers to major depressive disorder. PTSD, posttraumatic stress disorder.
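The "mean decrease in accuracy" described in Figures 2 and 3 can be approximated by permutation importance. A minimal sketch using scikit-learn (the paper's software is not specified here; the simulated predictors and coefficients below are illustrative, with X1 driving the outcome):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
X = rng.binomial(1, 0.5, size=(n, 4)).astype(float)  # four binary predictors
logit = -2 + 2.0 * X[:, 0] + 0.5 * X[:, 1]           # X1 (column 0) drives the outcome
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permuting one predictor at a time and measuring the drop in held-out
# accuracy approximates the mean decrease in accuracy.
imp = permutation_importance(rf, X_te, y_te, scoring="accuracy",
                             n_repeats=20, random_state=0)
```

Here `imp.importances_mean` ranks the predictors; misclassifying a strong predictor like X1 before fitting would shrink its measured importance, which is the distortion the figures document.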
Figure 4
Performance of random forests in simulated informative data set 1 after inducing misclassification. Each point indicates the median performance of the model across 10,000 simulations. The bars indicate the 95% simulation interval (2.5th and 97.5th percentiles of the 10,000 iterations describing where 95% of the simulated estimates lie). A) Model performance in the original simulated informative data set 1 without misclassification; B) model performance under the scenario of nondifferential misclassification of all predictors in which the sensitivity of each predictor was 0.45 and specificity was 0.99; C) model performance under the scenario of differential misclassification of all predictors in which the sensitivity of each predictor among persons with the outcome was 0.50, specificity of each predictor among those with the outcome was 0.90, the sensitivity of each predictor among those without the outcome was 0.45, and specificity of each predictor among those without the outcome was 0.95; D) model performance under the scenario of nondifferential outcome misclassification in which the sensitivity of the outcome was 0.85 and specificity was 0.99; E) model performance under the scenario of differential outcome misclassification in which the sensitivity of the outcome among persons with predictor X1 was 0.9, specificity among those with predictor X1 was 0.95, the sensitivity among those without predictor X1 was 0.85, and specificity among those without predictor X1 was 0.99. AUC, area under the receiver operating characteristic curve; NPV, negative predictive value; PPV, positive predictive value.
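The 95% simulation interval used throughout Figures 4–7 is simply the 2.5th and 97.5th percentiles of the estimates across iterations. A sketch (the 10,000 AUC values below are stand-ins drawn from a normal distribution, not the study's results):

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for 10,000 simulated AUC estimates (values are illustrative).
sim_auc = rng.normal(loc=0.80, scale=0.02, size=10_000)

median = np.median(sim_auc)
lo, hi = np.percentile(sim_auc, [2.5, 97.5])  # 95% simulation interval
```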
Figure 5
Variable importance of random forests in simulated informative data set 2 after inducing misclassification. Each point indicates the median importance value of the random-forest variable across 10,000 simulations. The bars indicate the 95% simulation interval (2.5th and 97.5th percentiles of the 10,000 iterations describing where 95% of the simulated estimates lie). A) Variable importance in the original simulated informative data set 2 without misclassification; B) variable importance under the scenario of nondifferential misclassification of X1 in which the sensitivity of predictor X1 was 0.45 and specificity was 0.99; C) variable importance under the scenario of differential misclassification of X1 in which the sensitivity of predictor X1 among persons with the outcome was 0.50, specificity of predictor X1 among those with the outcome was 0.95, the sensitivity of predictor X1 among those without the outcome was 0.45, and specificity of predictor X1 among those without the outcome was 0.90; D) variable importance under the scenario of nondifferential outcome misclassification in which the sensitivity of the outcome was 0.90 and specificity was 0.99; E) variable importance under the scenario of differential outcome misclassification in which the sensitivity of the outcome among persons with predictor X1 was 0.9, specificity among those with predictor X1 was 0.99, the sensitivity among those without predictor X1 was 0.85, and specificity among those without predictor X1 was 0.95.
Figure 6
Variable importance of random forests in the simulated uninformative data set after inducing misclassification. Each point indicates the median importance value of the random-forest variable across 10,000 simulations. The bars indicate the 95% simulation interval (2.5th and 97.5th percentiles of the 10,000 iterations describing where 95% of the simulated estimates lie). A) Variable importance in the original simulated uninformative data set without misclassification; B) variable importance under the scenario of nondifferential misclassification of X5 in which the sensitivity of predictor X5 was 0.45 and specificity was 0.99; C) variable importance under the scenario of differential misclassification of X5 in which the sensitivity of predictor X5 among persons with the outcome was 0.45, specificity of predictor X5 among those with the outcome was 0.99, the sensitivity of predictor X5 among those without the outcome was 0.40, and specificity of predictor X5 among those without the outcome was 0.95; D) variable importance under the scenario of nondifferential outcome misclassification in which the sensitivity of the outcome was 0.85 and specificity was 0.99; E) variable importance under the scenario of differential outcome misclassification in which the sensitivity of the outcome among persons with predictor X5 was 0.9, specificity among those with predictor X5 was 0.99, the sensitivity among those without predictor X5 was 0.85, and specificity among those without predictor X5 was 0.95.
Figure 7
Performance of random forests in simulated informative data set 1 after conducting bias adjustment for misclassification. Each point indicates the median performance of the model across 10,000 simulations. The bars indicate the 95% simulation interval (2.5th and 97.5th percentiles of the 10,000 iterations describing where 95% of the simulated estimates lie). The misclassification probabilities for each misclassification scenario are the same as those described in the legend of Figure 4. A) Model performance in the original simulated informative data set 1 without misclassification; B) model performance after conducting bias adjustment for nondifferential misclassification of all predictors in which the sensitivity of each predictor was 0.45 and specificity was 0.99; C) model performance after conducting bias adjustment for differential misclassification of all predictors in which the sensitivity of each predictor among persons with the outcome was 0.50, specificity of each predictor among those with the outcome was 0.90, the sensitivity of each predictor among those without the outcome was 0.45, and specificity of each predictor among those without the outcome was 0.95; D) model performance after conducting bias adjustment for nondifferential outcome misclassification in which the sensitivity of the outcome was 0.85 and specificity was 0.99; E) model performance after conducting bias adjustment for differential outcome misclassification in which the sensitivity of the outcome among persons with predictor X1 was 0.9, specificity among those with predictor X1 was 0.95, the sensitivity among those without predictor X1 was 0.85, and specificity among those without predictor X1 was 0.99. AUC, area under the receiver operating characteristic curve; NPV, negative predictive value; PPV, positive predictive value.
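One common form of record-level quantitative bias analysis reclassifies each observed value using predictive values back-calculated from the assumed sensitivity and specificity. The sketch below shows that idea for a nondifferentially misclassified binary variable; it is an illustration of the general technique, not necessarily the authors' exact implementation, and the prevalence is hypothetical:

```python
import numpy as np

def bias_adjust(observed, se, sp, rng):
    """Probabilistically reclassify a misclassified binary 0/1 array,
    given assumed sensitivity `se` and specificity `sp` (a sketch of
    record-level quantitative bias analysis)."""
    p_obs = observed.mean()
    p_true = (p_obs + sp - 1) / (se + sp - 1)  # back-calculated true prevalence
    ppv = se * p_true / p_obs                  # P(true 1 | observed 1)
    npv = sp * (1 - p_true) / (1 - p_obs)      # P(true 0 | observed 0)
    return np.where(observed == 1,
                    rng.binomial(1, ppv, observed.size),
                    rng.binomial(1, 1 - npv, observed.size))

rng = np.random.default_rng(3)
true_x = rng.binomial(1, 0.3, 50_000)
# Induce nondifferential misclassification (sens 0.45, spec 0.99, as in
# panel B of Figure 4), then adjust it back.
noisy = np.where(true_x == 1, rng.binomial(1, 0.45, true_x.size),
                 rng.binomial(1, 0.01, true_x.size))
adjusted = bias_adjust(noisy, 0.45, 0.99, rng)
```

The adjusted variable recovers roughly the true prevalence, which is why model performance and variable importance computed on bias-adjusted data can approach the results from the error-free data.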


