2012 Oct 15;31(23):2577-87.
doi: 10.1002/sim.5328. Epub 2012 Mar 13.

Misuse of DeLong test to compare AUCs for nested models

Olga V Demler et al. Stat Med.

Abstract

The area under the receiver operating characteristic curve (AUC of ROC) is a widely used measure of discrimination in risk prediction models. Routinely, the Mann-Whitney statistic is used as an estimator of AUC, while the change in AUC is tested by the DeLong test. However, very often, in settings where the model is developed and tested on the same dataset, the added predictor is statistically significantly associated with the outcome but fails to produce a significant improvement in the AUC. No conclusive resolution exists to explain this finding. In this paper, we show that the reason lies in the inappropriate application of the DeLong test in the setting of nested models. Using numerical simulations and a theoretical argument based on generalized U-statistics, we show that if the added predictor is not statistically significantly associated with the outcome, the null distribution is non-normal, contrary to the assumption of the DeLong test. Our simulations of different scenarios show that the loss of power caused by this misuse of the DeLong test leads to a conservative test for small and moderate effect sizes. This problem does not exist for predictors that are associated with the outcome or for non-nested models. We suggest that for nested models, only the test of association be performed for the new predictors, and if the result is significant, the change in AUC be estimated with an appropriate confidence interval, which can be based on the DeLong approach.
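As a rough illustration of the quantities discussed in the abstract, the sketch below computes the Mann-Whitney estimate of AUC and simulates the in-sample change in AUC when a predictor unrelated to the outcome is added to a nested model. It is not the authors' code. The function names, sample size, number of simulations, and data-generating model are all assumptions chosen for illustration, and the models are fitted by ordinary least squares on the binary outcome rather than by the procedure used in the paper.

```python
import numpy as np


def mann_whitney_auc(scores, labels):
    """Mann-Whitney estimate of AUC: the fraction of (case, control) pairs
    in which the case has the higher score, counting ties as one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    cases, controls = scores[labels], scores[~labels]
    if cases.size == 0 or controls.size == 0:
        raise ValueError("need at least one case and one control")
    # Pairwise differences between every case score and every control score.
    diff = cases[:, None] - controls[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def null_delta_auc(n=500, n_sims=200, seed=0):
    """Simulate the in-sample change in AUC when a pure-noise predictor x2
    is added to a model that already contains x1 (hypothetical settings)."""
    rng = np.random.default_rng(seed)
    deltas = np.empty(n_sims)
    for i in range(n_sims):
        x1 = rng.normal(size=n)
        x2 = rng.normal(size=n)  # assumption: x2 is unrelated to the outcome
        # Binary outcome depends on x1 only (logistic link, assumed).
        y = rng.random(n) < 1.0 / (1.0 + np.exp(-x1))
        full = np.column_stack([np.ones(n), x1, x2])
        base = full[:, :2]  # nested model: intercept and x1
        # Least-squares fit of the 0/1 outcome, used here in place of
        # logistic regression to keep the sketch dependency-free.
        beta_base = np.linalg.lstsq(base, y.astype(float), rcond=None)[0]
        beta_full = np.linalg.lstsq(full, y.astype(float), rcond=None)[0]
        deltas[i] = (mann_whitney_auc(full @ beta_full, y)
                     - mann_whitney_auc(base @ beta_base, y))
    return deltas
```

For example, `mann_whitney_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])` returns 0.75, because three of the four case-control pairs are ordered correctly. A histogram of the values returned by `null_delta_auc()` can be compared with a normal curve to examine the shape of the null distribution under these assumed settings.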


Figures

Figure 1
Scatterplot of p-values produced by the F-test versus the corresponding p-values produced by the DeLong test. 1000 simulations of multivariate normal data with a sample size of 8261.
Figure 2
Histogram of the change in eAUC under the null hypothesis for multivariate normal data with a sample size of 8365, with a superimposed plot of the corresponding distribution function used by the DeLong test.
Figure 3
(A) Power of the Wald test, the DeLong test, and a bootstrap-based test for different conditional effect sizes. Based on real-life data with a sample size of 8261 (621 cases); baseline AUC is 0.76. (B) Power of the F-test, the DeLong test, and a bootstrap-based test for different conditional effect sizes. Based on simulated multivariate normal data with a sample size of 700 (53 cases); baseline AUC is 0.76.
Figure 4
Histogram of the distribution of the change in eAUC under the alternative hypothesis. Simulations were performed for a conditional effect size of 0.25 on multivariate normal data with a sample size of 8261.

