Biomed Eng Comput Biol. 2019 Jul 15:10:1179597219858954.
doi: 10.1177/1179597219858954. eCollection 2019.

Extending Classification Algorithms to Case-Control Studies

Bryan Stanfill et al. Biomed Eng Comput Biol. 2019.

Abstract

Classification is a common technique applied to 'omics data to build predictive models and identify potential markers of biomedical outcomes. Despite the prevalence of case-control studies, very few classification methods are available to analyze the data they generate. Conditional logistic regression is the most commonly used technique, but its modeling assumptions limit its ability to identify a large class of sufficiently complicated 'omic signatures. We propose a data preprocessing step that makes any linear or nonlinear classification algorithm, even those not typically appropriate for matched design data, available for modeling case-control data and identifying relevant biomarkers in these study designs. We demonstrate on simulated case-control data that both the classification accuracy and the variable selection accuracy of each method improve after applying this preprocessing step, and that the proposed methods are comparable to or outperform existing variable selection methods. Finally, we demonstrate the impact of conditional classification algorithms on a large cohort study of children with islet autoimmunity.
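
The abstract does not spell out the preprocessing step, so the following is only an illustrative sketch of one common way to account for a matched design before classification: centering each feature within its matched set, so that between-pair differences are removed and only within-pair contrasts remain. This is an assumption and is not claimed to be the authors' exact procedure; the function name `center_within_pairs` and the toy data are hypothetical.

```python
from collections import defaultdict


def center_within_pairs(X, pair_ids):
    """Subtract each matched set's feature means from its members' rows.

    ASSUMPTION: within-pair centering stands in for the paper's
    preprocessing step, which the abstract does not describe in detail.
    """
    # Group row indices by matched-set identifier.
    groups = defaultdict(list)
    for row_idx, pid in enumerate(pair_ids):
        groups[pid].append(row_idx)

    n_features = len(X[0])
    centered = [None] * len(X)
    for rows in groups.values():
        # Per-feature mean over the members of this matched set.
        means = [sum(X[i][j] for i in rows) / len(rows) for j in range(n_features)]
        for i in rows:
            centered[i] = [X[i][j] - means[j] for j in range(n_features)]
    return centered


# Hypothetical toy data: two matched pairs, two features each.
X = [[5.0, 1.0], [3.0, 1.0], [10.0, 4.0], [8.0, 2.0]]
pair_ids = ["a", "a", "b", "b"]
y = [1, 0, 1, 0]  # 1 = case, 0 = control

X_centered = center_within_pairs(X, pair_ids)
```

Under this assumption, `X_centered` and `y` could then be passed to any standard classifier in place of the raw features, which is the sense in which a preprocessing step can make unmatched methods usable on matched data.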

Keywords: Diabetes; biomarker discovery; machine learning; support vector machines; variable selection.

Conflict of interest statement

Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1.
A single dataset from the 2-variable simulation study is plotted in its raw form (A) and after controlling for the case-control design (B). An SVM with a radial-basis kernel function was trained on the pair-corrected data, and the decision boundaries closely align with the true boundaries between classes (C). SVM indicates support vector machine.
Figure 2.
Scatter plots of the classification accuracies for all conditional (x-axis) and standard (y-axis) methods and values of |δ|=0.125 when δ. The color of each point indicates which version, standard (red) or conditional (black), of each method is more accurate for each simulated dataset. LDA indicates linear discriminant analysis; LR, logistic regression; NB, Naive Bayes; RBF, radial basis function; RF, random forests; SVM, support vector machine.
Figure 3.
Scatter plots of the variable selection accuracies for all conditional (x-axis) and standard (y-axis) methods and values of |δ|=0.125 when δ. The color of each point indicates which version, standard (red) or conditional (black), of each algorithm is more accurate for each simulated dataset. LDA indicates linear discriminant analysis; LR, logistic regression; NB, Naive Bayes; RBF, radial basis function; RF, random forests; SVM, support vector machine.
Figure 4.
Box plots of the 200 repeated 5-fold cross-validation accuracies for the 4 different data types and 6 different classification algorithms. LDA indicates linear discriminant analysis; LR, logistic regression; NB, Naive Bayes; RBF, radial basis function; RF, random forests; SVM, support vector machine.
Figure 5.
The average rank of each conditional method for each data type where the algorithm with the lowest rank was the most accurate within each repeated cross-validation (CV) run. CLDA indicates conditional linear discriminant analysis; CLR, conditional logistic regression; CNB, conditional Naive Bayes; CRF, conditional random forests; CSVM, conditional support vector machine; RBF, radial basis function; SNP, single-nucleotide polymorphism.
