Extending Classification Algorithms to Case-Control Studies

Bryan Stanfill¹, Sarah Reehl¹, Lisa Bramer¹, Ernesto S Nakayasu², Stephen S Rich³, Thomas O Metz², Marian Rewers⁴, Bobbie-Jo Webb-Robertson²; TEDDY Study Group

Affiliations

¹ Computing and Analytics Division, National Security Directorate, Pacific Northwest National Laboratory, Richland, WA, USA.
² Biological Sciences Division, Earth and Biological Sciences Directorate, Pacific Northwest National Laboratory, Richland, WA, USA.
³ Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA.
⁴ Barbara Davis Center for Childhood Diabetes, University of Colorado Denver, Aurora, CO, USA.

PMID: 31320812
PMCID: PMC6630079
DOI: 10.1177/1179597219858954

Bryan Stanfill et al. Biomed Eng Comput Biol. 2019 Jul 15;10:1179597219858954. doi: 10.1177/1179597219858954. eCollection 2019.

Abstract

Classification is commonly applied to 'omics data to build predictive models and identify potential markers of biomedical outcomes. Despite the prevalence of case-control studies, very few classification methods are available for analyzing the data they generate. Conditional logistic regression is the most commonly used technique, but its modeling assumptions limit its ability to identify a large class of sufficiently complicated 'omic signatures. We propose a data preprocessing step that makes any linear or nonlinear classification algorithm, even those not typically appropriate for matched-design data, usable for modeling case-control data and identifying relevant biomarkers in these study designs. We demonstrate on simulated case-control data that both the classification accuracy and the variable selection accuracy of each method improve after applying this preprocessing step, and that the proposed methods are comparable to or outperform existing variable selection methods. Finally, we demonstrate the impact of conditional classification algorithms on a large cohort study of children with islet autoimmunity.

Keywords: Diabetes; biomarker discovery; machine learning; support vector machines; variable selection.
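
The abstract does not spell out the preprocessing step, so the following is only a minimal sketch of one plausible form of it, under the assumption that "controlling for the case-control design" means centering each feature within its matched set. The function name `center_within_pairs` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def center_within_pairs(rows, pair_ids):
    """Subtract each matched set's feature-wise mean from its members.

    ASSUMPTION: within-set mean-centering stands in for the paper's
    preprocessing step; the abstract does not give the exact transform.
    """
    # Group row indices by matched-set identifier.
    groups = defaultdict(list)
    for idx, pid in enumerate(pair_ids):
        groups[pid].append(idx)

    n_features = len(rows[0])
    centered = [None] * len(rows)
    for members in groups.values():
        # Feature-wise mean over the members of this matched set.
        means = [
            sum(rows[i][j] for i in members) / len(members)
            for j in range(n_features)
        ]
        for i in members:
            centered[i] = [rows[i][j] - means[j] for j in range(n_features)]
    return centered


# Hypothetical toy data: two matched pairs (case row first, control row second).
X = [[5.0, 1.0], [3.0, 1.0], [10.0, 4.0], [8.0, 2.0]]
pairs = ["a", "a", "b", "b"]
Xc = center_within_pairs(X, pairs)
```

After centering, the pair-level offsets (here, the large baseline difference between pairs "a" and "b") no longer appear in the features, so the centered rows could be handed to an ordinary classifier, for example scikit-learn's `SVC(kernel="rbf")`, without that classifier needing to know about the matching.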

Conflict of interest statement

Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

**Figure 1.**
A single dataset from the 2-variable simulation study is plotted in its raw form (A) and after controlling for the case-control design (B). An SVM with a radial-basis kernel function was trained on the pair-corrected data, and its decision boundaries closely align with the true boundaries between classes (C). SVM indicates support vector machine.

**Figure 2.**
Scatter plots of the classification accuracies for all conditional (x-axis) and standard (y-axis) methods when $|\delta| = 0.125$. The color of each point indicates which version, standard (red) or conditional (black), of each method is more accurate for each simulated dataset. LDA indicates linear discriminant analysis; LR, logistic regression; NB, Naive Bayes; RBF, radial basis function; RF, random forests; SVM, support vector machine.

**Figure 3.**
Scatter plots of the variable selection accuracies for all conditional (x-axis) and standard (y-axis) methods when $|\delta| = 0.125$. The color of each point indicates which version, standard (red) or conditional (black), of each algorithm is more accurate for each simulated dataset. LDA indicates linear discriminant analysis; LR, logistic regression; NB, Naive Bayes; RBF, radial basis function; RF, random forests; SVM, support vector machine.

**Figure 4.**
Box plots of the 200 repeated 5-fold cross-validation accuracies for the 4 different data types and 6 different classification algorithms. LDA indicates linear discriminant analysis; LR, logistic regression; NB, Naive Bayes; RBF, radial basis function; RF, random forests; SVM, support vector machine.

**Figure 5.**
The average rank of each conditional method for each data type where the algorithm with the lowest rank was the most accurate within each repeated cross-validation (CV) run. CLDA indicates conditional linear discriminant analysis; CLR, conditional logistic regression; CNB, conditional Naive Bayes; CRF, conditional random forests; CSVM, conditional support vector machine; RBF, radial basis function; SNP, single-nucleotide polymorphism.
