BMC Med Genet. 2011 Jun 30;12:90.
doi: 10.1186/1471-2350-12-90.

Genome Wide Association Study to predict severe asthma exacerbations in children using random forests classifiers

Mousheng Xu et al. BMC Med Genet. 2011.

Abstract

Background: Personalized health-care promises tailored health-care solutions to individual patients based on their genetic background and/or environmental exposure history. To date, disease prediction has been based on a few environmental factors and/or single nucleotide polymorphisms (SNPs), while complex diseases are usually affected by many genetic and environmental factors with each factor contributing a small portion to the outcome. We hypothesized that the use of random forests classifiers to select SNPs would result in an improved predictive model of asthma exacerbations. We tested this hypothesis in a population of childhood asthmatics.

Methods: We defined a severe asthma exacerbation as an emergency room visit or hospitalization. We first ranked Genome Wide Association Study (GWAS) SNPs by Random Forests (RF) importance score in the Childhood Asthma Management Program (CAMP) population of 127 exacerbation cases and 290 non-exacerbation controls. We then predicted severe asthma exacerbations using the top 10 to 320 SNPs together with age, sex, pre-bronchodilator FEV1 percentage predicted, and treatment group.
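The selection step in the Methods (rank SNPs by importance, then keep the top 10 to 320) can be sketched as follows. This is a minimal illustration, not the authors' code: the SNP identifiers and importance scores are made up, the helper name `top_k_subsets` is hypothetical, and the doubling grid of subset sizes between 10 and 320 is an assumption, since the abstract gives only the endpoints.

```python
def top_k_subsets(importance, sizes=(10, 20, 40, 80, 160, 320)):
    """Return {k: [snp ids]} with the k highest-importance SNPs for each k.

    `importance` maps a SNP id to its importance score.
    The default `sizes` grid is an assumption, not taken from the source.
    """
    # Sort SNP ids from most to least important.
    ranked = sorted(importance, key=importance.get, reverse=True)
    # Each subset is a prefix of the ranking, so the subsets are nested.
    return {k: ranked[:k] for k in sizes if k <= len(ranked)}


# Hypothetical scores for 400 made-up SNP ids (rs0 is the most important).
importance = {f"rs{i}": 1.0 / (i + 1) for i in range(400)}
subsets = top_k_subsets(importance)
```

Because every subset is a prefix of the same ranking, a model built on 20 SNPs always contains the 10 SNPs used by the smaller model, which makes the AUC values at different subset sizes directly comparable.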

Results: Testing in an independent set of the CAMP population showed that severe asthma exacerbations can be predicted with an area under the curve (AUC) of 0.66 using 160-320 SNPs, compared with an AUC of 0.57 using 10 SNPs. Clinical traits alone yielded an AUC of 0.54, suggesting that the phenotype is affected by genetic as well as environmental factors.
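For readers unfamiliar with the metric, the AUC reported here can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The sketch below computes it that way on toy data. The function name `auc` and all scores and labels are illustrative assumptions, not values from the study.

```python
def auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    `labels` uses 1 for a case and 0 for a control; tied scores count as half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy examples (made-up scores): perfect separation, no information, partial.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])        # 1.0
uninformative = auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])  # 0.5
mixed = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])          # 0.75
```

On this scale 0.5 corresponds to random guessing, which is why the reported values of 0.54, 0.57, and 0.66 are each judged against that baseline.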

Conclusions: Our study shows that a random forests algorithm can effectively extract and use the information contained in a small number of samples. Random forests, and other machine learning tools, can be used with GWAS studies to integrate large numbers of predictors simultaneously.

Figures

Figure 1
The "Manhattan plot" of RF importance scores for all 550k SNPs. X-axis: the SNPs in chromosomal order; Y-axis: the RF importance scores. The black demarcation separates the top 4k SNPs from the rest.
Figure 2
Comparison of performance in predicting severe asthma exacerbation with different methods. Y-axis: AUC; X-axis: the number of SNPs used in a model. "Random SNPs": SNPs are chosen randomly from all SNPs and used as input variables to predict asthma exacerbations; this process was repeated 10 times [see Methods for details]. "Permuted": asthma exacerbation status is permuted across samples while clinical traits and SNPs are kept with the samples; this process was repeated 10 times [see Methods for details]. "Training": the AUC of the model built with all the Stage 1 samples, predicting on the same samples. "Internal cross-validation": the AUC of the model built with a randomly selected 90% of the Stage 1 samples, predicting on the remaining 10% of the Stage 1 samples. "Independent replication": the AUC of the model built with all the Stage 1 samples, predicting on all the Stage 2 samples.
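The "Permuted" baseline in this caption can be illustrated with a simplified sketch. The caption permutes the outcome across samples and repeats the process 10 times. The code below keeps only that core idea: it holds a fixed vector of predicted scores and shuffles the case/control labels, rather than rebuilding a model on each permuted data set as the study's pipeline presumably does. All data are synthetic, and the names `auc` and `permuted_aucs` are hypothetical.

```python
import random


def auc(scores, labels):
    """Fraction of (case, control) pairs ranked correctly; ties count as half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > n else 0.5 if c == n else 0.0
               for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


def permuted_aucs(scores, labels, n_iter=10, seed=0):
    """AUCs obtained after shuffling the labels across samples n_iter times."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_iter):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # breaks any link between score and outcome
        results.append(auc(scores, shuffled))
    return results


# Synthetic data: 12 cases and 28 controls, cases scoring higher on average.
rng = random.Random(1)
labels = [1] * 12 + [0] * 28
scores = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in labels]

observed = auc(scores, labels)
null_aucs = permuted_aucs(scores, labels, n_iter=10)
```

Comparing `observed` with the spread of `null_aucs` shows how far a model's AUC sits from what label shuffling alone produces, which is the role the "Permuted" curve plays in the figure.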
Figure 3
ROC curves using clinical attributes plus 160 SNPs as predictors. The red curve is for training on the Stage 1 samples, the blue curve is for testing on the Stage 2 samples, and the grey diagonal line is the theoretical curve for random guessing. Both the red and the blue curves lie above the grey line, indicating better-than-random prediction, and they are similar to each other, suggesting that the predictive ability of the RF model is genuine. The p-value for the independent-testing AUC differing from 0.5 is 0.000266.
Figure 4
Performance comparison of predicting severe asthma exacerbation with or without clinical traits. Y-axis: AUC; X-axis: the number of SNPs used for prediction. Blue: SNPs plus clinical traits; Red: SNPs alone.
