BMC Bioinformatics. 2012;13 Suppl 14(Suppl 14):S2.
doi: 10.1186/1471-2105-13-S14-S2. Epub 2012 Sep 7.

Bag of Naïve Bayes: biomarker selection and classification from genome-wide SNP data


Francesco Sambo et al. BMC Bioinformatics. 2012.

Abstract

Background: Multifactorial diseases arise from complex patterns of interaction between a set of genetic traits and the environment. To fully capture the genetic biomarkers that jointly explain the heritability component of a disease, all SNPs from a genome-wide association study should therefore be analyzed simultaneously.

Results: In this paper, we present Bag of Naïve Bayes (BoNB), an algorithm for genetic biomarker selection and subject classification from the simultaneous analysis of genome-wide SNP data. BoNB is based on the Naïve Bayes classification framework, enriched by three main features: bootstrap aggregating of an ensemble of Naïve Bayes classifiers; a novel strategy for ranking and selecting the attributes used by each classifier in the ensemble; and a permutation-based procedure for selecting significant biomarkers, based on their marginal utility in the classification process. BoNB is tested on the Wellcome Trust Case-Control study on Type 1 Diabetes, and its performance is compared with that of both a standard Naïve Bayes algorithm and HyperLASSO, a state-of-the-art penalized logistic regression algorithm for simultaneous genome-wide data analysis.
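The first of the three features, bootstrap aggregating of Naïve Bayes classifiers, can be illustrated with a minimal sketch. It assumes genotypes coded as 0/1/2, Laplace smoothing, and toy data; the function names are hypothetical, and the paper's attribute ranking and selection strategy is not reproduced here (every classifier simply uses all attributes).

```python
import math
import random
from collections import Counter

def train_nb(X, y, n_values=3, alpha=1.0):
    """Fit a categorical Naive Bayes model with Laplace smoothing (alpha)."""
    classes = sorted(set(y))
    class_count = Counter(y)
    log_prior = {c: math.log(class_count[c] / len(y)) for c in classes}
    log_lik = {}  # log_lik[c][j][v] = log P(x_j = v | class c)
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        denom = len(rows) + alpha * n_values
        log_lik[c] = []
        for j in range(len(X[0])):
            counts = Counter(row[j] for row in rows)
            log_lik[c].append(
                [math.log((counts[v] + alpha) / denom) for v in range(n_values)]
            )
    return classes, log_prior, log_lik

def predict_proba(model, x):
    """Posterior class probabilities for one subject x."""
    classes, log_prior, log_lik = model
    scores = [
        log_prior[c] + sum(log_lik[c][j][v] for j, v in enumerate(x))
        for c in classes
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return {c: w / total for c, w in zip(classes, weights)}

def train_bag(X, y, n_models=25, seed=0):
    """Train one classifier per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_nb([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bag_predict_proba(models, x):
    """Average the class probabilities over the whole ensemble."""
    totals = Counter()
    for model in models:
        for c, p in predict_proba(model, x).items():
            totals[c] += p
    return {c: p / len(models) for c, p in totals.items()}

# Toy data (an assumption): SNP 0 is informative, SNP 1 is mostly noise.
X = [[0, 1], [0, 2], [0, 0], [1, 1], [2, 0], [2, 2], [2, 1], [1, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
models = train_bag(X, y)
print(bag_predict_proba(models, [0, 1]))
print(bag_predict_proba(models, [2, 0]))
```

Averaging probabilities rather than taking a majority vote mirrors the description in the abstract and the Figure 1 caption; the ensemble size of 25 is an arbitrary choice for the toy example.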

Conclusions: The significantly higher classification accuracy obtained by BoNB, together with the significance of the biomarkers identified from the Type 1 Diabetes dataset, demonstrates the effectiveness of BoNB as an algorithm for both classification and biomarker selection from genome-wide SNP data.

Availability: Source code of the BoNB algorithm is released under the GNU General Public Licence and is available at http://www.dei.unipd.it/~sambofra/bonb.html.

Figures

Figure 1
Schematics of the BoNB algorithm: B Bootstrap samples {X(1) . . . X(B)} are drawn from a GWAS training dataset X; B Naïve Bayes Classifiers (NBC) are trained on the Bootstrap samples, with the novel procedure for attribute ranking and selection; predictions of unseen subjects from a GWAS test dataset are carried out independently by each NBC and class probabilities are then averaged; biomarker selection is carried out with the novel permutation-based procedure, exploiting Out-of-Bag (OOB) samples.
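The role of the Out-of-Bag samples in the schematic can be sketched as a generic permutation check: shuffle one attribute across the OOB subjects and measure how much accuracy is lost. This is an illustration under stated assumptions, not the authors' exact statistical procedure; the helper names and the stand-in classifier `predict` (which looks only at SNP 0) are hypothetical.

```python
import random

def oob_indices(n, rng):
    """Draw one bootstrap sample; return (in_bag, out_of_bag) index lists."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob

def accuracy(predict, X, y):
    """Fraction of subjects classified correctly (0.0 for an empty set)."""
    if not y:
        return 0.0
    return sum(predict(x) == label for x, label in zip(X, y)) / len(y)

def permutation_drop(predict, X, y, feature, rng):
    """Accuracy lost when one attribute column is shuffled across subjects."""
    baseline = accuracy(predict, X, y)
    column = [x[feature] for x in X]
    rng.shuffle(column)
    X_perm = [x[:feature] + [v] + x[feature + 1:] for x, v in zip(X, column)]
    return baseline - accuracy(predict, X_perm, y)

def predict(x):
    """Stand-in classifier (an assumption): uses only SNP 0."""
    return 1 if x[0] == 2 else 0

# Toy genotypes (an assumption): the label depends on SNP 0 only.
rng = random.Random(0)
X = [[rng.choice([0, 1, 2]), rng.choice([0, 1, 2])] for _ in range(40)]
y = [1 if x[0] == 2 else 0 for x in X]

in_bag, oob = oob_indices(len(X), rng)
X_oob = [X[i] for i in oob]
y_oob = [y[i] for i in oob]

drop_signal = permutation_drop(predict, X_oob, y_oob, 0, rng)
drop_noise = permutation_drop(predict, X_oob, y_oob, 1, rng)
print(drop_signal, drop_noise)
```

Shuffling an attribute the classifier never uses leaves accuracy unchanged, so its drop is zero; an attribute with real marginal utility tends to show a positive drop. In a full ensemble, each classifier would be evaluated on its own OOB subjects and the drops compared against a null distribution.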
Figure 2
Box plots of MCC (left panel) and classification accuracy (right panel) of the standard Naïve Bayes classifier, HyperLASSO and BoNB on ten random subsamplings of the WTCCC T1D dataset. The dashed lines represent the classification performance of a majority classifier.
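For readers unfamiliar with the two metrics in Figure 2, the sketch below computes the Matthews Correlation Coefficient (MCC) and classification accuracy for a binary problem, along with the majority-classifier baseline shown as dashed lines. The labels and predictions are toy values, not data from the paper, and the convention of returning 0 when the MCC denominator vanishes is an assumption (a common one).

```python
import math
from collections import Counter

def confusion(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def majority_predictions(y_train, n):
    """Always predict the most frequent training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return [label] * n

# Toy labels and predictions (an assumption, not WTCCC data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
y_major = majority_predictions(y_true, len(y_true))

print(accuracy(y_true, y_pred), mcc(y_true, y_pred))    # 0.75 and 7/15
print(accuracy(y_true, y_major), mcc(y_true, y_major))  # 0.625 and 0.0
```

The example shows why both metrics are reported: on unbalanced classes a majority classifier can reach a respectable accuracy while its MCC stays at zero.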
Figure 3
Precision vs Recall curve (left panel) and Receiver Operating Characteristic (right panel) of the standard Naïve Bayes classifier, HyperLASSO and BoNB on a random subsampling of the WTCCC T1D dataset.
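Both curves in Figure 3 are obtained by sweeping a decision threshold over the predicted class probabilities. A minimal sketch follows, assuming binary labels (1 = case), distinct scores (ties are not handled), and toy values; `curve_points` and `area_under` are hypothetical helper names.

```python
def curve_points(y_true, scores):
    """Lower the threshold one subject at a time; return (roc, pr) points.

    roc holds (false positive rate, true positive rate) pairs;
    pr holds (recall, precision) pairs.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    tp = fp = 0
    roc = [(0.0, 0.0)]
    pr = []
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        roc.append((fp / n_neg, tp / n_pos))
        pr.append((tp / n_pos, tp / (tp + fp)))
    return roc, pr

def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy scores (an assumption): higher score means more likely a case.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

roc, pr = curve_points(y_true, scores)
roc_auc = area_under(roc)
print(roc_auc)  # 8/9: eight of the nine case/control pairs are ranked correctly
```

The area under the ROC curve equals the fraction of case/control pairs in which the case receives the higher score, which gives a quick sanity check on any implementation.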
Figure 4
Naïve Bayes attribute score vs χ2 statistic for all SNPs in the WTCCC T1D dataset.
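The χ2 axis of Figure 4 refers to the usual single-SNP association test. The sketch below computes Pearson's χ2 statistic for one SNP, assuming a 2 x 3 table of case/control counts by genotype; the counts are invented for illustration, and the Naïve Bayes attribute score on the other axis is not defined on this page, so it is not reproduced.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # skip empty genotype columns
                stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df2(stat):
    """Upper-tail probability of a chi-square with 2 degrees of freedom.

    For 2 degrees of freedom the survival function is exactly exp(-x / 2),
    which applies to a 2 x 3 table with no empty row or column.
    """
    return math.exp(-stat / 2)

# Invented counts: rows are cases and controls, columns are genotypes AA, Aa, aa.
table = [[30, 50, 20],
         [50, 40, 10]]

stat = chi_square(table)
print(stat, p_value_df2(stat))  # statistic is 85/9, about 9.44
```

A genome-wide scan would repeat this test for every SNP and correct for multiple testing; a table with identical rows gives a statistic of zero.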
Figure 5
Box plots of the MCC obtained by BoNB on ten random subsamplings of the WTCCC T1D dataset, for B = 200 and θ ranging from 0.02 to 0.5 (left panel) and for θ = 0.1 and B ranging from 50 to 500 (right panel).
