Genetic variants and their interactions in disease risk prediction - machine learning and network perspectives

Sebastian Okser¹, Tapio Pahikkala, Tero Aittokallio

PMID: 23448398
PMCID: PMC3606427
DOI: 10.1186/1756-0381-6-5

BioData Min. 2013 Mar 1;6(1):5.

Affiliation

¹ Turku Centre for Computer Science (TUCS), Turku, Finland. tero.aittokallio@helsinki.fi.

Abstract

A central challenge in systems biology and medical genetics is to understand how interactions among genetic loci contribute to complex phenotypic traits and human diseases. While most studies have so far relied on statistical modeling and association testing procedures, machine learning and predictive modeling approaches are increasingly being applied to mining genotype-phenotype relationships. These approaches also cover associations that do not necessarily meet statistical significance at the level of individual variants, yet still contribute to the combined predictive power at the level of variant panels. Network-based analysis of genetic variants and their interaction partners is another emerging trend, used to explore how sub-network level features contribute to complex disease processes and related phenotypes. In this review, we describe the basic concepts and algorithms behind machine learning-based genetic feature selection approaches, their potential benefits and limitations in genome-wide settings, and how physical or genetic interaction networks could be used as a priori information for providing improved predictive power and mechanistic insights into the disease networks. These developments are geared toward explaining a part of the missing heritability, and when combined with individual genomic profiling, such systems medicine approaches may also provide a principled means for tailoring personalized treatment strategies in the future.

Figures

**Figure 1**
**The figure illustrates how the external and internal cross-validation results behave as functions of the number of selected features.** The external cross-validation consists of three training/test splits. The wrapper-based feature selection method, greedy RLS [23], is run separately during each round of the external cross-validation. Greedy RLS, in turn, employs an internal leave-one-out cross-validation on the training set for scoring the feature set candidates. The red curve depicts the mean values over these internal cross-validation errors. As can be seen from the red curve, this internal cross-validation MSE used for model training keeps improving, which is expected, because the internal cross-validation quickly overfits to the training data when it is used as a selection measure. The blue curve depicts the area under the curve (AUC) on the test data held out during the external cross-validation round, that is, data completely unseen during the internal cross-validation and feature selection process. In contrast to the red curve, the blue curve starts to level off soon after the number of selected variants reaches around 10, indicating that adding further features is no longer beneficial even though the internal scoring function keeps improving. The green curve depicts the AUC of the RLS model trained using features selected by a single-locus p-value based filter method, Fisher’s exact test, which is run with the same external training/test splits as the greedy selection method. Like the blue curve, the green one also stops improving soon after a relatively small set of features has been selected. The data used in the experiments are the Wellcome Trust Case Control Consortium (WTCCC) Hypertension dataset combined with the UK National Blood Services’ controls.
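
The nested evaluation scheme described in this caption can be sketched in a few lines. The following is a minimal illustration only, not the greedy RLS implementation of [23]: it recomputes a ridge-regression (RLS) fit from scratch for every candidate feature, whereas greedy RLS obtains the same leave-one-out scores far more efficiently. The toy genotype matrix, the phenotype rule, the regularization value `lam`, and all function names are assumptions made for the example.

```python
import numpy as np

def _with_bias(X):
    # Append a constant column so the model has a (regularized) bias term.
    return np.hstack([X, np.ones((X.shape[0], 1))])

def loo_mse(X, y, lam=1.0):
    """Leave-one-out MSE of ridge regression via the hat-matrix shortcut."""
    A = _with_bias(X)
    H = A @ np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T)
    loo_residuals = (y - H @ y) / (1.0 - np.diag(H))
    return float(np.mean(loo_residuals ** 2))

def greedy_select(X, y, k, lam=1.0):
    """Forward selection: at each step add the feature with the lowest internal LOO MSE."""
    selected, errors = [], []
    for _ in range(k):
        rest = [j for j in range(X.shape[1]) if j not in selected]
        scores = [loo_mse(X[:, selected + [j]], y, lam) for j in rest]
        best = int(np.argmin(scores))
        selected.append(rest[best])
        errors.append(scores[best])
    return selected, errors

def fit_predict(X_train, y_train, X_test, lam=1.0):
    A = _with_bias(X_train)
    w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y_train)
    return _with_bias(X_test) @ w

def auc(y_true, scores):
    """AUC as the fraction of case/control pairs ranked correctly (ties count 0.5)."""
    pos, neg = scores[y_true > 0], scores[y_true <= 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Toy data (assumption): 90 samples, 20 variants coded 0/1/2,
# phenotype determined by variant 0 alone.
rng = np.random.default_rng(0)
n, p, k = 90, 20, 3
X = rng.binomial(2, 0.4, size=(n, p)).astype(float)
y = np.where(X[:, 0] >= 1, 1.0, -1.0)

# External cross-validation with three training/test splits.
folds = np.array_split(rng.permutation(n), 3)
results = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    # Feature selection sees the training part only.
    selected, internal_mse = greedy_select(X[train_idx], y[train_idx], k)
    # Held-out AUC as a function of the number of selected features.
    test_auc = [
        auc(y[test_idx],
            fit_predict(X[train_idx][:, selected[:m]], y[train_idx],
                        X[test_idx][:, selected[:m]]))
        for m in range(1, k + 1)
    ]
    results.append((selected, internal_mse, test_auc))

for selected, internal_mse, test_auc in results:
    print(selected, np.round(internal_mse, 3), np.round(test_auc, 3))
```

The key design point is that `greedy_select` never touches the held-out fold, so the printed test AUC values play the role of the blue curve while `internal_mse` plays the role of the red one.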

**Figure 2**
**Sample network visualization constructed for type 1 diabetes.** The risk variants were selected using greedy RLS on the WTCCC type 1 diabetes GWAS data and the UK National Blood Services’ controls, extended with genes selected in another work [62]. The biological processes and pathways were then mapped using DAVID [112,113], and the network was visualized with the Enrichment Map plugin for Cytoscape [114,115]. The nodes represent pathways, and the edges represent the amount of overlap between the members of the pathways. The visualized network represents a selected sub-network of complex interconnections and cross-talk between a number of pathways, including MHC-related processes and other biological pathways associated with diabetes phenotypes. The pathways were initially identified using DAVID, on the criterion that they show enrichment compared to the genome-wide background. The retrieved pathways were subsequently filtered in Cytoscape through the Enrichment Map plugin, using the false-discovery rate and overlap coefficient to remove non-significant pathways.
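
As a rough illustration of how pathway overlap can define the edges of such a network, the sketch below computes the overlap coefficient, |A ∩ B| / min(|A|, |B|), between pathway gene sets and keeps the pairs above a cutoff. The pathway names, gene identifiers, and the cutoff of 0.5 are hypothetical placeholders, not values taken from the figure, and the false-discovery rate filtering step is not shown.

```python
from itertools import combinations

def overlap_coefficient(a, b):
    """Overlap coefficient of two gene sets: |A & B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def build_edges(pathways, cutoff=0.5):
    """Return {(pathway_a, pathway_b): overlap} for pairs at or above the cutoff."""
    edges = {}
    for name_a, name_b in combinations(sorted(pathways), 2):
        score = overlap_coefficient(pathways[name_a], pathways[name_b])
        if score >= cutoff:
            edges[(name_a, name_b)] = score
    return edges

# Hypothetical toy pathways; the gene identifiers are placeholders.
pathways = {
    "pathway_A": {"g1", "g2", "g3", "g4"},
    "pathway_B": {"g3", "g4", "g5"},
    "pathway_C": {"g6", "g7"},
}

edges = build_edges(pathways)
for (name_a, name_b), score in edges.items():
    print(f"{name_a} -- {name_b}: {score:.2f}")
```

Normalizing by the smaller set means a small pathway fully contained in a larger one still receives the maximum score of 1, which a Jaccard index would not give.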

References

1. Ashley EA. Clinical assessment incorporating a personal genome. Lancet. 2010;375(9725):1525–1535. doi: 10.1016/S0140-6736(10)60452-7.
2. Ripatti S. A multilocus genetic risk score for coronary heart disease: case-control and prospective cohort analyses. Lancet. 2010;376(9750):1393–1400. doi: 10.1016/S0140-6736(10)61267-6.
3. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–678. doi: 10.1038/nature05911.
4. Donnelly P. Progress and challenges in genome-wide association studies in humans. Nature. 2008;456(7223):728–731. doi: 10.1038/nature07631.
5. Manolio TA. Genomewide association studies and assessment of the risk of disease. N Engl J Med. 2010;363(2):166–176. doi: 10.1056/NEJMra0905980.
