Reaching the End-Game for GWAS: Machine Learning Approaches for the Prioritization of Complex Disease Loci

Hannah L Nicholls^{1,2}, Christopher R John^{2,3}, David S Watson^{2,4}, Patricia B Munroe^{1,5}, Michael R Barnes^{1,2,5,6}, Claudia P Cabrera^{1,2,5}

Affiliations

¹ Clinical Pharmacology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom.
² Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom.
³ Centre for Experimental Medicine and Rheumatology, William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom.
⁴ Oxford Internet Institute, University of Oxford, Oxford, United Kingdom.
⁵ NIHR Barts Biomedical Research Centre, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom.
⁶ The Alan Turing Institute, British Library, London, United Kingdom.

PMID: 32351543
PMCID: PMC7174742
DOI: 10.3389/fgene.2020.00350

Review

Hannah L Nicholls et al. Front Genet. 2020 Apr 15:11:350. doi: 10.3389/fgene.2020.00350. eCollection 2020.

Abstract

Genome-wide association studies (GWAS) have revealed thousands of genetic loci that underpin the complex biology of many human traits. However, the strength of GWAS - the ability to detect genetic association by linkage disequilibrium (LD) - is also its limitation. Whilst ever-increasing study sizes and improved designs have augmented the power of GWAS to detect effects, differentiating causal variants or genes from other highly correlated genes associated by LD remains the real challenge. This has severely hindered the biological insights and clinical translation of GWAS findings. Although thousands of disease susceptibility loci have been reported, causal genes at these loci remain elusive. Machine learning (ML) techniques offer an opportunity to dissect the heterogeneity of variant and gene signals in the post-GWAS analysis phase. ML models for GWAS prioritization vary greatly in their complexity, ranging from relatively simple logistic regression approaches to more complex ensemble models such as random forests and gradient boosting, as well as deep learning models, i.e., neural networks. Paired with functional validation, these methods show important promise for clinical translation, providing a strong evidence-based approach to direct post-GWAS research. However, as ML approaches continue to evolve to meet the challenge of causal gene identification, a critical assessment of the underlying methodologies and their applicability to the GWAS prioritization problem is needed. This review investigates the landscape of ML applications in three parts: selected models, input features, and output model performance, with a focus on the prioritization of complex disease-associated loci. Overall, we explore the contributions ML has made towards reaching the GWAS end-game, with consequent wide-ranging translational impact.

Keywords: artificial intelligence; candidate gene; clinical translation; data science; deep learning; genome-wide association study; genomics; machine learning.

Figures

**FIGURE 1**
Supervised Machine Learning Algorithm Training. **(A)** Data containing labeled genes (e.g., genes labeled as causal or non-causal for blood pressure, BP) and columns of features describing those genes are input into a machine learning algorithm. The algorithm first initializes itself by applying its rules at random to a subset of the data (the training data) and its features. For example, an algorithm’s first practice iteration can involve assigning feature importance at random (importance denoted by the size of the feature image). The algorithm uses this feature initialization to classify genes as either affecting BP (red genes) or not affecting BP (blue genes). **(B)** The algorithm then uses the practice predictions to calculate a loss (an error rate) and iterates over the data again, applying the previous iteration’s loss calculation to adjust feature handling. By using the loss calculations, the algorithm aims to improve predictive performance with each training iteration.
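The training cycle described in the Figure 1 caption (random initialization, practice predictions, loss calculation, adjustment) can be illustrated with a minimal sketch. It uses logistic regression fitted by gradient descent, the simplest of the model types named in the abstract. This is not the authors' method: the six "genes", their two feature values, the labels, and every function name are hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical toy data: each row is one gene described by two made-up
# features scaled to [0, 1]. None of these values come from the review.
X = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9],   # labeled as affecting BP
     [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]   # labeled as not affecting BP
y = [1, 1, 1, 0, 0, 0]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, row):
    """Predicted probability that a gene belongs to the positive class."""
    return sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)


def mean_log_loss(weights, bias, X, y):
    """Average cross-entropy between predictions and labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for row, label in zip(X, y):
        p = predict_proba(weights, bias, row)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(X)


def train(X, y, epochs=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: feature weights start at random values.
    weights = [rng.uniform(-1.0, 1.0) for _ in X[0]]
    bias = 0.0
    losses = [mean_log_loss(weights, bias, X, y)]
    for _ in range(epochs):
        # Step 2: make practice predictions and accumulate the loss gradient.
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(X, y):
            error = predict_proba(weights, bias, row) - label
            for j, v in enumerate(row):
                grad_w[j] += error * v
            grad_b += error
        # Step 3: adjust feature handling using the averaged gradient.
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
        losses.append(mean_log_loss(weights, bias, X, y))
    return weights, bias, losses


weights, bias, losses = train(X, y)
predictions = [int(predict_proba(weights, bias, row) >= 0.5) for row in X]
print(f"loss before training: {losses[0]:.3f}, after: {losses[-1]:.3f}")
print("predicted labels:", predictions)
```

The recorded `losses` list makes the caption's point concrete: each pass reuses the previous error to update the weights, so the loss shrinks across iterations. Ensemble and deep learning models follow the same cycle with different update rules.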

**FIGURE 2**
Supervised Machine Learning Models. Diagram detailing three machine learning model bases used in supervised learning, each comprising algorithms commonly used in post-GWAS prioritization.
