A machine-compiled database of genome-wide association studies

Volodymyr Kuleshov¹,², Jialin Ding³, Christopher Vo³, Braden Hancock³, Alexander Ratner³, Yang Li⁴, Christopher Ré³, Serafim Batzoglou³, Michael Snyder⁵

Affiliations

¹ Department of Computer Science, Stanford University, Stanford, CA, 94305, USA. kuleshov@cs.stanford.edu.
² Department of Genetics, Stanford University School of Medicine, Stanford, CA, 94305, USA. kuleshov@cs.stanford.edu.
³ Department of Computer Science, Stanford University, Stanford, CA, 94305, USA.
⁴ Department of Medicine, University of Chicago, Chicago, IL, 60637, USA.
⁵ Department of Genetics, Stanford University School of Medicine, Stanford, CA, 94305, USA.

PMID: 31350405
PMCID: PMC6659642
DOI: 10.1038/s41467-019-11026-x

Volodymyr Kuleshov et al. Nat Commun. 2019 Jul 26;10(1):3341. doi: 10.1038/s41467-019-11026-x.

Abstract

Tens of thousands of genotype-phenotype associations have been discovered to date, yet not all of them are easily accessible to scientists. Here, we describe GWASkb, a machine-compiled knowledge base of genetic associations collected from the scientific literature using automated information extraction algorithms. Our information extraction system helps curators by automatically collecting over 6,000 associations from open-access publications with an estimated recall of 60-80% and with an estimated precision of 78-94% (measured relative to existing manually curated knowledge bases). This system represents a fully automated GWAS curation effort and is made possible by a paradigm for constructing machine learning systems called data programming. Our work represents a step towards making the curation of scientific literature more efficient using automated systems.
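
To make the reported figures concrete, the following is a minimal sketch of how precision and recall can be computed for a set of extracted associations against a manually curated reference set. The function name `precision_recall` and the toy (variant, phenotype) pairs are hypothetical and are not taken from GWASkb; the paper's own evaluation protocol may differ. Note that under exact matching, an extracted association that is absent from the reference counts against precision even if it is a genuine finding the curators missed.

```python
def precision_recall(extracted, reference):
    """Precision and recall of extracted pairs against a curated reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positives = extracted & reference
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    return precision, recall


# Toy, made-up (variant, phenotype) pairs for illustration only.
reference = {("rs1", "asthma"), ("rs2", "asthma"), ("rs3", "height"), ("rs4", "height")}
extracted = {("rs1", "asthma"), ("rs2", "asthma"), ("rs3", "height"), ("rs9", "height")}

precision, recall = precision_recall(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```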

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1**
The automated information extraction system used to compile GWASkb. The GWASkb system takes as input a set of biomedical publications retrieved from PubMed Central (left) and automatically creates a structured database of GWAS associations described in these publications (right). For each association, the system identifies a genetic variant (purple), a high-level phenotype (pertaining to all variants in the publication), a detailed low-level phenotype (specific to individual variants, if available; red), and a p value (orange). Acronyms are also resolved (red).

**Fig. 2**
General structure of a GWASkb system module. The system contains separate modules for extracting variants, phenotypes, and p values, and for resolving acronyms. Each module consists of three stages. At the parsing stage, we process papers using the Stanford CoreNLP pipeline, performing full syntactic parsing. Next, given a target relation (e.g., variant-phenotype), we generate a large set of candidates, some of which could be correct instances of the target object or relation. Then, at the classification stage, we determine which candidates are correct using a machine learning classifier.
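
The candidate-generation and classification stages can be illustrated with a small sketch. Everything here is an assumption made for illustration: the regular expressions, the function names `generate_candidates` and `classify`, and the example sentence are hypothetical, the parsing stage (Stanford CoreNLP) is skipped, and a hand-written rule stands in for the machine learning classifier that the actual system uses.

```python
import re
from itertools import product

# Hypothetical patterns: dbSNP-style rsIDs and p values written like "5e-9".
RSID = re.compile(r"\brs\d+\b")
PVALUE = re.compile(r"\b\d+(?:\.\d+)?[eE]-\d+\b")


def generate_candidates(sentence):
    """Over-generate: pair every rsID with every p value in the sentence."""
    return list(product(RSID.findall(sentence), PVALUE.findall(sentence)))


def classify(sentence, candidate):
    """Rule-based stand-in for a learned classifier.

    Accept a pair only if the p value follows the variant and no other
    variant is mentioned between the two.
    """
    variant, pvalue = candidate
    start, end = sentence.find(variant), sentence.find(pvalue)
    if end < start:
        return False
    between = sentence[start + len(variant):end]
    return RSID.search(between) is None


sentence = (
    "rs123 was associated with asthma (p = 5e-9), "
    "whereas rs456 showed a weaker signal (p = 2e-3)."
)
candidates = generate_candidates(sentence)
accepted = [c for c in candidates if classify(sentence, c)]
print(len(candidates), accepted)
```

The point of the two-stage design is that candidate generation can afford to be permissive, because the classification stage is responsible for discarding incorrect pairs.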

**Fig. 3**
Linkage disequilibrium between GWASkb variants not present in existing human-curated databases and variants from the GWAS Catalog. We use the 1000 Genomes dataset to estimate the r² metric between pairs of variants, and report distances from each GWASkb variant to the most correlated GWAS Catalog SNP reported in the same paper. The distribution of r² scores is highly multimodal; many GWASkb variants are uncorrelated (r² = 0) with GWAS Catalog SNPs. Reported p values are generated from a χ² test.
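
For readers unfamiliar with the r² metric, the sketch below computes it for two biallelic variants from phased haplotypes, using the standard definition r² = D² / (p_A(1 − p_A) p_B(1 − p_B)) with D = p_AB − p_A p_B. The function names `ld_r2` and `max_r2` and the toy haplotypes are hypothetical; the paper's own pipeline for estimating r² from 1000 Genomes data is not described here and may differ.

```python
def ld_r2(hap_a, hap_b):
    """r^2 between two biallelic variants from 0/1 alleles on the same haplotypes."""
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # A monomorphic variant has undefined r^2; this sketch returns 0.0.
        return 0.0
    d = p_ab - p_a * p_b
    return d * d / denom


def max_r2(target, catalog):
    """Highest r^2 between one variant and any variant in a reference list."""
    return max(ld_r2(target, other) for other in catalog)


# Toy haplotypes (one entry per chromosome copy), made up for illustration.
new_variant = [0, 1, 0, 1]
catalog_variants = [[0, 0, 1, 1], [0, 1, 0, 1]]
print(max_r2(new_variant, catalog_variants))
```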

**Fig. 4**
Visualizing the effect sizes of variants identified in GWASkb. *Top:* We compare the distribution of effect sizes (absolute values of beta coefficients or log odds ratios; data from LD Hub) of variants identified in GWASkb (blue) to that of all variants (green) for multiple traits. Blue variant effect sizes cluster away from zero and follow a different distribution (Kolmogorov–Smirnov test). In the boxplots, center lines represent medians, the box boundaries span the interquartile range, and the whiskers extend to the minimum and maximum observations excluding statistical outliers. *Bottom:* We subsample 1000 random sets of variants with the same number of elements as the set of GWASkb SNPs for a given disease; the average effect size of GWASkb variants (red) is higher than that of the random subsets (blue). In all settings, we only look at novel GWASkb variants not present in existing human-curated repositories.
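
The subsampling comparison in the bottom panel can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names `subsample_means` and `empirical_p`, the fixed seed, and the toy effect sizes are hypothetical, and the empirical p value is one common way to summarize such a comparison rather than something reported in the caption.

```python
import random
from statistics import mean


def subsample_means(all_effects, subset_size, n_draws=1000, seed=0):
    """Mean effect size of n_draws random subsets, each of size subset_size."""
    rng = random.Random(seed)
    return [mean(rng.sample(all_effects, subset_size)) for _ in range(n_draws)]


def empirical_p(observed_mean, null_means):
    """Fraction of random subsets whose mean is at least the observed mean."""
    exceed = sum(m >= observed_mean for m in null_means)
    return (exceed + 1) / (len(null_means) + 1)


# Toy absolute effect sizes for all variants of one trait (made up).
all_effects = [i / 100 for i in range(100)]
# Toy effect sizes for the variants found by the extraction system (made up).
kb_effects = [0.95, 0.96, 0.97, 0.98, 0.99]

null_means = subsample_means(all_effects, subset_size=len(kb_effects))
observed = mean(kb_effects)
print(observed, mean(null_means), empirical_p(observed, null_means))
```

Matching the subset size to the number of GWASkb variants matters because the spread of a subset mean depends on how many variants are averaged.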
