Nat Commun. 2019 Jul 26;10(1):3341.
doi: 10.1038/s41467-019-11026-x.

A machine-compiled database of genome-wide association studies

Volodymyr Kuleshov et al. Nat Commun. 2019.

Abstract

Tens of thousands of genotype-phenotype associations have been discovered to date, yet not all of them are easily accessible to scientists. Here, we describe GWASkb, a machine-compiled knowledge base of genetic associations collected from the scientific literature using automated information extraction algorithms. Our information extraction system helps curators by automatically collecting over 6,000 associations from open-access publications with an estimated recall of 60-80% and with an estimated precision of 78-94% (measured relative to existing manually curated knowledge bases). This system represents a fully automated GWAS curation effort and is made possible by a paradigm for constructing machine learning systems called data programming. Our work represents a step towards making the curation of scientific literature more efficient using automated systems.
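
As an illustration of how precision and recall relative to a curated reference can be computed, the sketch below compares a set of extracted (variant, phenotype) pairs against a curated set. It is a minimal sketch, not the authors' evaluation code: the function name `precision_recall` and the toy pairs are hypothetical, and exact set matching is an assumption (a real evaluation would need to normalize phenotype names and handle correct associations that the curated resource simply lacks).

```python
def precision_recall(extracted, curated):
    """Return (precision, recall) of an extracted set against a curated reference set."""
    extracted, curated = set(extracted), set(curated)
    true_positives = len(extracted & curated)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Hypothetical toy data: (variant, phenotype) pairs, not taken from GWASkb.
extracted = {("rs1", "asthma"), ("rs2", "asthma"), ("rs3", "height"), ("rs4", "height")}
curated = {("rs1", "asthma"), ("rs2", "asthma"), ("rs3", "height"),
           ("rs5", "height"), ("rs6", "bmi")}

precision, recall = precision_recall(extracted, curated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under exact matching, any extracted pair that is absent from the curated set counts against precision even if it is correct, so this measure is conservative for a system that finds associations the curators missed.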

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
The automated information extraction system used to compile GWASkb. The GWASkb system takes as input a set of biomedical publications retrieved from PubMed Central (left) and automatically creates a structured database of GWAS associations described in these publications (right). For each association, the system identifies a genetic variant (purple), a high-level phenotype (pertaining to all variants in the publication), a detailed low-level phenotype (specific to individual variants, if available; red), and a p value (orange). Acronyms are also resolved (red)
Fig. 2
General structure of a GWASkb system module. The system contains separate modules for extracting variants, phenotypes, and p values, and for resolving acronyms. Each module consists of three stages. At the parsing stage, we process papers using the Stanford CoreNLP pipeline, performing full syntactic parsing. Next, given a target relation (e.g., variant-phenotype), we generate a large set of candidates, some of which could be correct instances of the target object or relation. Then, at the classification stage, we determine which candidates are correct using a machine learning classifier
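
The candidate-generation stage described in this caption can be sketched for a variant/p-value relation as follows. This is a minimal sketch under stated assumptions, not the GWASkb implementation: the regular expressions `RSID` and `PVAL`, the function `generate_candidates`, and the example sentences are all hypothetical, and sentence splitting is assumed to have been done already by the parsing stage. The sketch deliberately over-generates by pairing every rsID with every p value in the same sentence, leaving it to a downstream classifier to decide which pairs are correct.

```python
import re
from itertools import product

# Assumed patterns: dbSNP-style rsIDs and p values such as "3.2 x 10-8" or "5e-9".
RSID = re.compile(r"\brs\d+\b")
PVAL = re.compile(r"(\d+(?:\.\d+)?)\s*(?:[x×]\s*10\^?|[eE])\s*[-−](\d+)")


def generate_candidates(sentences):
    """Pair every rsID with every p value found in the same sentence."""
    candidates = []
    for index, sentence in enumerate(sentences):
        rsids = RSID.findall(sentence)
        pvalues = [
            float(match.group(1)) * 10 ** -int(match.group(2))
            for match in PVAL.finditer(sentence)
        ]
        for rsid, pvalue in product(rsids, pvalues):
            candidates.append((index, rsid, pvalue))
    return candidates


# Hypothetical sentences standing in for parsed paper text.
sentences = [
    "The variant rs12345 was associated with asthma (p = 3.2 x 10-8).",
    "We genotyped 5000 individuals.",
    "Both rs111 and rs222 reached significance at 5e-9.",
]

candidates = generate_candidates(sentences)
for candidate in candidates:
    print(candidate)
```

Generating candidates liberally and filtering them afterwards keeps recall high at this stage; precision is then the job of the classification stage.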
Fig. 3
Linkage disequilibrium between GWASkb variants not present in existing human-curated databases and variants from the GWAS Catalog. We use the 1000 Genomes dataset to estimate the r² metric between pairs of variants, and report distances from each GWASkb variant to the most correlated GWAS Catalog SNP reported in the same paper. The distribution of r² scores is highly multimodal; many GWASkb variants are uncorrelated (r² = 0) with GWAS Catalog SNPs. Reported p values are generated from a χ² test
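
To make the r² comparison concrete, the sketch below finds the most correlated reference SNP for a query variant. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) per individual and uses the squared Pearson correlation of those counts, which is a common approximation of r² when phased haplotypes are not available; the caption does not say which estimator the authors used. The function names and the toy genotype vectors are hypothetical.

```python
def r_squared(genotypes_a, genotypes_b):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts)."""
    n = len(genotypes_a)
    if n == 0 or n != len(genotypes_b):
        raise ValueError("genotype vectors must be non-empty and of equal length")
    mean_a = sum(genotypes_a) / n
    mean_b = sum(genotypes_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(genotypes_a, genotypes_b))
    var_a = sum((a - mean_a) ** 2 for a in genotypes_a)
    var_b = sum((b - mean_b) ** 2 for b in genotypes_b)
    if var_a == 0 or var_b == 0:
        # A monomorphic variant carries no correlation signal; report 0 by convention.
        return 0.0
    return cov * cov / (var_a * var_b)


def most_correlated(query, reference):
    """Return (name, r2) of the reference variant with the highest r2 to the query."""
    return max(
        ((name, r_squared(query, genotypes)) for name, genotypes in reference.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical genotypes for six individuals.
query = [0, 1, 2, 1, 0, 2]
reference = {
    "snp_a": [0, 1, 2, 1, 0, 2],  # identical to the query
    "snp_b": [1, 1, 1, 1, 1, 1],  # monomorphic
    "snp_c": [0, 0, 2, 1, 1, 2],  # partially correlated
}

best = most_correlated(query, reference)
print(best)
```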
Fig. 4
Visualizing the effect sizes of variants identified in GWASkb. Top: We compare the distribution of effect sizes (absolute values of beta coefficients or log odds ratios; data from LD Hub) of variants identified in GWASkb (blue) to that of all variants (green) for multiple traits. Blue variant effect sizes cluster away from zero and follow a different distribution (Kolmogorov–Smirnov test). In the boxplots, center lines represent medians, the box boundaries span the interquartile range, and the whiskers extend to the minimum and maximum observations excluding statistical outliers. Bottom: We subsample 1000 random sets of variants with the same number of elements as the set of GWASkb SNPs for a given disease; the average effect size of GWASkb variants (red) is higher than that of the random subsets (blue). In all settings, we only look at novel GWASkb variants not present in existing human-curated repositories
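
The subsampling comparison in the bottom panel can be sketched as a simple resampling procedure: draw 1000 random subsets of the same size as the GWASkb set and compare their mean effect sizes with the observed mean. This is a minimal sketch with hypothetical names (`subsample_means`, `fraction_at_least`) and made-up effect sizes; sampling without replacement and the fixed seed are assumptions, not details stated in the caption.

```python
import random


def subsample_means(all_effects, subset_size, n_draws=1000, seed=0):
    """Mean absolute effect size of n_draws random subsets drawn without replacement."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_draws):
        draw = rng.sample(all_effects, subset_size)
        means.append(sum(draw) / subset_size)
    return means


def fraction_at_least(observed_mean, null_means):
    """Share of random subsets whose mean is at least as large as the observed mean."""
    return sum(mean >= observed_mean for mean in null_means) / len(null_means)


# Hypothetical absolute effect sizes for all variants of one trait.
all_effects = [0.01 * i for i in range(100)]
# Hypothetical effect sizes of the variants found by the extraction system.
kb_effects = [0.90, 0.93, 0.95, 0.97, 0.99]

observed_mean = sum(kb_effects) / len(kb_effects)
null_means = subsample_means(all_effects, len(kb_effects))
fraction = fraction_at_least(observed_mean, null_means)
print(f"observed mean={observed_mean:.3f}, fraction of random subsets at least as large={fraction:.3f}")
```

A small fraction indicates that the observed mean would be unusual for a randomly chosen set of variants of the same size.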
