PLoS One. 2011;6(10):e24982.

doi: 10.1371/journal.pone.0024982. Epub 2011 Oct 19.

SNPpy--database management for SNP data from genome wide association studies

Faheem Mitha¹, Herodotos Herodotou, Nedyalko Borisov, Chen Jiang, Josh Yoder, Kouros Owzar

PMID: 22039405
PMCID: PMC3198468
DOI: 10.1371/journal.pone.0024982

Faheem Mitha et al. PLoS One. 2011.

Affiliation

¹ Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina, United States of America. faheem@faheem.info

Abstract

Background: We describe SNPpy, a hybrid script-database system that uses the Python SQLAlchemy library coupled with the PostgreSQL database to manage genotype data from Genome-Wide Association Studies (GWAS). This system makes it possible to merge study data with HapMap data and to merge across studies for meta-analyses, including data filtering based on the values of phenotype and Single-Nucleotide Polymorphism (SNP) data. SNPpy and its dependencies are open-source software.

Results: The current version of SNPpy offers utility functions to import genotype and annotation data from two commercial platforms. We use these to import data from two GWAS studies and the HapMap Project. We then export these individual datasets to standard data format files that can be imported into statistical software for downstream analyses.

Conclusions: By leveraging the power of relational databases, SNPpy offers integrated management and manipulation of genotype and phenotype data from GWAS. The analysis of these studies requires merging across GWAS datasets as well as patient and marker selection. To this end, SNPpy enables the user to filter the data and output the results in standardized GWAS file formats. It performs low-level, flexible data validation, including validation of patient data. SNPpy is a practical and extensible solution for investigators who seek to deploy central management of their GWAS data.
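
The filter-and-export workflow described in the abstract can be illustrated with a minimal sketch. It is not SNPpy's code: it swaps in Python's built-in `sqlite3` module for SQLAlchemy and PostgreSQL so that it runs without a database server, and the table names, column names, and the `ped_lines` helper are hypothetical simplifications rather than the schema shown in Figure 2. The output follows the PLINK PED convention of six leading columns (family ID, individual ID, father, mother, sex, phenotype) followed by two alleles per SNP, with `0 0` for a missing call.

```python
import sqlite3

# Hypothetical, simplified tables; SNPpy's actual schema differs.
SCHEMA = """
CREATE TABLE patient (
    id        TEXT PRIMARY KEY,
    sex       INTEGER NOT NULL,   -- PED coding: 1 = male, 2 = female, 0 = unknown
    phenotype INTEGER NOT NULL    -- PED coding: 1 = unaffected, 2 = affected, 0 = missing
);
CREATE TABLE snp (
    id         TEXT PRIMARY KEY,
    chromosome INTEGER NOT NULL,
    position   INTEGER NOT NULL
);
CREATE TABLE genotype (
    patient_id TEXT NOT NULL REFERENCES patient(id),
    snp_id     TEXT NOT NULL REFERENCES snp(id),
    allele1    TEXT NOT NULL,
    allele2    TEXT NOT NULL,
    PRIMARY KEY (patient_id, snp_id)
);
"""


def ped_lines(conn, phenotype=None):
    """Yield one PED-format line per patient, optionally filtered by phenotype."""
    snps = [row[0] for row in conn.execute(
        "SELECT id FROM snp ORDER BY chromosome, position")]
    query = "SELECT id, sex, phenotype FROM patient"
    params = ()
    if phenotype is not None:
        query += " WHERE phenotype = ?"
        params = (phenotype,)
    query += " ORDER BY id"
    for pid, sex, pheno in conn.execute(query, params).fetchall():
        calls = {
            snp_id: (a1, a2)
            for snp_id, a1, a2 in conn.execute(
                "SELECT snp_id, allele1, allele2 FROM genotype "
                "WHERE patient_id = ?", (pid,))
        }
        # Family ID is set to the patient ID; parents are unknown ("0").
        fields = [pid, pid, "0", "0", str(sex), str(pheno)]
        for snp_id in snps:
            fields.extend(calls.get(snp_id, ("0", "0")))  # "0 0" = missing call
        yield " ".join(fields)


# Made-up toy data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO patient VALUES (?, ?, ?)",
                 [("P1", 1, 2), ("P2", 2, 1)])
conn.executemany("INSERT INTO snp VALUES (?, ?, ?)",
                 [("rs1", 1, 100), ("rs2", 1, 200)])
conn.executemany("INSERT INTO genotype VALUES (?, ?, ?, ?)",
                 [("P1", "rs1", "A", "G"), ("P1", "rs2", "C", "C"),
                  ("P2", "rs1", "A", "A")])  # P2 has no call for rs2

for line in ped_lines(conn):
    print(line)
```

Calling `ped_lines(conn, phenotype=2)` restricts the export to affected patients, which mirrors the kind of phenotype-based selection the abstract mentions. Keeping the filter in SQL rather than in Python is the design choice a relational back end makes natural.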

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 2. Database Schema.**
*Geno Single* database schema for the Affymetrix platform. In this diagram, the rectangles correspond to database tables, and the rows in each rectangle correspond to database table columns. The four columns in a row correspond to, from left to right, database name (column 1), data type (column 2), primary key indicator (column 3), and foreign key indicator (column 4). The arrows correspond to foreign keys. Observe that the number of arrows leaving a table is equal to the number of columns that are foreign keys in that table.

**Figure 3. Database Layout.**
Datasets for different platforms are stored in separate databases, here represented by cylinders. Every dataset is stored in a separate database schema (namespace within a database). The same dataset can be stored in multiple schemas, differing in what options have been selected when loading the dataset. To illustrate this, the figure shows the schemas in red and the datasets in black. Each of the datasets *HapMap 6* and *CEU HapMap 610* is stored in two schemas. For further details see the manual.

**Figure 4. Dataset load timings.**
Timings for loading simulated datasets for the Illumina platform into the database, for the *Geno Single* layout, and the *Geno Shard* layout with degree of parallelism and . For all these datasets, the number of SNPs is 620,901.

**Figure 5. PED file write timings.**
Timings for writing PED files from simulated datasets for the Illumina platform, for the *Geno Single* layout with degrees of parallelism and *Geno Shard* layout with degree of parallelism and . For all these datasets, the number of SNPs is 620,901. All timings correspond to warm cache.

**Figure 6. PED file merged write timings.**
Timing results for writing the PED file corresponding to the merger of the 2,000-patient Illumina simulated dataset with the corresponding HapMap datasets, compared to timings for writing the PED file for the 2,000-patient simulated dataset and for the HapMap dataset individually. All these timings are for the *Geno Shard* layout. For all these datasets, the number of SNPs is 620,901. All timings correspond to warm cache.

Grants and funding

U01GM061393/GM/NIGMS NIH HHS/United States
