PLoS One. 2011;6(10):e24982.
doi: 10.1371/journal.pone.0024982. Epub 2011 Oct 19.

SNPpy--database management for SNP data from genome wide association studies

Faheem Mitha et al. PLoS One. 2011.

Abstract

Background: We describe SNPpy, a hybrid script-database system that uses the Python SQLAlchemy library together with the PostgreSQL database to manage genotype data from Genome-Wide Association Studies (GWAS). The system makes it possible to merge study data with HapMap data and to merge across studies for meta-analyses, with data filtering based on the values of phenotype and Single-Nucleotide Polymorphism (SNP) data. SNPpy and its dependencies are open source software.
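As an illustration of what merging study data with HapMap data can involve, the following is a minimal sketch in plain Python. It is not SNPpy's code: the dictionary layout, the function name merge_datasets, and the choice to keep only SNPs typed in both datasets are assumptions made for this example.

```python
# Hypothetical in-memory datasets: {patient_id: {snp_name: genotype}}.
# SNPpy itself stores such data in PostgreSQL; plain dicts are used here
# only to keep the sketch self-contained.
study = {
    "P1": {"rs1": "AA", "rs2": "AG", "rs3": "GG"},
    "P2": {"rs1": "AG", "rs2": "GG", "rs3": "GG"},
}
hapmap = {
    "NA0001": {"rs1": "AA", "rs2": "AA", "rs4": "CT"},
}


def merge_datasets(a, b):
    """Merge two genotype datasets, keeping only SNPs present in both.

    Assumption: restricting to the shared SNPs is one common way to make
    datasets comparable; the paper's abstract does not state the rule
    SNPpy applies.
    """
    snps_a = set().union(*(genos.keys() for genos in a.values()))
    snps_b = set().union(*(genos.keys() for genos in b.values()))
    shared = sorted(snps_a & snps_b)

    duplicated = a.keys() & b.keys()
    if duplicated:
        raise ValueError(f"patient IDs occur in both datasets: {sorted(duplicated)}")

    merged = {}
    for dataset in (a, b):
        for patient_id, genos in dataset.items():
            # None marks a SNP that was not called for this patient.
            merged[patient_id] = {snp: genos.get(snp) for snp in shared}
    return shared, merged


shared_snps, merged = merge_datasets(study, hapmap)
```

Here shared_snps contains only rs1 and rs2, and merged holds all three individuals restricted to those two markers.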

Results: The current version of SNPpy offers utility functions to import genotype and annotation data from two commercial platforms. We use these to import data from two GWAS and the HapMap Project. We then export these individual datasets to standard-format data files that can be imported into statistical software for downstream analyses.

Conclusions: By leveraging the power of relational databases, SNPpy offers integrated management and manipulation of genotype and phenotype data from GWAS. The analysis of these studies requires merging across GWAS datasets as well as patient and marker selection. To this end, SNPpy enables the user to filter the data and output the results in standardized GWAS file formats. It also performs flexible, low-level data validation, including validation of patient data. SNPpy is a practical and extensible solution for investigators who seek to deploy central management of their GWAS data.
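The filter-then-export step described above can be sketched as follows. This is a hedged illustration, not SNPpy's implementation: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL and SQLAlchemy, and the table names (patient, snp, geno), column names, and the function ped_lines are all hypothetical. The output follows the PLINK PED layout: family ID, individual ID, paternal ID, maternal ID, sex, phenotype, then two alleles per SNP, with 0 marking a missing value.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patient (id TEXT PRIMARY KEY, sex INTEGER, phenotype INTEGER);
        CREATE TABLE snp (id INTEGER PRIMARY KEY, name TEXT,
                          chromosome TEXT, position INTEGER);
        CREATE TABLE geno (
            patient_id TEXT REFERENCES patient(id),
            snp_id INTEGER REFERENCES snp(id),
            allele_a TEXT, allele_b TEXT,
            PRIMARY KEY (patient_id, snp_id)
        );
    """)
    conn.executemany("INSERT INTO patient VALUES (?, ?, ?)",
                     [("P1", 1, 2), ("P2", 2, 1), ("P3", 1, 2)])
    conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)",
                     [(1, "rs1", "1", 1000), (2, "rs2", "1", 2000),
                      (3, "rs3", "2", 500)])
    conn.executemany("INSERT INTO geno VALUES (?, ?, ?, ?)", [
        ("P1", 1, "A", "A"), ("P1", 2, "A", "G"), ("P1", 3, "C", "C"),
        ("P2", 1, "A", "G"), ("P2", 2, "G", "G"), ("P2", 3, "C", "T"),
        ("P3", 1, "G", "G"), ("P3", 2, "A", "A"),  # P3 has no call for rs3
    ])
    return conn


def ped_lines(conn, phenotype=None, chromosome=None):
    """Return PED-format lines, optionally filtered by phenotype and chromosome."""
    snp_sql, snp_params = "SELECT id FROM snp", []
    if chromosome is not None:
        snp_sql += " WHERE chromosome = ?"
        snp_params.append(chromosome)
    snp_sql += " ORDER BY chromosome, position"
    snp_ids = [row[0] for row in conn.execute(snp_sql, snp_params)]

    pat_sql, pat_params = "SELECT id, sex, phenotype FROM patient", []
    if phenotype is not None:
        pat_sql += " WHERE phenotype = ?"
        pat_params.append(phenotype)
    pat_sql += " ORDER BY id"

    lines = []
    for pid, sex, pheno in conn.execute(pat_sql, pat_params).fetchall():
        calls = {
            snp_id: (a, b)
            for snp_id, a, b in conn.execute(
                "SELECT snp_id, allele_a, allele_b FROM geno WHERE patient_id = ?",
                (pid,),
            )
        }
        # Unrelated individuals: family ID = individual ID, unknown parents = 0.
        fields = [pid, pid, "0", "0", str(sex), str(pheno)]
        for snp_id in snp_ids:
            fields.extend(calls.get(snp_id, ("0", "0")))  # 0 0 = missing genotype
        lines.append(" ".join(fields))
    return lines


conn = build_demo_db()
cases_chr1 = ped_lines(conn, phenotype=2, chromosome="1")
everyone = ped_lines(conn)
```

Keeping the selection in SQL WHERE clauses means the patient and marker filters are applied by the database before any genotype rows are read, which is the kind of benefit a relational back end offers for this task.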

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Workflow Chart.
This figure shows the data workflow. First the genotypic and phenotypic data are loaded into the database. The data is then exported from the database as standard format files, including a possible filtering and/or merging step. Finally, the output files are further analyzed using third party tools.
Figure 2. Database Schema.
Geno Single database schema for the Affymetrix platform. In this diagram, the rectangles correspond to database tables, and the rows in each rectangle correspond to database table columns. The four columns in a row correspond to, from left to right, database name (column 1), data type (column 2), primary key indicator (column 3), and foreign key indicator (column 4). The arrows correspond to foreign keys. Observe the number of arrows leaving a table is equal to the number of columns that are foreign keys in that table.
Figure 3. Database Layout.
Datasets for different platforms are stored in separate databases, here represented by cylinders. Every dataset is stored in a separate database schema (namespace within a database). The same dataset can be stored in multiple schemas, differing in what options have been selected when loading the dataset. To illustrate this, the figure shows the schemas in red and the datasets in black. Each of the datasets HapMap 6 and CEU HapMap 610 is stored in two schemas. For further details see the manual.
Figure 4. Dataset load timings.
Timings for loading simulated datasets for the Illumina platform into the database, for the Geno Single layout and for the Geno Shard layout with degrees of parallelism [formula image] and [formula image]. For all these datasets, the number of SNPs is 620,901.
Figure 5. PED file write timings.
Timings for writing PED files from simulated datasets for the Illumina platform, for the Geno Single layout with degree of parallelism [formula image] and for the Geno Shard layout with degrees of parallelism [formula image] and [formula image]. For all these datasets, the number of SNPs is 620,901. All timings correspond to warm cache.
Figure 6. PED file merged write timings.
Timing results for writing the PED file corresponding to the merger of the 2,000-patient Illumina simulated dataset with the corresponding HapMap dataset, compared with the timings for writing the PED file for the 2,000-patient simulated dataset and for the HapMap dataset individually. All these timings are for the Geno Shard layout. For all these datasets, the number of SNPs is 620,901. All timings correspond to warm cache.
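
The caption of Figure 2 observes that the number of arrows leaving a table equals the number of foreign key columns in that table. A minimal sketch of checking that property programmatically is given below. It uses Python's built-in sqlite3 module rather than PostgreSQL, and the three tables are hypothetical stand-ins, not the Geno Single schema itself.

```python
import sqlite3

# Hypothetical three-table schema; the real Geno Single schema differs.
SCHEMA = """
    CREATE TABLE patient (id TEXT PRIMARY KEY, sex INTEGER);
    CREATE TABLE snp (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE geno (
        patient_id TEXT REFERENCES patient(id),
        snp_id INTEGER REFERENCES snp(id),
        value TEXT,
        PRIMARY KEY (patient_id, snp_id)
    );
"""


def foreign_key_columns(conn, table):
    """List the columns of `table` that are foreign keys.

    SQLite's PRAGMA foreign_key_list returns one row per foreign key
    column; the fourth field (index 3) is the referencing column name.
    """
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    return sorted(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

geno_fks = foreign_key_columns(conn, "geno")        # two outgoing arrows
patient_fks = foreign_key_columns(conn, "patient")  # no outgoing arrows
```

In a diagram of this toy schema, geno would have two arrows leaving it and patient none, matching the counts returned here.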

