BMC Bioinformatics. 2009 Mar 19;10 Suppl 3(Suppl 3):S5.
doi: 10.1186/1471-2105-10-S3-S5.

Viability of in-house datamarting approaches for population genetics analysis of SNP genotypes

Jorge Amigo et al. BMC Bioinformatics.

Abstract

Background: Databases containing very large amounts of SNP (Single Nucleotide Polymorphism) data are now freely available for researchers interested in medical and/or population genetics applications. While many of these SNP repositories have implemented data retrieval tools for general-purpose mining, these alone cannot cover the broad spectrum of needs of most medical and population genetics studies.

Results: To address this limitation, we have built in-house customized data marts from the raw data provided by the largest public databases. In particular, for population genetics analysis based on genotypes, we have built a set of data processing scripts that handle raw data from the major SNP variation databases (e.g. HapMap, Perlegen). The scripts split the raw data into single genotypes, group the genotypes into populations, and merge them with complementary descriptive information extracted from dbSNP. This allows not only in-house standardization and normalization of the genotyping data retrieved from different repositories, but also the calculation of statistical indices, from simple allele frequency estimates to more elaborate genetic differentiation tests within populations, together with the ability to combine population samples from different databases.
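As a rough illustration of the split, group, and count steps described above, the following minimal sketch estimates per-population allele frequencies from single genotypes. The record layout (`sample_id`, `snp_id`, two-letter genotype, `N` for a missing call) and the function name `allele_frequencies` are assumptions made for this example; the actual raw file formats of HapMap and Perlegen differ, and this is not the authors' script.

```python
from collections import Counter, defaultdict


def allele_frequencies(genotypes, sample_population):
    """Estimate allele frequencies per (population, SNP).

    genotypes: iterable of (sample_id, snp_id, genotype) tuples, where
        genotype is a two-letter string such as "AG" (hypothetical layout).
    sample_population: dict mapping sample_id to a population label.
    """
    counts = defaultdict(Counter)
    for sample_id, snp_id, genotype in genotypes:
        population = sample_population.get(sample_id)
        # Skip samples without a population and missing or malformed calls.
        if population is None or len(genotype) != 2 or "N" in genotype:
            continue
        # Each genotype contributes its two alleles to the population count.
        counts[(population, snp_id)].update(genotype)

    frequencies = {}
    for key, allele_counts in counts.items():
        total = sum(allele_counts.values())
        frequencies[key] = {a: n / total for a, n in allele_counts.items()}
    return frequencies


# Toy input, invented for illustration only.
records = [
    ("s1", "rs1", "AG"),
    ("s2", "rs1", "AA"),
    ("s3", "rs1", "GG"),
    ("s4", "rs1", "NN"),  # missing call, ignored
]
populations = {"s1": "CEU", "s2": "CEU", "s3": "YRI", "s4": "YRI"}
result = allele_frequencies(records, populations)
```

Keying the counts by `(population, snp_id)` means samples from different source databases can be pooled simply by giving them the same population label.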

Conclusion: The present study demonstrates the viability of implementing scripts for handling extensive datasets of SNP genotypes with low computational costs, dealing with certain complex issues that arise from the divergent nature and configuration of the most popular SNP repositories. The information contained in these databases can also be enriched with additional information obtained from other complementary databases, in order to build a dedicated data mart. Updating the data structure is straightforward, as well as permitting easy implementation of new external data and the computation of supplementary statistical indices of interest.

Figures

Figure 1
Number of samples present in the data mart. There are 3045 samples represented in our repository. The number of samples per database varies from the most ambitious ones, such as HapMap Phase III and the Stanford HGDP, which contain over 1000 samples each, to others with less variation representation, such as Perlegen, with only 71 samples.
Figure 2
Number of SNPs present in the data mart. Around 8.5 × 10^6 SNPs are processed from the different databases, although these SNPs are not independent. Considering the sharing of SNP codes presented in Table 1, where the HapMap Phase II database is the major SNP contributor, the number of distinct SNPs represented in the data mart is close to 4.5 × 10^6.
Figure 3
Number of genotypes present in the data mart. More than 4 × 10^9 genotypes in total are summarized in our data mart. Although the number of samples in Perlegen is not very high, its SNP coverage is, which makes this database, along with both HapMap phases, one of the major genotyping contributors, with over 10^9 genotypes each.
Figure 4
Data mart tables for the HapMap Phase III database. Each summarized database is present in the data mart as a set of tables containing descriptive SNP information and population-specific calculations. Every database has all the table structures expanded at the top of the image, and the number of population-specific tables, shown with the "__pop__" label, depends on the number of populations covered by the database. Only the CEU population table structure has been expanded in the image, but the rest of the population tables share the same structure, which allows each SNP in a population to be filled with all the available counts and calculations performed by the raw data processing script.
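The one-table-per-population layout described in this caption could be sketched as follows. This is only an illustration under stated assumptions: the source does not name the database engine, so `sqlite3` is swapped in for convenience, and the table naming scheme, the column names, and the helper `create_population_tables` are all hypothetical rather than taken from the figure.

```python
import re
import sqlite3

# Table names cannot be bound as SQL parameters, so validate them first.
IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")


def create_population_tables(conn, database, populations):
    """Create one identically structured table per population.

    Returns the list of table names (hypothetical naming scheme:
    "<database>_<population>").
    """
    names = []
    for population in populations:
        name = f"{database}_{population}"
        if not IDENTIFIER.match(name):
            raise ValueError(f"unsafe table name: {name!r}")
        # Hypothetical columns holding per-SNP counts and calculations.
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {name} ("
            "snp_id TEXT PRIMARY KEY, "
            "allele1 TEXT, allele2 TEXT, "
            "allele1_count INTEGER, allele2_count INTEGER, "
            "allele1_freq REAL)"
        )
        names.append(name)
    return names


conn = sqlite3.connect(":memory:")
tables = create_population_tables(conn, "hapmap3", ["CEU", "YRI"])
```

Because every population table shares one structure, adding a population only requires another entry in the `populations` list.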
Figure 5
Memory needed to process each database. The memory required to deal with different databases depends not only on their number of samples and SNPs, but also on the structure of the raw data files. Although 1 GB of memory or less was enough for most of the databases, the Stanford data needed more due to its high population coverage. Because its files hold so many samples and so many populations in a single file per chromosome, the processing script had to store a large amount of indexed information, which demanded high computational resources. The optimized design of the variables, along with the strict memory handling of the script, minimized this issue, so that no database ever required more than 2 GB.
Figure 6
Databases' processing times. Cumulative time is presented: dealing with all the available databases takes 12 hours, although each task is independent of the others and can therefore be run in parallel. The maximum time would then be the 4 hours that the Michigan data needs to be processed.
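The arithmetic behind this caption is that independent tasks run one after another cost the sum of their durations, while running them all at once costs only the longest one. The sketch below illustrates this. Only the 12-hour total and the 4 hours for the Michigan data come from the caption; the other per-database durations, the `db_*` names, and both function names are placeholders invented for this example, and the limited-worker case uses a simple greedy assignment that gives an estimate rather than a guaranteed optimum.

```python
import heapq


def sequential_hours(durations):
    """Total time when databases are processed one after another."""
    return sum(durations.values())


def parallel_hours(durations, workers):
    """Estimated finishing time with a fixed number of parallel workers.

    Greedy rule: take tasks from longest to shortest and give each to
    the worker that is currently least loaded.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if not durations:
        return 0.0
    loads = [0.0] * min(workers, len(durations))
    heapq.heapify(loads)
    for hours in sorted(durations.values(), reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + hours)
    return max(loads)


# Placeholder durations in hours; only "michigan" (4 h) and the
# 12 h total are taken from the caption.
durations = {"michigan": 4.0, "db_a": 3.0, "db_b": 2.5, "db_c": 1.5, "db_d": 1.0}

total = sequential_hours(durations)           # 12.0
fully_parallel = parallel_hours(durations, workers=len(durations))  # 4.0
two_workers = parallel_hours(durations, workers=2)
```

With one worker per database, the finishing time equals the longest single task, which matches the 4-hour bound stated in the caption.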

