Viability of in-house datamarting approaches for population genetics analysis of SNP genotypes

Jorge Amigo¹, Christopher Phillips, Antonio Salas, Angel Carracedo

PMID: 19344481
PMCID: PMC2665053
DOI: 10.1186/1471-2105-10-S3-S5

Jorge Amigo et al. BMC Bioinformatics. 2009 Mar 19;10 Suppl 3(Suppl 3):S5. doi: 10.1186/1471-2105-10-S3-S5.

Affiliation

¹ Spanish National Genotyping Center (CeGen), Genomic Medicine Group, CIBERER, University of Santiago de Compostela, Galicia, Spain. jorge.amigo@usc.es


Abstract

Background: Databases containing very large amounts of SNP (Single Nucleotide Polymorphism) data are now freely available for researchers interested in medical and/or population genetics applications. While many of these SNP repositories have implemented data retrieval tools for general-purpose mining, these alone cannot cover the broad spectrum of needs of most medical and population genetics studies.

Results: To address this limitation, we have built in-house customized data marts from the raw data provided by the largest public databases. In particular, for population genetics analysis based on genotypes, we have built a set of data processing scripts that take raw data from the major SNP variation databases (e.g. HapMap, Perlegen), split them into single genotypes, group these genotypes by population, and merge them with complementary descriptive information extracted from dbSNP. This allows not only in-house standardization and normalization of the genotyping data retrieved from different repositories, but also the calculation of statistical indices, ranging from simple allele frequency estimates to more elaborate genetic differentiation tests within populations, together with the ability to combine population samples from different databases.
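The grouping and allele frequency step described above can be illustrated with a minimal Python sketch. It is not the authors' script: the record layout (SNP identifier, sample identifier, two-character genotype), the "NN" code for a missing call, and the function name `allele_frequencies` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def allele_frequencies(genotypes, sample_population):
    """Return {snp_id: {population: {allele: frequency}}}.

    genotypes: iterable of (snp_id, sample_id, genotype) tuples, where
        genotype is a two-character string such as "AG" (assumed layout);
        "NN" is assumed to mark a missing call and is skipped.
    sample_population: dict mapping sample_id -> population label.
    """
    # counts[snp_id][population] is a Counter of observed alleles.
    counts = defaultdict(lambda: defaultdict(Counter))
    for snp_id, sample_id, genotype in genotypes:
        if genotype == "NN":
            continue
        population = sample_population[sample_id]
        # Updating a Counter with a string counts each character,
        # so "AG" adds one A and one G, and "AA" adds two A alleles.
        counts[snp_id][population].update(genotype)

    frequencies = {}
    for snp_id, by_population in counts.items():
        frequencies[snp_id] = {}
        for population, allele_counts in by_population.items():
            total = sum(allele_counts.values())
            frequencies[snp_id][population] = {
                allele: n / total for allele, n in allele_counts.items()
            }
    return frequencies
```

Keeping the raw counts per SNP and population, rather than only the frequencies, would also make it possible to pool samples from different databases by summing counts before dividing.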

Conclusion: The present study demonstrates the viability of implementing scripts that handle extensive datasets of SNP genotypes at low computational cost, while dealing with certain complex issues that arise from the divergent nature and configuration of the most popular SNP repositories. The information contained in these databases can also be enriched with additional information obtained from other complementary databases in order to build a dedicated data mart. Updating the data structure is straightforward, and the design permits easy incorporation of new external data and the computation of supplementary statistical indices of interest.
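As an example of a supplementary index that could be computed from stored allele frequencies, the sketch below evaluates Wright's F_ST for a single biallelic SNP in two populations. The abstract does not state which differentiation statistics the authors' scripts implement, so this is only an illustration; the function names and the equal weighting of the two populations are assumptions.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) for a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(p1, p2):
    """Wright's F_ST = (H_T - H_S) / H_T for one biallelic SNP.

    p1, p2: frequency of the same allele in two populations.
    The two populations are weighted equally (an assumption of this sketch).
    """
    p_mean = (p1 + p2) / 2.0
    h_total = expected_heterozygosity(p_mean)
    if h_total == 0.0:
        # The SNP is fixed for the same allele in both populations,
        # so there is no variation to partition.
        return 0.0
    h_within = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
    return (h_total - h_within) / h_total
```

With this definition, identical frequencies give 0 and alternate fixed alleles give 1.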


Figures

**Figure 1**
**Number of samples present in the data mart**. There are 3045 samples represented in our repository. The number of samples per database varies from the most ambitious projects, such as HapMap Phase III and the Stanford HGDP, which contain over 1000 samples each, to others that represent less variation, such as Perlegen, with only 71 samples.

**Figure 2**
**Number of SNPs present in the data mart**. Around 8.5 × 10⁶ SNPs are processed from the different databases, although these SNPs are not independent. Taking into account the sharing of SNP codes presented in Table 1, where the HapMap Phase II database is the major SNP contributor, the number of distinct SNPs represented in the data mart is close to 4.5 × 10⁶.

**Figure 3**
**Number of genotypes present in the data mart**. More than 4 × 10⁹ genotypes in total are summarized in our data mart. Although Perlegen contains few samples, its SNP coverage is high, which makes this database, along with both HapMap phases, one of the major genotype contributors, with over 10⁹ genotypes each.

**Figure 4**
**Data mart tables for the HapMap Phase III database**. Each summarized database is present in the data mart as a set of tables containing descriptive SNP information and population-specific calculations. Every database has all the table structures expanded at the top of the image, while the number of population-specific tables, shown with the "__pop__" label, depends on the number of populations covered by the database. Only the CEU population table structure has been expanded in the image, but the rest of the population tables share the same structure, which holds, for each SNP in each population, all the available counts and calculations performed by the raw data processing script.

**Figure 5**
**Memory needed to process each database**. The memory required to process each database depends not only on its number of samples and SNPs, but also on the structure of its raw data files. Although 1 GB of memory was enough for most of the databases, the Stanford data needed more because of its high population coverage. Because its single per-chromosome files contain so many samples and represent so many populations, the processing script had to store a large amount of indexed information, which demanded high computational resources. The optimized design of the variables, along with the strict memory handling of the script, minimized this issue, so that no more than 2 GB was ever required.

**Figure 6**
**Database processing times**. Cumulative time is presented: processing all the available databases takes 12 hours. However, each task is independent of the others and can therefore be run in parallel, in which case the maximum time would be the 4 hours needed to process the Michigan data.
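The relation between cumulative and parallel processing time in Figure 6 can be sketched as a small scheduling estimate. The function below is an illustration, not part of the authors' pipeline: `parallel_wall_time` is a hypothetical name, and the greedy longest-task-first assignment is one simple way to estimate wall time for a fixed number of workers.

```python
import heapq


def parallel_wall_time(durations, workers):
    """Estimate wall time when independent tasks are spread over `workers` workers.

    Tasks are assigned longest first to the currently least loaded worker
    (a greedy heuristic, not necessarily the optimal schedule).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # One running total per worker; never more workers than tasks.
    loads = [0.0] * min(workers, max(len(durations), 1))
    heapq.heapify(loads)
    for duration in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return max(loads)


# Hypothetical per-database durations in hours. The caption only gives the
# 12 h cumulative total and the 4 h maximum; the split below is invented.
example_hours = [4, 3, 2, 2, 1]
sequential = parallel_wall_time(example_hours, workers=1)       # sum of all tasks
fully_parallel = parallel_wall_time(example_hours, workers=5)   # longest single task
```

With one worker the estimate equals the cumulative time, and with at least one worker per task it equals the longest single task, which matches the reasoning in the caption.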


