Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel

Olivier Delaneau¹, Jonathan Marchini²; 1000 Genomes Project Consortium

PMID: 25653097
PMCID: PMC4338501
DOI: 10.1038/ncomms4934

Nat Commun. 2014 Jun 13:5:3934.

Abstract

A major use of the 1000 Genomes Project (1000 GP) data is genotype imputation in genome-wide association studies (GWAS). Here we develop a method to estimate haplotypes from low-coverage sequencing data that can take advantage of single-nucleotide polymorphism (SNP) microarray genotypes on the same samples. First, the SNP array data are phased to build a backbone (or 'scaffold') of haplotypes across each chromosome. We then phase the sequence data 'onto' this haplotype scaffold. This approach can take advantage of relatedness between sequenced and non-sequenced samples to improve accuracy. We use this method to create a new 1000 GP haplotype reference set for use by the human genetic community. Using a set of validation genotypes at SNPs and bi-allelic indels, we show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low-frequency variants.

Figures

**Figure 1. Methods comparison of genotype discordance and imputation accuracy using the CG1 data**
Panel (a) shows the discordance at chr20 CG1 SNP genotypes for Beagle (green), Thunder (orange) and SHAPEIT2 without a scaffold (light blue), with a 1M-SNP haplotype scaffold (medium blue) and with a 2.5M-SNP haplotype scaffold (dark blue). ALT denotes the discordance at genotypes involving at least one non-reference allele, and ALL the overall discordance. Panel (b) shows the performance of these call sets when used as reference panels to impute 4 CG1 European samples genotyped on the Illumina 1M SNP array. The x-axis shows the non-reference allele frequency of the SNP being imputed. The y-axis shows imputation accuracy measured by aggregate R².
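
The ALL and ALT discordance measures in panel (a) can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the caption: genotypes are coded as non-reference allele counts (0, 1 or 2), missing genotypes are `None`, and ALT is restricted to sites where the validation genotype carries a non-reference allele. The function name `genotype_discordance` is hypothetical and is not part of any of the tools compared here.

```python
def genotype_discordance(called, truth):
    """Return (ALL, ALT) discordance between called and validation genotypes.

    Genotypes are non-reference allele counts (0, 1, 2); None marks a
    missing genotype and the pair is skipped.
    """
    pairs = [(c, t) for c, t in zip(called, truth)
             if c is not None and t is not None]
    # Assumed ALT definition: validation genotype has >= 1 non-reference allele.
    alt_pairs = [(c, t) for c, t in pairs if t > 0]

    def rate(ps):
        # Fraction of genotype pairs that disagree; NaN if nothing to compare.
        return sum(c != t for c, t in ps) / len(ps) if ps else float("nan")

    return rate(pairs), rate(alt_pairs)


# Toy data: two of eight calls disagree with the validation genotypes.
called = [0, 0, 1, 2, 1, 0, 0, 1]
truth = [0, 0, 1, 2, 2, 0, 1, 1]
all_disc, alt_disc = genotype_discordance(called, truth)
print(f"ALL = {all_disc:.3f}, ALT = {alt_disc:.3f}")
```

Because most genotypes at low-frequency sites are homozygous reference and easy to call, the ALT rate is usually the more demanding of the two measures.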

**Figure 2. Methods comparison of genotype discordance and imputation accuracy using the CG2 data**
Panel (a) shows the whole-genome genotype discordance of Beagle (green), Thunder (orange) and SHAPEIT2 using a 2.5M-SNP haplotype scaffold (dark blue) at CG2 SNPs. Panel (b) shows the performance of the 3 call sets when imputing SNPs on chromosome 10 in 10 CG2 European samples typed on Illumina 1M and Omni2.5M chips. The x-axis shows the non-reference allele frequency of the SNP being imputed. The y-axis shows imputation accuracy measured by aggregate R². Panels (c) and (d) show results analogous to those in panels (a) and (b), respectively, for short bi-allelic indels instead of SNPs.
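
As a rough illustration of the y-axis quantity, the sketch below computes an aggregate R² per allele-frequency bin. It assumes the common definition in which true genotypes and imputed dosages are pooled across all variants in a bin before taking the squared Pearson correlation; the caption does not spell this out, and the helper names `pearson_r2` and `aggregate_r2` are hypothetical.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation undefined without variation
    return sxy * sxy / (sxx * syy)


def aggregate_r2(variants, bin_edges):
    """Pool genotypes and dosages per frequency bin, then compute R².

    variants: iterable of (frequency, true_genotypes, imputed_dosages).
    bin_edges: increasing frequencies; bins are half-open [lo, hi).
    """
    pooled = {}
    for freq, truth, dosage in variants:
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            if lo <= freq < hi:
                xs, ys = pooled.setdefault((lo, hi), ([], []))
                xs.extend(truth)
                ys.extend(dosage)
                break
    return {b: pearson_r2(xs, ys) for b, (xs, ys) in pooled.items()}


# Toy data: two rare variants imputed perfectly, one common variant imputed
# with some error.
variants = [
    (0.01, [0, 0, 1, 0], [0.0, 0.0, 1.0, 0.0]),
    (0.02, [0, 1, 0, 0], [0.0, 1.0, 0.0, 0.0]),
    (0.30, [0, 1, 2, 1], [0.1, 0.9, 1.8, 1.2]),
]
r2_by_bin = aggregate_r2(variants, [0.0, 0.05, 0.5])
for (lo, hi), r2 in sorted(r2_by_bin.items()):
    print(f"[{lo}, {hi}): R² = {r2:.3f}")
```

Pooling within a bin gives a stable estimate at rare variants, where a per-variant correlation would often be undefined or very noisy.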

**Figure 3. Imputation accuracy at SNPs and Indels using the CG2 data**
The imputation performance at SNPs and indels is shown with the orange and green lines, respectively. Performance at all indels, isolated indels and non-isolated indels is shown using plain, dashed and dotted lines, respectively. An indel is isolated when no other indel lies within its 50 bp flanking regions. The x-axis shows the non-reference allele frequency of the variant being imputed. The y-axis shows imputation accuracy measured by aggregate R².
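
The isolated/non-isolated split can be sketched in Python as follows. This is an illustration under assumptions the caption does not settle: each indel is reduced to a single coordinate on one chromosome (indel length is ignored), and an indel exactly 50 bp away counts as inside the flank. The function name `isolated_flags` is hypothetical.

```python
def isolated_flags(positions, flank=50):
    """Flag each indel as isolated (True) or non-isolated (False).

    positions: indel coordinates on a single chromosome, in any order.
    An indel is non-isolated if another indel lies within `flank` bp.
    """
    order = sorted(range(len(positions)), key=positions.__getitem__)
    flags = [True] * len(positions)
    # Comparing neighbours in sorted order is enough: if any indel falls
    # within the flank, the nearest one does.
    for a, b in zip(order, order[1:]):
        if positions[b] - positions[a] <= flank:
            flags[a] = flags[b] = False
    return flags


positions = [100, 130, 400, 1000, 1050]
flags = isolated_flags(positions)
print(list(zip(positions, flags)))
```

Only the indel at position 400 is flagged as isolated in this toy example; the other four each have a neighbour within 50 bp.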
