Nat Commun. 2014 Jun 13:5:3934.
doi: 10.1038/ncomms4934.

Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel

Olivier Delaneau et al. Nat Commun. 2014.

Abstract

A major use of the 1000 Genomes Project (1000 GP) data is genotype imputation in genome-wide association studies (GWAS). Here we develop a method to estimate haplotypes from low-coverage sequencing data that can take advantage of single-nucleotide polymorphism (SNP) microarray genotypes on the same samples. First, the SNP array data are phased to build a backbone (or 'scaffold') of haplotypes across each chromosome. We then phase the sequence data 'onto' this haplotype scaffold. This approach can take advantage of relatedness between sequenced and non-sequenced samples to improve accuracy. We use this method to create a new 1000 GP haplotype reference set for use by the human genetics community. Using a set of validation genotypes at SNPs and bi-allelic indels, we show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low-frequency variants.

Figures

Figure 1. Methods comparison of genotype discordance and imputation accuracy using the CG1 data
Panel (a) shows the discordance at chr20 CG1 SNP genotypes for Beagle (green), Thunder (orange) and SHAPEIT2 without a scaffold (light blue), with a 1M-SNP haplotype scaffold (medium blue) and with a 2.5M-SNP haplotype scaffold (dark blue). ALT stands for the discordance at genotypes involving at least one non-reference allele, and ALL for the overall discordance. Panel (b) shows the performance of these call sets when used as a reference panel to impute four CG1 European samples genotyped on the Illumina 1M SNP array. The x-axis shows the non-reference allele frequency of the SNP being imputed. The y-axis shows imputation accuracy measured by aggregate R².
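The ALL and ALT discordance measures in panel (a) can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `discordance`, the 0/1/2 genotype coding and the choice to define ALT on the validation genotype (rather than on the called genotype, or on either) are assumptions made here for illustration.

```python
import math


def discordance(called, truth):
    """Return (ALL, ALT) discordance between called and validation genotypes.

    Genotypes are coded as the number of non-reference alleles (0, 1 or 2);
    None marks a missing genotype and the pair is skipped.
    Assumption: ALT is restricted to pairs whose validation genotype carries
    at least one non-reference allele.
    """
    pairs = [(c, t) for c, t in zip(called, truth)
             if c is not None and t is not None]
    if not pairs:
        raise ValueError("no genotype pairs to compare")

    # ALL: fraction of compared genotypes that disagree.
    all_disc = sum(c != t for c, t in pairs) / len(pairs)

    # ALT: same fraction, restricted to non-reference validation genotypes.
    alt_pairs = [(c, t) for c, t in pairs if t > 0]
    if alt_pairs:
        alt_disc = sum(c != t for c, t in alt_pairs) / len(alt_pairs)
    else:
        alt_disc = math.nan
    return all_disc, alt_disc


if __name__ == "__main__":
    called = [0, 0, 1, 2, 1, 0]
    truth = [0, 0, 1, 1, 2, 0]
    print(discordance(called, truth))
```

Because most genotypes at low-frequency sites are homozygous reference and easy to call, ALT discordance is usually the more demanding of the two measures.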
Figure 2. Methods comparison of genotype discordance and imputation accuracy using the CG2 data
Panel (a) shows the whole-genome genotype discordance at CG2 SNPs for Beagle (green), Thunder (orange) and SHAPEIT2 using a 2.5M-SNP haplotype scaffold (dark blue). Panel (b) shows the performance of the three call sets when used to impute SNPs on chromosome 10 in 10 CG2 European samples typed on Illumina 1M and Omni2.5M chips. The x-axis shows the non-reference allele frequency of the SNP being imputed. The y-axis shows imputation accuracy measured by aggregate R². Panels (c) and (d) show the same comparisons as panels (a) and (b), respectively, for short bi-allelic indels instead of SNPs.
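The aggregate R² on the y-axis can be sketched as follows, assuming it is the squared Pearson correlation between imputed dosages and validation genotypes, pooled over all variants that fall in the same non-reference allele frequency bin. The captions do not spell out the exact definition, so the pooling scheme, the function names and the bin edges below are illustrative assumptions.

```python
import math
from collections import defaultdict


def pearson_r2(x, y):
    """Squared Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return math.nan  # undefined when either vector is constant
    return sxy * sxy / (sxx * syy)


def aggregate_r2(records, bin_edges):
    """Pool dosages and validation genotypes per frequency bin, then compute R².

    records: iterable of (frequency, imputed_dosages, validation_genotypes),
             one entry per variant.
    bin_edges: ascending edges; bin i covers [bin_edges[i], bin_edges[i + 1]).
    """
    pooled = defaultdict(lambda: ([], []))
    for freq, dosages, truth in records:
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= freq < bin_edges[i + 1]:
                pooled[i][0].extend(dosages)
                pooled[i][1].extend(truth)
                break
    return {(bin_edges[i], bin_edges[i + 1]): pearson_r2(xs, ys)
            for i, (xs, ys) in sorted(pooled.items())}


if __name__ == "__main__":
    records = [
        (0.01, [0.1, 0.9, 0.0, 0.0], [0, 1, 0, 0]),
        (0.02, [0.0, 0.2, 1.7, 0.0], [0, 0, 2, 0]),
        (0.30, [1.0, 1.8, 0.1, 1.2], [1, 2, 0, 1]),
    ]
    for bounds, r2 in aggregate_r2(records, [0.0, 0.05, 0.5]).items():
        print(bounds, round(r2, 3))
```

Pooling across variants before computing the correlation keeps the measure defined in low-frequency bins, where many individual variants have no non-reference validation genotypes and a per-variant R² would be undefined.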
Figure 3. Imputation accuracy at SNPs and Indels using the CG2 data
The imputation performance at SNPs and indels is shown with the orange and green lines, respectively. Performance at all indels, isolated indels and non-isolated indels is shown using plain, dashed and dotted lines, respectively. An indel is isolated when no other indel lies in its 50 bp flanking regions. The x-axis shows the non-reference allele frequency of the SNP being imputed. The y-axis shows imputation accuracy measured by aggregate R².
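The isolated versus non-isolated split can be sketched in Python as below. The caption gives only the 50 bp rule, so this sketch makes two assumptions: distance is measured between indel start positions (indel length is ignored), and a neighbour exactly 50 bp away counts as inside the flanking region. The function name `classify_isolated` is hypothetical.

```python
def classify_isolated(positions, flank=50):
    """Flag each indel as isolated (True) or non-isolated (False).

    positions: indel start positions on a single chromosome, in any order.
    flank: flanking distance in base pairs; another indel at distance
           <= flank makes both indels non-isolated (assumed convention).
    """
    # Visit indels in coordinate order while remembering the input order.
    order = sorted(range(len(positions)), key=positions.__getitem__)
    isolated = [True] * len(positions)

    # Checking adjacent indels suffices: if the nearest neighbour is outside
    # the flank, every other indel is too.
    for a, b in zip(order, order[1:]):
        if positions[b] - positions[a] <= flank:
            isolated[a] = False
            isolated[b] = False
    return isolated


if __name__ == "__main__":
    print(classify_isolated([100, 130, 500, 551, 1000]))
```

For multi-chromosome data the function would be applied to each chromosome separately, since positions on different chromosomes are not comparable.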

