Bioinformatics. 2011 Nov 1;27(21):2987-93.
doi: 10.1093/bioinformatics/btr509. Epub 2011 Sep 8.

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data

Heng Li. Bioinformatics.

Abstract

Motivation: Most existing methods for DNA sequence analysis rely on accurate sequences or genotypes. However, in applications of next-generation sequencing (NGS), accurate genotypes may not be easily obtained (e.g. in multi-sample low-coverage sequencing or somatic mutation discovery). These applications call for new methods for analyzing sequence data with uncertainty.

Results: We present a statistical framework for calling SNPs, discovering somatic mutations, inferring population genetical parameters and performing association tests directly based on sequencing data without explicit genotyping or linkage-based imputation. On real data, we demonstrate that our method achieves comparable accuracy to alternative methods for estimating site allele count, for inferring allele frequency spectrum and for association mapping. We also highlight the necessity of using symmetric datasets for finding somatic mutations and confirm that for discovering rare events, mismapping is frequently the leading source of errors.

Availability: http://samtools.sourceforge.net.

Contact: hengli@broadinstitute.org.

Figures

Fig. 1. Correlation of the site allele count accuracy with LD. The site allele count is estimated with Beagle imputation (solid line) and with Equation (16) (dashed line) at sites typed by the Omni genotyping chip. For each Omni SNP, the maximum r² LD statistic between the SNP and 20 nearby SNPs called by SAMtools (10 upstream and 10 downstream) is computed from imputed genotypes. Omni SNPs are then ordered by the maximum r² and approximately evenly divided into 15 bins. For each bin, the RMSD between the Omni allele count and the estimated allele count is computed as a measurement of the allele count accuracy.
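The binning procedure described in the Fig. 1 caption can be sketched as follows. This is only an illustration under stated assumptions, not the paper's code: the function name `rmsd_by_ld_bin` and its list-based inputs are hypothetical, and the per-site maximum r² values and allele counts are assumed to have been computed elsewhere.

```python
import math

def rmsd_by_ld_bin(max_r2, chip_count, est_count, n_bins=15):
    """Order sites by maximum r2, split them into n_bins nearly equal bins,
    and return (mean max r2, RMSD) for each non-empty bin.

    All three inputs are parallel lists with one entry per site
    (hypothetical layout, chosen for this sketch).
    """
    # Site indices sorted by increasing maximum r2.
    order = sorted(range(len(max_r2)), key=lambda i: max_r2[i])
    n = len(order)
    results = []
    for b in range(n_bins):
        # Integer arithmetic keeps bin sizes within one site of each other.
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        idx = order[lo:hi]
        if not idx:
            continue
        mse = sum((chip_count[i] - est_count[i]) ** 2 for i in idx) / len(idx)
        mean_r2 = sum(max_r2[i] for i in idx) / len(idx)
        results.append((mean_r2, math.sqrt(mse)))
    return results
```

Plotting the RMSD against the mean r² of each bin would give a curve of the kind the caption describes, one per estimation method.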
Fig. 2. The derived AFS conditional on heterozygotes discovered in the NA18507 genome (Bentley et al., 2008; AC:SRA000271). Heterozygotes were called with SAMtools on BWA (Li and Durbin, 2009) alignment. The ancestral sequences were determined from the Ensembl EPO alignment (Paten et al., 2008), with the requirement of the chimpanzee and orangutan sequences being identical. The AFS at these heterozygotes was computed in three ways: (i) from the nine independent Yoruba individuals sequenced by Complete Genomics (Drmanac et al., 2010) and analyzed using CGA Tools version 1.10.0; (ii) from nine random Pilot-1 Yoruba individuals released by the 1000 Genomes Project using the EM-AFS method; and (iii) from the same nine Pilot-1 individuals using site-AFS.
Fig. 3. QQ-plot comparing the association test statistics to the one-degree and the two-degree χ² distribution. The 49 CEU samples sequenced by the 1000 Genomes Project using the Illumina technology were randomly assigned to two groups of size 24 and 25, respectively. (A) Two association test statistics were computed on chromosome 20 between the two groups: one by the one-degree likelihood ratio test [Equation (11)] and the other by the canonical one-degree χ² test based on Beagle imputed genotypes; (B) the two-degree likelihood ratio test statistic [Equation (12)].
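As a rough illustration of how the QQ-plot points in Fig. 3 could be generated, the sketch below pairs sorted test statistics with the expected quantiles of a χ² distribution with one or two degrees of freedom. The function names and the (i + 0.5)/n plotting positions are assumptions made for this example; the paper does not state which plotting positions were used. Only the Python standard library is needed because both quantile functions have closed forms.

```python
import math
from statistics import NormalDist

def chi2_quantile(p, df):
    """Quantile of the chi-square distribution for df = 1 or 2 only."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if df == 1:
        # A chi-square(1) variable is the square of a standard normal.
        z = NormalDist().inv_cdf((1.0 + p) / 2.0)
        return z * z
    if df == 2:
        # Chi-square(2) is exponential with mean 2.
        return -2.0 * math.log(1.0 - p)
    raise ValueError("only df = 1 or df = 2 is handled in this sketch")

def qq_points(stats, df):
    """Return (expected quantile, observed statistic) pairs, sorted ascending."""
    obs = sorted(stats)
    n = len(obs)
    return [(chi2_quantile((i + 0.5) / n, df), x) for i, x in enumerate(obs)]
```

Under the null hypothesis, which the random assignment of samples to two groups is meant to produce, the points should fall close to the diagonal.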

References

    1. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073.
    2. Ajay SS, et al. Accurate and comprehensive sequencing of personal genomes. Genome Res. 2011;21:1498–1505.
    3. Bentley DR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59.
    4. Brent RP. Algorithms for Minimization without Derivatives. Englewood Cliffs, New Jersey: Prentice-Hall; 1973.
    5. Browning BL, Yu Z. Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. Am. J. Hum. Genet. 2009;85:847–861.
