Genetics. 2013 Oct;195(2):553-61.
doi: 10.1534/genetics.113.154500. Epub 2013 Aug 9.

A novel approach to estimating heterozygosity from low-coverage genome sequence

Katarzyna Bryc et al. Genetics. 2013 Oct.

Abstract

High-throughput shotgun sequence data make it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual's genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an individual are limited to low coverage, preventing the confident calling of genotypes necessary to directly count the proportion of heterozygous sites. Here, we present a method for estimating an individual's genome-wide rate of heterozygosity from low-coverage sequence data, without an intermediate step that calls genotypes. Our method jointly learns the shared allele distribution between the individual and a panel of other individuals, together with the sequencing error distributions and the reference bias. We show our method works well, first, by its performance on simulated sequence data and, second, on real sequence data where we obtain estimates using low-coverage data consistent with those from higher coverage. We apply our method to obtain estimates of the rate of heterozygosity for 11 humans from diverse worldwide populations and through this analysis reveal the complex dependency of local sequencing coverage on the true underlying heterozygosity, which complicates the estimation of heterozygosity from sequence data. We show how we can use filters to correct for the confounding arising from sequencing depth. We find in practice that ratios of heterozygosity are more interpretable than absolute estimates and show that we obtain excellent conformity of ratios of heterozygosity with previous estimates from higher-coverage data.

Keywords: heterozygosity; low-coverage sequence data.
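The idea of estimating heterozygosity without first calling genotypes can be illustrated with a deliberately simplified sketch. It is not the method of the paper: it ignores the reference panel, the learned error distributions, and the reference bias, and it assumes a fixed, known sequencing error rate and only two hidden states per site (homozygous reference or heterozygous). All function names and parameters below are hypothetical. Each site is summarized by its read depth and its count of non-reference reads, and an expectation-maximization (EM) loop estimates the fraction of heterozygous sites from the mixture of the two binomial read-count models.

```python
import math
import random


def binom_pmf(k, n, p):
    """Probability of k successes in n Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def estimate_heterozygosity(sites, error_rate=0.01, n_iter=200, tol=1e-12):
    """Estimate the heterozygous fraction h from (depth, alt_reads) pairs.

    Simplifying assumptions (not from the paper): a site is either
    homozygous reference, where alt reads arise only from errors at
    `error_rate`, or heterozygous, where each read is alt with
    probability 0.5. No genotype is ever called.
    """
    # Likelihood of the observed reads under each hidden state.
    liks = [(binom_pmf(k, n, error_rate), binom_pmf(k, n, 0.5)) for n, k in sites]
    h = 0.01  # arbitrary starting value
    for _ in range(n_iter):
        # E-step: posterior probability that each site is heterozygous.
        posterior_sum = 0.0
        for lik_hom, lik_het in liks:
            weighted_het = h * lik_het
            posterior_sum += weighted_het / (weighted_het + (1 - h) * lik_hom)
        # M-step: the new h is the mean posterior across sites.
        new_h = posterior_sum / len(liks)
        converged = abs(new_h - h) < tol
        h = new_h
        if converged:
            break
    return h


def simulate_sites(n_sites, h, depth, error_rate, rng):
    """Simulate (depth, alt_reads) pairs at a fixed depth for testing."""
    sites = []
    for _ in range(n_sites):
        p_alt = 0.5 if rng.random() < h else error_rate
        alt_reads = sum(rng.random() < p_alt for _ in range(depth))
        sites.append((depth, alt_reads))
    return sites


if __name__ == "__main__":
    rng = random.Random(1)
    data = simulate_sites(20000, h=0.05, depth=4, error_rate=0.01, rng=rng)
    print(estimate_heterozygosity(data, error_rate=0.01))
```

The simulated heterozygosity of 0.05 is far higher than human values and is chosen only so that a small simulation contains enough heterozygous sites. At 4X depth a heterozygous site shows no non-reference read with probability 1/16, which is why counting apparent heterozygotes underestimates the rate while a likelihood-based estimate can account for the missed sites.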

Figures

Figure 1
True vs. estimated rates of heterozygosity for 100 simulated read data sets. (A) Each data set has been downsampled to different coverage levels, denoted by symbol color and shape. The red line corresponds to true = estimated or perfect estimation of heterozygosity. (B) Run-by-run differences between true and estimated heterozygosity rates, stratified by downsampling coverage. The y-axis shows the percentage of error from the true value of heterozygosity.
Figure 2
Our EM heterozygosity estimates (red) and MlRho estimates (blue) on the regions of a San individual genome sequenced to 30–45X and randomly downsampled. At higher coverage, both methods converge to an estimate of 7.45 × 10⁻⁴. We note that our estimates for 4X and 5X coverage are much more accurate than those of MlRho. Results for <4X coverage were not possible to obtain from MlRho.
Figure 3
Estimates of heterozygosity for CEU trio individual NA12892, using a variety of reference panel compositions. (A) Heterozygosity estimates for each of the reference panels composed of five individuals from different populations as described in Proof of principle 2: Downsampling high-coverage genomes. (B) Heterozygosity estimates using different reference panel size (x-axis) and downsampled to different coverage (symbol color/shape).
Figure 4
Estimates of heterozygosity for each of the 11 present-day human genomes and Denisova, where each individual is denoted by a unique color. Relative coverage is defined as the lower bound of the sequencing bin, divided by the mean sequencing depth for the individual. (A) Heterozygosity estimates are consistent across downsampling levels. Downsampling to 5X, 10X, and 20X levels is denoted by line type. Each individual is denoted by line color. (B) All individuals show an increase in estimated heterozygosity at higher (and lower) relative coverage. (C) Effect of removing known regions with segmental duplications. Estimates of heterozygosity are shown for a sample of five of the individuals. Without filtering, estimates for each bin are shown with solid lines. After exclusion of regions within known copy-number-variable and segmental duplications, the heterozygosity estimates display a flatter distribution (dashed lines).
Figure 5
Inferred distribution of homozygous ancestral (red), heterozygous (yellow), and homozygous derived (blue) sites for the San HGDP individual. The y-axis is presented on a log scale, and counts with expected value <0.1 have been omitted from the plot.
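The relative coverage used in the Figure 4 caption (lower bound of the sequencing-depth bin divided by the individual's mean sequencing depth) can be written as a one-line computation. The function name and the example bin edges below are hypothetical and serve only to illustrate the definition.

```python
def relative_coverage(bin_lower_bounds, mean_depth):
    """Divide each depth-bin lower bound by the individual's mean depth."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    return [lower / mean_depth for lower in bin_lower_bounds]


if __name__ == "__main__":
    # Hypothetical bins for an individual sequenced to a mean depth of 30X.
    print(relative_coverage([0, 15, 30, 45], mean_depth=30.0))
```

A relative coverage of 1.0 then marks the bin that starts at the individual's mean depth, which makes bins comparable across individuals sequenced to different depths.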

References

    1. Chen G., Marjoram P., Wall J., 2009. Fast and flexible simulation of DNA sequence data. Genome Res. 19: 136–142.
    2. Gutenkunst R., Hernandez R., Williamson S., Bustamante C., 2009. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5: e1000695.
    3. Haubold B., Pfaffelhuber P., Lynch M., 2010. mlRho—a program for estimating the population mutation and recombination rates from shotgun-sequenced diploid genomes. Mol. Ecol. 19: 277–284.
    4. Hellmann I., Mang Y., Gu Z., Li P., Francisco M., et al., 2008. Population genetic analysis of shotgun assemblies of genomic sequences from multiple individuals. Genome Res. 18: 1020–1029.
    5. Jakobsson M., Scholz S. W., Scheet P., Gibbs J. R., VanLiere J. M., et al., 2008. Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451: 998–1003.
