Mol Biol Evol. 2013 May;30(5):1145-58.
doi: 10.1093/molbev/mst016. Epub 2013 Jan 30.

Maximum likelihood estimation of frequencies of known haplotypes from pooled sequence data

Darren Kessner et al. Mol Biol Evol. 2013 May.

Abstract

DNA samples are often pooled, either by experimental design or because the sample itself is a mixture. For example, when population allele frequencies are of primary interest, individual samples may be pooled together to lower the cost of sequencing. Alternatively, the sample itself may be a mixture of multiple species or strains (e.g., bacterial species comprising a microbiome or pathogen strains in a blood sample). We present an expectation-maximization algorithm for estimating haplotype frequencies in a pooled sample directly from mapped sequence reads, in the case where the possible haplotypes are known. This method is relevant to the analysis of pooled sequencing data from selection experiments, as well as the calculation of proportions of different species within a metagenomics sample. Our method outperforms existing methods based on single-site allele frequencies, as well as simple approaches using sequence read data. We have implemented the method in a freely available open-source software tool.
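The estimation described in the abstract can be illustrated with a generic EM iteration for mixture proportions. This is a minimal sketch, not the authors' implementation: it assumes the per-read likelihoods P(read | haplotype) have already been computed, and the function name `em_haplotype_frequencies` and its parameters are hypothetical.

```python
def em_haplotype_frequencies(likelihoods, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies from per-read likelihoods.

    likelihoods[r][h] is P(read r | haplotype h); how these values are
    obtained is outside this sketch (assumed input).
    """
    n_hap = len(likelihoods[0])
    freqs = [1.0 / n_hap] * n_hap  # start from the uniform distribution
    for _ in range(n_iter):
        counts = [0.0] * n_hap
        # E-step: posterior probability that each read came from each haplotype
        for row in likelihoods:
            weighted = [f * l for f, l in zip(freqs, row)]
            total = sum(weighted)
            if total == 0.0:
                continue  # read is incompatible with every known haplotype
            for h, w in enumerate(weighted):
                counts[h] += w / total
        n_used = sum(counts)
        if n_used == 0.0:
            raise ValueError("no read is compatible with any haplotype")
        # M-step: new frequencies are the normalized expected read counts
        new_freqs = [c / n_used for c in counts]
        delta = max(abs(a - b) for a, b in zip(freqs, new_freqs))
        freqs = new_freqs
        if delta < tol:
            break
    return freqs
```

With four reads that each match exactly one of three haplotypes (one, one, and two reads respectively), the estimate converges to frequencies of 0.25, 0.25, and 0.5.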

Figures

Fig. 1.
Haplotype information from individual reads can be combined across a genomic region to obtain haplotype frequency estimates. In this cartoon, there are four known haplotypes (black, green, blue, and orange), with sequence data coming from a pool containing 25% green, 25% blue, and 50% orange haplotypes. Each read is probabilistically assigned to the known haplotypes. Some reads can be assigned with great certainty, for example, the reads coming from the blue haplotype that cover two neighboring variant sites. Other reads (represented by two colors) are assigned with less certainty.
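The probabilistic assignment of a single read, as described in the Fig. 1 caption, might be sketched as follows. The error model is an assumption made for illustration (independent base calls, with an erroneous call equally likely to be any of the three other bases), and the function names are hypothetical; the caption does not specify the model used in the paper.

```python
def read_likelihood(read_bases, error_probs, haplotype_bases):
    """P(read | haplotype) over the variant sites the read covers.

    Assumed error model: a base call is correct with probability 1 - e
    and equals each of the three other bases with probability e / 3.
    """
    lik = 1.0
    for base, err, ref in zip(read_bases, error_probs, haplotype_bases):
        lik *= (1.0 - err) if base == ref else err / 3.0
    return lik


def assign_read(read_bases, error_probs, haplotypes, freqs):
    """Posterior probability of each known haplotype given one read."""
    weighted = [
        f * read_likelihood(read_bases, error_probs, hap)
        for hap, f in zip(haplotypes, freqs)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


# A read covering two neighboring variant sites is assigned with high certainty.
confident = assign_read("AC", [0.01, 0.01], ["AC", "AG", "TG"], [1 / 3] * 3)

# A read covering only the first site cannot separate the first two haplotypes.
ambiguous = assign_read("A", [0.01], ["A", "A", "T"], [1 / 3] * 3)
```

Here `confident` puts nearly all weight on the first haplotype, while `ambiguous` splits its weight equally between the first two, which corresponds to the two-color reads in the cartoon.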
Fig. 2.
Comparison of the EM algorithm to known allele-frequency-based and simple-sequence-based methods. Each algorithm was run on simulated pooled 100-bp paired-end sequence data from 20 haplotypes at 200× coverage, with 100 replicates for each region width.
Fig. 3.
Performance of the EM algorithm increases with coverage, region width, and read length and is robust to sequencing errors. (A) Performance of the EM algorithm increases with both coverage and width of the region used for the estimation. (B) The EM algorithm performs better with longer reads, which provide more haplotype information. (C) The EM algorithm maintains good performance with increasing sequence read error rate. Empirical error rates were found to be in the range of 0.05–0.07 errors per base call. In all simulations, we simulated paired-end pooled sequence data from 162 haplotypes at randomly drawn frequencies, with 100 replicates per parameter value level. Nonvarying parameters were held at fixed values representative of our experimental data (read length 100 bp, read error rate 0.06, coverage 200×, and region width 200 kb).
Fig. 4.
The EM algorithm performs best when the true frequency distribution has low entropy (nonuniform, with a few haplotypes at high frequencies and the rest at low frequencies). The algorithm was run on simulated pooled 100-bp paired-end sequence data from 162 haplotypes at 200× coverage in a 200 kb region (550 replicates binned by Shannon entropy in natural log units).
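For reference, the Shannon entropy of a frequency vector in natural log units is H = -Σ p_i ln p_i. A short sketch (the function name is illustrative):

```python
import math


def shannon_entropy(freqs):
    """Shannon entropy in nats; zero-frequency haplotypes contribute nothing."""
    return -sum(p * math.log(p) for p in freqs if p > 0)


# The uniform distribution over 162 haplotypes attains the maximum, ln(162).
uniform_entropy = shannon_entropy([1 / 162] * 162)
```

For 162 haplotypes the entropy therefore ranges from 0 (a single haplotype at frequency 1) to ln(162), about 5.09 nats.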
Fig. 5.
Recalibration of base quality scores using monomorphic sites improves performance. (A) Reported base quality scores do not match empirical scores calculated from real data using monomorphic sites. (B) The EM algorithm was run with and without base quality score recalibration on simulated pooled 100-bp paired-end sequence data from 162 haplotypes at 200× coverage in a 200 kb region (100 replicates each). Sequence errors in the simulated data were introduced with probabilities given by the empirical error rates.
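One way to derive empirical quality scores from monomorphic sites is sketched below. It relies on the standard Phred relation Q = -10 log10(p) and on the idea that, at a site where all haplotypes carry the same allele, any other base call must be a sequencing error. The grouping by reported score, the add-one smoothing, and the function name are assumptions of this sketch; the caption does not describe the paper's exact procedure.

```python
import math
from collections import defaultdict


def empirical_quality_table(observations):
    """Map each reported Phred score to an empirical Phred score.

    observations: iterable of (reported_q, is_error) pairs collected at
    monomorphic sites, where is_error is True when the base call differs
    from the single allele present in the pool.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_error in observations:
        totals[reported_q] += 1
        if is_error:
            errors[reported_q] += 1
    table = {}
    for q, n in totals.items():
        # Add-one smoothing (an assumption) keeps the rate away from 0 and 1.
        rate = (errors[q] + 1) / (n + 2)
        table[q] = -10.0 * math.log10(rate)
    return table


# Hypothetical data: 1000 calls reported at Q30, 98 of which are mismatches.
obs = [(30, i < 98) for i in range(1000)]
table = empirical_quality_table(obs)
```

In this made-up example the calls reported at Q30 have an empirical score of roughly Q10, the kind of mismatch between reported and empirical scores that panel A illustrates.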
Fig. 6.
Performance of the EM algorithm on the calculation of species-level abundances from 16S rRNA sequence data. The algorithm was run on simulated 75-bp single-end 16S sequence data from pools of 200 randomly chosen microbial species, with 100 replicates for each coverage level.
Fig. 7.
Large numbers of unknown unrelated species do not significantly affect the estimate of within-genus species frequencies. The EM algorithm with haplotype likelihood filter was run on simulated 75-bp single-end 16S sequence reads from 500 species (20 Clostridium species [known] and 480 non-Clostridium species [unknown]). “Unknown proportion” is the total proportion of reads coming uniformly at random from the 480 unknown species, with the remainder of the reads coming from the 20 known species (100× pooled coverage, 100 replicates for each unknown proportion level).
Fig. 8.
Sequence reads from unknown unrelated species push frequency estimates toward the uniform distribution; filtering the reads based on haplotype likelihood minimizes this effect. (A) Sequence reads from unknown unrelated species have low maximum haplotype likelihoods, giving rise to the long left tail of the distribution. By calculating the theoretical distribution (blue) of maximum haplotype likelihoods based on the base quality scores from the sequence data, reads whose maximum haplotype likelihood falls below a specified threshold (red, z-score threshold = −2 in this example) can be filtered out. (B) A typical example of this effect, with 50% of the reads coming from the unknown species. (C) Without filtering on the haplotype likelihoods, error in frequency estimates increases with higher proportion of unknown sequence reads. Seventy-five-base-pair single-end 16S sequence reads were simulated from 20 species, with varying proportions of reads from an unknown unrelated species (100× pooled coverage, 100 replicates per unknown proportion level).
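The haplotype likelihood filter in panel A might be approximated as follows. This is a sketch under stated assumptions, not the paper's method: it treats base calls as independent, uses the same error model as a simple substitution model (error probability e split evenly over three alternative bases), and summarizes the theoretical distribution of the log-likelihood by its mean and variance to form a z-score. All function names are hypothetical.

```python
import math


def loglik_moments(error_probs):
    """Mean and variance of a read's log-likelihood under its true haplotype.

    Each base contributes log(1 - e) with probability 1 - e (correct call)
    or log(e / 3) with probability e (erroneous call). Requires 0 < e < 1.
    """
    mean = 0.0
    var = 0.0
    for e in error_probs:
        log_match = math.log(1.0 - e)
        log_mismatch = math.log(e / 3.0)
        m1 = (1.0 - e) * log_match + e * log_mismatch
        m2 = (1.0 - e) * log_match ** 2 + e * log_mismatch ** 2
        mean += m1
        var += m2 - m1 * m1
    return mean, var


def passes_filter(max_loglik, error_probs, z_threshold=-2.0):
    """Keep a read unless its best haplotype log-likelihood is unusually low."""
    mean, var = loglik_moments(error_probs)
    if var <= 0.0:
        return max_loglik >= mean  # degenerate case: no spread to scale by
    z = (max_loglik - mean) / math.sqrt(var)
    return z >= z_threshold


# 75-bp read with a per-base error probability of 0.06 (illustrative values).
errs = [0.06] * 75
perfect_match = 75 * math.log(0.94)
twenty_mismatches = 55 * math.log(0.94) + 20 * math.log(0.02)
keep_good = passes_filter(perfect_match, errs)
keep_bad = passes_filter(twenty_mismatches, errs)
```

A read that matches its best haplotype at every base passes the filter, while a read with 20 mismatches to its best haplotype falls far below the threshold and is discarded, as would be expected for reads from an unrelated species.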
Fig. 9.
When the pool contains an unknown species that is related to one of the known species, sequence reads from the unknown increase the frequency estimate of the most closely related known species but have little effect on the estimation of the relative frequencies of the other known species. (A) Sequence reads from an unknown species (black) related to one of the known species (orange) contribute to the frequency estimate of that species. Shown is a typical example of this effect, with 50% of the reads coming from the unknown species. The relative abundances of the other known species are estimated accurately. (B) Presence of the unknown species has little effect on the estimation of the relative frequencies of the other (unrelated) known species. Seventy-five-base-pair single-end 16S sequence reads were simulated from 20 known species and 1 unknown related species (100× pooled coverage, 100 replicates for each unknown proportion level).
