Mol Biol Evol. 2013 May;30(5):1145-58.

doi: 10.1093/molbev/mst016. Epub 2013 Jan 30.

Maximum likelihood estimation of frequencies of known haplotypes from pooled sequence data

Darren Kessner¹, Thomas L Turner, John Novembre

Affiliation

¹ Bioinformatics Interdepartmental Program, University of California, Los Angeles, USA.

PMID: 23364324
PMCID: PMC3670732
DOI: 10.1093/molbev/mst016

Abstract

DNA samples are often pooled, either by experimental design or because the sample itself is a mixture. For example, when population allele frequencies are of primary interest, individual samples may be pooled together to lower the cost of sequencing. Alternatively, the sample itself may be a mixture of multiple species or strains (e.g., bacterial species comprising a microbiome or pathogen strains in a blood sample). We present an expectation-maximization algorithm for estimating haplotype frequencies in a pooled sample directly from mapped sequence reads, in the case where the possible haplotypes are known. This method is relevant to the analysis of pooled sequencing data from selection experiments, as well as the calculation of proportions of different species within a metagenomics sample. Our method outperforms existing methods based on single-site allele frequencies, as well as simple approaches using sequence read data. We have implemented the method in a freely available open-source software tool.
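
The core idea of estimating mixture proportions by expectation-maximization can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the per-read likelihoods P(read | haplotype) have already been computed, and the function name `em_haplotype_frequencies` and its parameters are hypothetical.

```python
def em_haplotype_frequencies(read_likelihoods, n_iter=500, tol=1e-10):
    """Estimate haplotype frequencies from a reads-by-haplotypes likelihood table.

    read_likelihoods[i][j] is assumed to hold P(read i | haplotype j).
    Every read must have a nonzero likelihood under at least one haplotype.
    """
    n_reads = len(read_likelihoods)
    n_haps = len(read_likelihoods[0])
    freqs = [1.0 / n_haps] * n_haps  # start from the uniform distribution
    for _ in range(n_iter):
        totals = [0.0] * n_haps
        for row in read_likelihoods:
            # E-step: posterior probability that this read came from each haplotype
            weights = [f * lik for f, lik in zip(freqs, row)]
            norm = sum(weights)
            for j, w in enumerate(weights):
                totals[j] += w / norm
        # M-step: new frequency = average posterior assignment across reads
        new_freqs = [t / n_reads for t in totals]
        converged = max(abs(a - b) for a, b in zip(new_freqs, freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs
```

In this sketch, reads that are compatible with several haplotypes are shared among them in proportion to the current frequency estimates, so the estimates and the read assignments refine each other across iterations.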

Figures

**Fig. 1.**
Haplotype information from individual reads can be combined across a genomic region to obtain haplotype frequency estimates. In this cartoon, there are four known haplotypes (black, green, blue, and orange), with sequence data coming from a pool containing 25% green, 25% blue, and 50% orange haplotypes. Each read is probabilistically assigned to the known haplotypes. Some reads can be assigned with great certainty, for example, the reads coming from the blue haplotype that cover two neighboring variant sites. Other reads (represented by two colors) are assigned with less certainty.
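
A small sketch of what probabilistic assignment of a single read might look like. It assumes a simple independent-error model in which a miscalled base is equally likely to be any of the three other nucleotides; that model, and the helper names `read_likelihood` and `assign_read`, are illustrative assumptions rather than details taken from the paper.

```python
import math

def read_likelihood(read_bases, hap_bases, error_rates):
    """P(read | haplotype) under an assumed independent per-base error model."""
    log_lik = 0.0
    for r, h, e in zip(read_bases, hap_bases, error_rates):
        # A match has probability 1 - e; a specific mismatch has probability e / 3.
        log_lik += math.log(1.0 - e) if r == h else math.log(e / 3.0)
    return math.exp(log_lik)

def assign_read(read_bases, haplotypes, error_rates, freqs):
    """Posterior probability of each haplotype given the read and current frequencies."""
    weights = [f * read_likelihood(read_bases, hap, error_rates)
               for hap, f in zip(haplotypes, freqs)]
    total = sum(weights)
    return [w / total for w in weights]
```

Under this model, a read covering two variant sites that together distinguish one haplotype receives a posterior close to 1 for that haplotype, while a read covering only sites shared by several haplotypes is split among them.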

**Fig. 2.**
Comparison of the EM algorithm to known allele-frequency-based and simple-sequence-based methods. Each algorithm was run on simulated pooled 100-bp paired-end sequence data from 20 haplotypes at 200× coverage, with 100 replicates for each region width.

**Fig. 3.**
Performance of the EM algorithm increases with coverage, region width, and read length and is robust to sequencing errors. (A) Performance of the EM algorithm increases with both coverage and width of the region used for the estimation. (B) The EM algorithm performs better with longer reads, which provide more haplotype information. (C) The EM algorithm maintains good performance with increasing sequence read error rate. Empirical error rates were found to be in the range of 0.05–0.07 errors per base call. In all simulations, we simulated paired-end pooled sequence data from 162 haplotypes at randomly drawn frequencies, with 100 replicates per parameter value level. Nonvarying parameters were held at fixed values representative of our experimental data (read length 100 bp, read error rate 0.06, coverage 200×, and region width 200 kb).

**Fig. 4.**
The EM algorithm performs best when the true frequency distribution has low entropy (nonuniform, with a few haplotypes at high frequencies, with the rest at low frequencies). The algorithm was run on simulated pooled 100-bp paired-end sequence data from 162 haplotypes at 200× coverage in a 200 kb region (550 replicates binned by Shannon entropy in natural log units).
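
For reference, the Shannon entropy in natural log units of a frequency vector can be computed as follows. This is a generic sketch of the standard formula, and the function name `shannon_entropy` is hypothetical.

```python
import math

def shannon_entropy(freqs):
    """Shannon entropy in nats: H = -sum(p * ln p), skipping zero-frequency entries."""
    return -sum(p * math.log(p) for p in freqs if p > 0)
```

With 162 haplotypes, the maximum possible entropy is ln(162), about 5.09 nats, which is reached only by the uniform distribution; a pool dominated by a few haplotypes has a much lower value.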

**Fig. 5.**
Recalibration of base quality scores using monomorphic sites improves performance. (A) Reported base quality scores do not match empirical scores calculated from real data using monomorphic sites. (B) The EM algorithm was run with and without base quality score recalibration on simulated pooled 100-bp paired-end sequence data from 162 haplotypes at 200× coverage in a 200 kb region (100 replicates each). Sequence errors in the simulated data were introduced with probabilities given by the empirical error rates.
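
One way to derive empirical quality scores from monomorphic sites is sketched below. At a site where all known haplotypes carry the same base, any mismatching base call must be a sequencing error, so mismatches can be tallied per reported quality score and converted to the Phred scale, Q = -10 log10(error rate). The input layout, the add-one pseudocount, and the function name `empirical_quality_table` are assumptions made for illustration, not details from the paper.

```python
import math
from collections import defaultdict

def empirical_quality_table(observations):
    """Map each reported quality score to an empirical Phred score.

    observations: iterable of (reported_q, is_mismatch) pairs, one per base
    call at a monomorphic site (assumed input format).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        errors[reported_q] += int(is_mismatch)
    table = {}
    for q in totals:
        # Add-one pseudocount (an assumption here) avoids log10(0) when no errors are seen.
        rate = (errors[q] + 1) / (totals[q] + 2)
        table[q] = -10.0 * math.log10(rate)
    return table
```

The resulting table can then replace the reported scores when converting base qualities to error probabilities.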

**Fig. 6.**
Performance of the EM algorithm on the calculation of species-level abundances from 16S rRNA sequence data. The algorithm was run on simulated 75-bp single-end 16S sequence data from pools of 200 randomly chosen microbial species, with 100 replicates for each coverage level.

**Fig. 7.**
Large numbers of unknown unrelated species do not significantly affect the estimate of within-genus species frequencies. The EM algorithm with haplotype likelihood filter was run on simulated 75-bp single-end 16S sequence reads from 500 species (20 *Clostridium* species [known] and 480 non-*Clostridium* species [unknown]). “Unknown proportion” is the total proportion of reads coming uniformly at random from the 480 unknown species, with the remainder of the reads coming from the 20 known species (100× pooled coverage, 100 replicates for each unknown proportion level).

**Fig. 8.**
Sequence reads from unknown unrelated species push frequency estimates toward the uniform distribution; filtering the reads based on haplotype likelihood minimizes this effect. (A) Sequence reads from unknown unrelated species have low maximum haplotype likelihoods, giving rise to the long left tail of the distribution. By calculating the theoretical distribution (blue) of maximum haplotype likelihoods based on the base quality scores from the sequence data, reads whose maximum haplotype likelihood falls below a specified threshold (red, z-score threshold = −2 in this example) can be filtered out. (B) A typical example of this effect, with 50% of the reads coming from the unknown species. (C) Without filtering on the haplotype likelihoods, error in frequency estimates increases with higher proportion of unknown sequence reads. Seventy-five-base-pair single-end 16S sequence reads were simulated from 20 species, with varying proportions of reads from an unknown unrelated species (100× pooled coverage, 100 replicates per unknown proportion level).
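
A z-score filter of this kind can be sketched as follows. The caption does not spell out how the theoretical distribution is derived, so this example makes its own assumption: each base is an independent Bernoulli error event with probability given by its quality score, which yields a mean and variance for the log-likelihood of a read under its true haplotype. The function names `loglik_zscore` and `keep_read` are hypothetical.

```python
import math

def loglik_zscore(observed_loglik, error_rates):
    """z-score of a read's maximum haplotype log-likelihood.

    Assumes each base contributes log(1 - e) with probability 1 - e (no error)
    or log(e / 3) with probability e (error), independently across bases.
    Error rates must lie strictly between 0 and 1.
    """
    mean = 0.0
    var = 0.0
    for e in error_rates:
        match_term = math.log(1.0 - e)
        error_term = math.log(e / 3.0)
        mean += (1.0 - e) * match_term + e * error_term
        var += e * (1.0 - e) * (match_term - error_term) ** 2
    return (observed_loglik - mean) / math.sqrt(var)

def keep_read(observed_loglik, error_rates, z_threshold=-2.0):
    """Keep reads whose z-score is at or above the threshold."""
    return loglik_zscore(observed_loglik, error_rates) >= z_threshold
```

Under these assumptions, a 75-bp read that matches a known haplotype at every base passes the filter, while a read with dozens of mismatches against its best-matching haplotype, as expected from an unrelated species, falls far below a threshold of −2.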

**Fig. 9.**
When the pool contains an unknown species that is related to one of the known species, sequence reads from the unknown increase the frequency estimate of the most closely related known species but have little effect on the estimation of the relative frequencies of the other known species. (A) Sequence reads from an unknown species (black) related to one of the known species (orange) contribute to the frequency estimate of that species. Shown is a typical example of this effect, with 50% of the reads coming from the unknown species. The relative abundances of the other known species are estimated accurately. (B) Presence of the unknown species has little effect on the estimation of the relative frequencies of the other (unrelated) known species. Seventy-five-base-pair single-end 16S sequence reads were simulated from 20 known species and 1 unknown related species (100× pooled coverage, 100 replicates for each unknown proportion level).

Cited by

"Select and Resequence" Methods Enable a Genome-Wide Association Study of the Dimorphic Human Fungal Pathogen Coccidioides posadasii.
Voorhies M, Joehnk B, Uehling J, Walcott K, Dubin CA, Mead HL, Homer CM, Galgiani JN, Barker BM, Brem RB, Sil A. Genome Biol Evol. 2025 Jul 3;17(7):evaf135. doi: 10.1093/gbe/evaf135. PMID: 40611625.
Phenotypic and genomic signatures of interspecies cooperation and conflict in naturally occurring isolates of a model plant symbiont.
Batstone RT, Burghardt LT, Heath KD. Proc Biol Sci. 2022 Jul 13;289(1978):20220477. doi: 10.1098/rspb.2022.0477. Epub 2022 Jul 13. PMID: 35858063.
Shifting the paradigm in Evolve and Resequence studies: From analysis of single nucleotide polymorphisms to selected haplotype blocks.
Barghi N, Schlötterer C. Mol Ecol. 2019 Feb;28(3):521-524. doi: 10.1111/mec.14992. PMID: 30793868.
Host-Associated Rhizobial Fitness: Dependence on Nitrogen, Density, Community Complexity, and Legume Genotype.
Burghardt LT, Epstein B, Hoge M, Trujillo DI, Tiffin P. Appl Environ Microbiol. 2022 Aug 9;88(15):e0052622. doi: 10.1128/aem.00526-22. Epub 2022 Jul 19. PMID: 35852362.
Detection of Pathogenic Microbe Composition Using Next-Generation Sequencing Data.
Zhao H, Wang S, Yuan X. Front Genet. 2020 Nov 30;11:603093. doi: 10.3389/fgene.2020.603093. eCollection 2020. PMID: 33329748.

