Front Epidemiol. 2022 Sep 23:2:943625.

doi: 10.3389/fepid.2022.943625. eCollection 2022.

A maximum-likelihood method to estimate haplotype frequencies and prevalence alongside multiplicity of infection from SNP data

Henri Christian Junior Tsoungui Obama¹, Kristan Alexander Schneider¹

PMID: 38455338
PMCID: PMC10911023
DOI: 10.3389/fepid.2022.943625

Henri Christian Junior Tsoungui Obama et al. Front Epidemiol. 2022.

Affiliation

¹ Department of Applied Computer- and Biosciences, University of Applied Sciences Mittweida, Mittweida, Germany.

Abstract

The introduction of genomic methods facilitated standardized molecular disease surveillance. For instance, SNP barcodes in Plasmodium vivax and Plasmodium falciparum malaria allow the characterization of haplotypes, their frequencies, and their prevalence to reveal temporal and spatial transmission patterns. A confounding factor is the presence of multiple genetically distinct pathogen variants within the same infection, known as multiplicity of infection (MOI). Disregarding ambiguous information, as is usually done in ad-hoc approaches, leads to less confident and biased estimates. We introduce a statistical framework to obtain maximum-likelihood estimates (MLE) of haplotype frequencies and prevalence alongside MOI from malaria SNP data, i.e., multiple biallelic marker loci. The number of model parameters increases geometrically with the number of genetic markers considered, and no closed-form solution exists for the MLE. Therefore, the MLE needs to be derived numerically. We use the Expectation-Maximization (EM) algorithm to derive the maximum-likelihood estimates, an efficient and easy-to-implement algorithm that yields a numerically stable solution. We also derive expressions for haplotype prevalence based on either all or just the unambiguous genetic information and compare both approaches. The latter corresponds to a biased ad-hoc estimate of prevalence. We assess the performance of our estimator by systematic numerical simulations assuming realistic sample sizes and various scenarios of transmission intensity. For reasonable sample sizes and numbers of loci, the method has little bias. As an example, we apply the method to a dataset from Cameroon on sulfadoxine-pyrimethamine resistance in P. falciparum malaria. The method is not confined to malaria and can be applied to any infectious disease with similar transmission behavior. An easy-to-use implementation of the method as an R-script is provided.

Keywords: EM-algorithm; complexity of infection (COI); drug resistance; haplotype phasing; malaria; multiplicity of infection (MOI); resistance markers; sulfadoxine-pyrimethamine (SP).
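To make the source of ambiguity concrete, the following is a minimal Python sketch of how SNP observations could arise under one plausible reading of the model. It is not the authors' R implementation. The assumptions are ours: MOI is drawn from a zero-truncated Poisson distribution with parameter `lam`, haplotypes are drawn independently according to their frequencies, and only the set of alleles at each locus is observed. All function names (`sample_moi`, `observe`, `simulate_infection`, `is_unambiguous`) and the example frequencies are hypothetical.

```python
import math
import random


def sample_moi(lam, rng):
    """Draw MOI from a zero-truncated Poisson(lam) by rejecting zeros (assumed model)."""
    limit = math.exp(-lam)
    while True:
        # Knuth's multiplication method for a single Poisson draw.
        k, p = 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                break
            k += 1
        if k > 0:
            return k


def observe(present):
    """Reduce a set of haplotypes (strings of '0'/'1') to per-locus allele sets."""
    n_loci = len(next(iter(present)))
    return tuple(frozenset(h[i] for h in present) for i in range(n_loci))


def simulate_infection(freqs, lam, rng):
    """Return (MOI, haplotypes actually present, unphased observation)."""
    haplotypes = list(freqs)
    m = sample_moi(lam, rng)
    drawn = rng.choices(haplotypes, weights=[freqs[h] for h in haplotypes], k=m)
    present = set(drawn)
    return m, present, observe(present)


def is_unambiguous(observed):
    """Haplotypes can be phased with certainty if at most one locus shows both alleles."""
    return sum(len(alleles) == 2 for alleles in observed) <= 1


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical frequencies for two biallelic loci (four haplotypes).
    freqs = {"00": 0.4, "01": 0.3, "10": 0.2, "11": 0.1}
    for _ in range(5):
        m, present, obs = simulate_infection(freqs, 1.5, rng)
        print(m, sorted(present), is_unambiguous(obs))
    # Two different infections that yield the same unphased observation:
    print(observe({"00", "11"}) == observe({"01", "10"}))
```

The last line illustrates why ambiguous samples cannot simply be phased by hand: the haplotype sets {00, 11} and {01, 10} both show both alleles at both loci. Discarding such samples, as in the ad-hoc approach the abstract mentions, uses only the observations for which `is_unambiguous` returns `True`.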

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
Ambiguity in haplotype information for two biallelic loci. Illustrated are three different infections from the same pathogen population. The first infection **(middle left)** is a super-infection with two haplotypes, i.e., MOI = 2. The corresponding observation **(bottom left)** provides only unphased (i.e., ambiguous) haplotype information. It is impossible to reconstruct with certainty the haplotypes actually present in the infection and the corresponding MOI. The second infection **(middle)** is a super-infection with two haplotypes transmitted one and two times, respectively, i.e., MOI = 3. From the observed information, the haplotypes present in the infection can be unambiguously phased. However, MOI remains unknown. The last infection **(middle right)** is similar to the second, but with MOI = 4.

**Figure 2**
Bias of frequency estimates: the symmetric case. Shown is the bias of the frequency estimates in % as a function of the mean MOI (i.e., for a range of Poisson parameters). The symmetric haplotype frequency distributions (cf. Table 1) for n = 2 **(A)** and n = 5 **(B)** are assumed. In both panels, only the bias for the first haplotype is shown (in the symmetric case, all haplotypes are equivalent and the bias looks similar). Colors correspond to different sample sizes.

**Figure 3**
Bias of frequency estimates: the unbalanced case. Shown is the bias of the frequency estimates in % as a function of the mean MOI (i.e., for a range of Poisson parameters). The skewed haplotype frequency distributions (cf. Table 1) for n = 2 **(A,B)** and n = 5 **(C,D)** are assumed. In both cases, only the bias for the predominant haplotype and one underrepresented haplotype is shown (all underrepresented haplotypes are equivalent and the bias looks similar). Colors correspond to different sample sizes.

**Figure 4**
Bias of frequency estimates: Kenya data 2005. Shown is the bias of the frequency estimates in % as a function of the mean MOI (i.e., for a range of Poisson parameters). The haplotype frequency distributions (cf. Table 2) for n = 10 are assumed. The bias for the predominant haplotype and a few underrepresented haplotypes is shown **(A–D)**; the corresponding haplotype frequencies are shown at the top of the panels. Colors correspond to different sample sizes.

**Figure 5**
Bias of frequency estimates: Kenya data 2010. Shown is the bias of the frequency estimates in % as a function of the mean MOI (i.e., for a range of Poisson parameters). The haplotype frequency distributions (cf. Table 2) for n = 10 are assumed. The bias for the predominant haplotype and a few underrepresented haplotypes is shown **(A–D)**; the corresponding haplotype frequencies are shown at the top of the panels. Colors correspond to different sample sizes.

**Figure 6**
Bias of MOI estimates. Shown is the bias of the mean MOI estimates ψ in % as a function of the true mean MOI (i.e., for a range of Poisson parameters). Symmetric haplotype frequency distributions (Table 1) are assumed for n = 2 **(A)** and n = 5 **(B)**, whereas skewed distributions are assumed in **(C,D)**, for n = 2 and n = 5, respectively. Colors correspond to different sample sizes.

**Figure 7**
Bias of MOI estimates. As in Figure 6 but for n = 10, using the haplotype frequencies estimated for antimalarial drug resistance in Kenya (see Table 2) in 2005 **(A)** and 2010 **(B)**.

**Figure 8**
Variance of MOI estimates. Shown is the variance of the mean MOI estimates ψ in % as a function of the true mean MOI (i.e., for a range of Poisson parameters). The symmetric haplotype frequency distributions are assumed for n = 2 and n = 5 in **(A,B)**, respectively, and the skewed haplotype frequency distributions in **(C,D)** (cf. Table 1). Colors correspond to different sample sizes.

**Figure 9**
Variance of MOI estimates. Shown is the variance of the mean MOI estimates ψ in % as a function of the true mean MOI (i.e., for a range of Poisson parameters). The haplotype frequency distributions for n = 10 are assumed, respectively, for the year 2005 **(A)** and 2010 **(B)** (cf. Table 2). Colors correspond to different sample sizes.

**Figure 10**
Prevalence estimates: the symmetric case. Shown is the prevalence of the haplotypes as a function of the mean MOI (i.e., for a range of Poisson parameters). The symmetric haplotype frequency distributions (cf. Table 1) for n = 2 **(A,B)** and n = 5 **(C,D)** are assumed. For both values of n, only the prevalence estimates for the first haplotype are shown, for a small (N = 50) and a large (N = 500) sample size (in the symmetric case, all haplotypes are equivalent and the prevalence looks similar). Colors correspond to different prevalence models. The solid lines show the true prevalence and the dashed lines the estimates.

**Figure 11**
Prevalence estimates: the unbalanced case. Shown is the prevalence of the predominant haplotypes as a function of the mean MOI (i.e., for a range of Poisson parameters). The skewed haplotype frequency distributions (cf. Table 1) for n = 2 **(A,B)** and n = 5 **(C,D)** are assumed. For both values of n, the prevalence estimates are shown for a small (N = 50) and a large (N = 500) sample size. Colors correspond to different prevalence models. The solid lines show the true prevalence and the dashed lines the estimates.

**Figure 12**
Prevalence estimates: the unbalanced case. Shown is the prevalence of the underrepresented haplotypes as a function of the mean MOI (i.e., for a range of Poisson parameters). The skewed haplotype frequency distributions (cf. Table 1) for n = 2 **(A,B)** and n = 5 **(C,D)** are assumed. For both values of n, only the prevalence estimates for one of the underrepresented haplotypes are shown, for a small (N = 50) and a large (N = 500) sample size (all underrepresented haplotypes are equivalent and the prevalence looks similar). Colors correspond to different prevalence models. The solid lines show the true prevalence and the dashed lines the estimates.

**Figure 13**
Prevalence estimates: the unbalanced case. Shown is the prevalence of the dominant and one underrepresented haplotype for the years 2005 **(A,B)** and 2010 **(C,D)** as a function of the mean MOI (i.e., for a range of Poisson parameters). The haplotype frequency distributions (cf. Table 2) for n = 10 are assumed. The prevalence estimates are shown for N = 50. Colors correspond to different prevalence models. The solid lines show the true prevalence and the dashed lines the estimates.
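The captions plot bias and prevalence against the mean MOI "for a range of Poisson parameters". The sketch below shows how a Poisson parameter could map to a mean MOI and to the prevalence of a haplotype, under assumptions that are ours rather than stated on this page: MOI follows a zero-truncated Poisson distribution with parameter `lam`, and each of the m haplotypes in an infection is drawn independently with frequency `p`. Under these assumptions the mean MOI is λ/(1 − e^(−λ)), and the probability that a haplotype is present in an infection is (e^λ − e^(λ(1−p)))/(e^λ − 1). The function names are hypothetical, and the paper's several prevalence models (see "different prevalence models" in the captions) are not reproduced here.

```python
import math


def mean_moi(lam):
    """Mean of a zero-truncated Poisson(lam): lam / (1 - exp(-lam))."""
    return lam / (1.0 - math.exp(-lam))


def prevalence(p, lam):
    """Probability that a haplotype with frequency p occurs in an infection.

    Assumes MOI ~ zero-truncated Poisson(lam) and independent haplotype draws,
    so P(absent) = sum_{m>=1} (1-p)^m * lam^m / (m! * (exp(lam) - 1)).
    """
    return (math.exp(lam) - math.exp(lam * (1.0 - p))) / (math.exp(lam) - 1.0)


if __name__ == "__main__":
    for lam in (0.1, 0.5, 1.0, 2.0):
        print(f"lam={lam:4.1f}  mean MOI={mean_moi(lam):.3f}  "
              f"prevalence(p=0.3)={prevalence(0.3, lam):.3f}")
```

With `lam = 1.0` the mean MOI is about 1.58. As `lam` approaches zero almost every infection carries a single haplotype, so prevalence approaches frequency; for larger `lam` prevalence exceeds frequency, which is why the two quantities have to be estimated separately.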

Cited by

Haplotype based testing for a better understanding of the selective architecture.
Chen H, Pelizzola M, Futschik A. BMC Bioinformatics. 2023 Aug 26;24(1):322. doi: 10.1186/s12859-023-05437-3. PMID: 37633901. Free PMC article.

Estimating multiplicity of infection, haplotype frequencies, and linkage disequilibria from multi-allelic markers for molecular disease surveillance.
Tsoungui Obama HCJ, Schneider KA. PLoS One. 2025 May 27;20(5):e0321723. doi: 10.1371/journal.pone.0321723. eCollection 2025. PMID: 40424286. Free PMC article.

Review of MrsFreqPhase methods: methods designed to estimate statistically malaria parasite multiplicity of infection, relatedness, frequency and phase.
Taylor AR, Neubauer Vickers E, Greenhouse B. Malar J. 2024 Oct 15;23(1):308. doi: 10.1186/s12936-024-05119-2. PMID: 39407242. Free PMC article. Review.

The many definitions of multiplicity of infection.
Schneider KA, Tsoungui Obama HCJ, Kamanga G, Kayanula L, Adil Mahmoud Yousif N. Front Epidemiol. 2022 Oct 5;2:961593. doi: 10.3389/fepid.2022.961593. eCollection 2022. PMID: 38455332. Free PMC article.

SNP-slice resolves mixed infections: simultaneously unveiling strain haplotypes and linking them to hosts.
Ju N, Liu J, He Q. Bioinformatics. 2024 Jun 3;40(6):btae344. doi: 10.1093/bioinformatics/btae344. PMID: 38885409. Free PMC article.
