Mol Biol Evol. 2025 Feb 3;42(2):msaf019.
doi: 10.1093/molbev/msaf019.

Estimating Gene Conversion Tract Length and Rate From PacBio HiFi Data

Anders Poulsen Charmouh et al. Mol Biol Evol. 2025.

Abstract

Gene conversions are broadly defined as the transfer of genetic material from a "donor" to an "acceptor" sequence and can happen both in meiosis and mitosis. They are a subset of noncrossover (NCO) events and, like crossover (CO) events, gene conversion can generate new combinations of alleles and counteract mutation load by reverting germline mutations through GC-biased gene conversion. Estimating gene conversion rate and the distribution of gene conversion tract lengths remains challenging. We present a new method for estimating tract length, rate, and detection probability of NCO events directly in HiFi PacBio long read data. The method can be used to make inference from sequencing of gametes from a single individual. The method is unbiased even under low single nucleotide variant (SNV) densities and does not necessitate any demographic or evolutionary assumptions. We test the accuracy and robustness of our method using simulated datasets where we vary length of tracts, number of tracts, the genomic SNV density, and levels of correlation between SNV density and NCO event position. Our simulations show that under low SNV densities, like those found in humans, only a minute fraction (∼2%) of NCO events are expected to become visible as gene conversions by moving at least 1 SNV. We finally illustrate our method by applying it to PacBio sequencing data from human sperm.
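The claim that only a small fraction of NCO events move at least 1 SNV can be illustrated with a minimal sketch. It is not the authors' simulation: it assumes geometrically distributed tract lengths and an independent, uniform per-site SNV probability, and the function names are hypothetical.

```python
import math
import random


def detection_probability(mean_tract_len, snv_density):
    """P(a tract covers >= 1 SNV) under two ASSUMPTIONS of this sketch:
    tract length is geometric on {1, 2, ...} with the given mean (> 1), and
    every site is an SNV independently with probability snv_density."""
    p = 1.0 / mean_tract_len
    q = 1.0 - snv_density
    # E[q**L] for geometric L: its probability generating function at q.
    prob_no_snv = p * q / (1.0 - (1.0 - p) * q)
    return 1.0 - prob_no_snv


def simulate_detection(mean_tract_len, snv_density, n_events=200_000, seed=1):
    """Monte Carlo check of the closed form above."""
    rng = random.Random(seed)
    log_keep = math.log(1.0 - 1.0 / mean_tract_len)
    detected = 0
    for _ in range(n_events):
        # Geometric tract length drawn by inversion.
        length = int(math.log(1.0 - rng.random()) / log_keep) + 1
        # The tract is visible if at least one of its sites is an SNV.
        if rng.random() < 1.0 - (1.0 - snv_density) ** length:
            detected += 1
    return detected / n_events


for density in (0.00083 / 2, 0.00083, 0.00083 * 5):
    print(density, detection_probability(46, density),
          simulate_detection(46, density))
```

With a 46 bp mean tract and 0.00083/2 SNVs/bp, the closed form gives a detection probability of roughly 0.019, the same order as the ∼2% quoted above.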

Keywords: gene conversion; genome evolution; genomics methods; noncrossover; recombination.

Conflict of interest statement

Conflict of Interest: None

Figures

Fig. 1.
A method for estimating NCO tract length, rate, and detection probability. a) The data. HiFi long reads were screened by Porsborg et al. (2024) for gene conversion, and read data are summarized via the counts of the number of reads where 1, 2, …, n SNVs are converted (conversion counts). b) The simulations. Using the SNV distribution along the sequenced sample genome, simulations are conducted with varying mean tract length, and the probabilities of converting 1, 2, …, n SNVs (terms in Equation (5)) are estimated contingent on the SNV density and tract length distribution (conversion probabilities). c) The estimates. Using the conversion probabilities, which take the nonuniform SNV distribution into account, the mean tract length that maximizes the likelihood of the data (the conversion counts) is estimated using Equation (8).
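The three steps in this caption can be sketched as a small grid search. This is a rough illustration under stated assumptions, not the published implementation: Equation (8) is not reproduced on this page, so a multinomial log likelihood over conversion counts (conditional on at least 1 converted SNV) stands in for it, tract lengths are assumed geometric, tract starts are assumed uniform, and all function names and the toy genome are hypothetical.

```python
import bisect
import math
import random


def draw_snv_counts(snv_positions, genome_length, mean_tract_len, n_tracts, rng):
    """Drop tracts on the genome and count the SNVs each one covers.
    ASSUMPTIONS: uniform start position, geometric length with mean > 1."""
    log_keep = math.log(1.0 - 1.0 / mean_tract_len)
    counts = []
    for _ in range(n_tracts):
        start = rng.randrange(genome_length)
        length = int(math.log(1.0 - rng.random()) / log_keep) + 1
        lo = bisect.bisect_left(snv_positions, start)
        hi = bisect.bisect_left(snv_positions, start + length)
        counts.append(hi - lo)
    return counts


def tally_conversions(snv_counts, max_k):
    """Conversion counts for k = 1..max_k SNVs; the last bin pools >= max_k."""
    tally = [0] * max_k
    for c in snv_counts:
        if c >= 1:  # tracts covering no SNV are invisible in the data
            tally[min(c, max_k) - 1] += 1
    return tally


def conversion_probs(snv_counts, max_k):
    """Simulated P(k SNVs converted | at least 1), with a pseudocount of 1
    per bin so that every logarithm stays finite."""
    tally = [t + 1.0 for t in tally_conversions(snv_counts, max_k)]
    total = sum(tally)
    return [t / total for t in tally]


def log_likelihood(observed, probs):
    """Multinomial log likelihood up to an additive constant."""
    return sum(n * math.log(pk) for n, pk in zip(observed, probs))


def estimate_mean_tract_length(observed, snv_positions, genome_length, grid,
                               n_sims=20_000, seed=0):
    """Return (mean tract length, log likelihood) maximizing over the grid."""
    rng = random.Random(seed)
    best = None
    for mean_len in grid:
        sims = draw_snv_counts(snv_positions, genome_length, mean_len, n_sims, rng)
        ll = log_likelihood(observed, conversion_probs(sims, len(observed)))
        if best is None or ll > best[1]:
            best = (mean_len, ll)
    return best


# Toy usage on a synthetic genome (not the sperm data).
rng = random.Random(42)
genome_length = 200_000
snvs = sorted(rng.sample(range(genome_length), 2000))
truth = draw_snv_counts(snvs, genome_length, 100, 20_000, rng)
observed = tally_conversions(truth, 5)
print(estimate_mean_tract_length(observed, snvs, genome_length,
                                 grid=[25, 50, 100, 200, 400]))
```

Counting SNVs with `bisect` on the sorted SNV positions is what lets the simulated conversion probabilities reflect a nonuniform SNV distribution, since each tract is scored against the actual positions rather than an average density.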
Fig. 2.
Analytical predictions of the ratio model (full lines) compared to simulation results (points denote means of 25 replicates) with 95% confidence intervals under different SNV densities: approximately half the human SNV density, used as a proxy for the typical number of heterozygous sites (0.00083/2 SNVs/bp, e.g. Zhao et al. 2003), the human SNV density (0.00083 SNVs/bp), and 5 times the human SNV density (0.00083 × 5 SNVs/bp). The results show that in the idealized case where all positions in the genome have some probability of being an SNV, using the ratio of single to multi-SNV conversions can yield unbiased estimates despite very low SNV densities, such as those observed in populations of humans. a) Equation (5): Ratio of single to multi-SNV conversions as a function of mean tract length for 3 different SNV densities. b) Detection probability as 1-S(p,s) (see Equation (6)) of all gene conversion events as a function of mean tract length for 3 different SNV densities.
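To show why the ratio of single to multi-SNV conversions carries information about mean tract length, here is a hedged sketch of an idealized ratio model. The formulas are derived under this sketch's own assumptions (geometric tract lengths, independent per-site SNV probability) and are not claimed to match Equation (5) or Equation (6) term by term; the function names are hypothetical.

```python
def conversion_class_probs(mean_tract_len, snv_density):
    """Return (P(no SNV), P(exactly 1 SNV), P(>= 2 SNVs)) for one tract.
    ASSUMPTIONS: geometric tract length on {1, 2, ...} with mean > 1, and
    each site is an SNV independently with probability snv_density."""
    p = 1.0 / mean_tract_len
    q = 1.0 - snv_density
    denom = 1.0 - (1.0 - p) * q
    p_none = p * q / denom                  # E[q**L]
    p_single = snv_density * p / denom**2   # E[L * d * q**(L - 1)]
    p_multi = 1.0 - p_none - p_single
    return p_none, p_single, p_multi


def single_to_multi_ratio(mean_tract_len, snv_density):
    """Ratio of single-SNV to multi-SNV conversions."""
    _, p_single, p_multi = conversion_class_probs(mean_tract_len, snv_density)
    return p_single / p_multi


def mean_length_from_ratio(ratio, snv_density):
    """Invert the ratio. Under the assumptions above it simplifies to
    1 / (snv_density * (mean_tract_len - 1))."""
    return 1.0 + 1.0 / (snv_density * ratio)


for density in (0.00083 / 2, 0.00083, 0.00083 * 5):
    r = single_to_multi_ratio(46, density)
    print(density, r, mean_length_from_ratio(r, density))
```

In this idealized setting the ratio falls monotonically as mean tract length grows, so an observed ratio maps back to a unique mean tract length even when very few tracts are detected at all.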
Fig. 3.
MLEs of mean gene conversion tract length, rate, and detection probability inferred from gene conversion events called directly from HiFi PacBio data of a sperm sample, as obtained by Porsborg et al. (2024). a) Likelihood profile for mean tract length. Each point shows the log likelihood of the data (counts of single, double, triple, …, n-tuple gene conversion events for the sampled individual) conditional on the SNV distribution and density of the individual, i.e. Equation (8). The dotted vertical line represents the MLE and the vertical dashed lines show the 95% confidence interval. The results suggest that human gene conversion tracts are typically quite short (mean tract length of 46 bp). b) MLE of NCO detection probability (probability that an NCO becomes a gene conversion) and total NCO rate (including gene conversions). Bars denote the 95% confidence interval. The results indicate that most NCO events fail to convert at least 1 SNV, meaning that they are not observable as gene conversions. Each MLE is based on 10^3 simulations using the SNV distribution along the genome in the sample data.
Fig. 4.
Comparison of MLEs under strong positive correlation between NCO position and SNV density, strong negative correlation between NCO position and SNV density, and no correlation between NCO position and SNV density (see legend). a) The MLE of tract length changes by −12 bp under strong positive correlation and by +2 bp under strong negative correlation, suggesting the model is robust to correlation between NCO events and SNV density and to overall heterogeneity of NCO positions along the genome. b) Strong positive correlation between SNVs and NCO events can result in underestimation of the NCO rate, whereas strong negative correlation can result in overestimation. c) Strong positive correlation between SNVs and NCOs can result in overestimation of the detection probability, whereas strong negative correlation can result in underestimation. This is especially the case when tracts are long.
Fig. 5.
Accuracy of inference under a), c), and e) the idealized model (uniform SNV distribution) and b), d), and f) the maximum likelihood model (used to perform inference on genomic data). Accuracy of inference is shown under different sample sizes (i.e. different numbers of identified tracts): 182 tracts, as used in this study (a and b), 100 tracts (c and d), and 50 tracts (e and f). The dashed line shows perfect accuracy, i.e. (estimated tract length/true tract length) = 1. Each boxplot contains 100 replicates (see Methods for details).
