[Preprint]. 2024 Jul 8:2023.07.03.23292162.
doi: 10.1101/2023.07.03.23292162.

Increased frequency of repeat expansion mutations across different populations

Kristina Ibañez et al. medRxiv.

Update in

  • Increased frequency of repeat expansion mutations across different populations.
    Ibañez K, Jadhav B, Zanovello M, Gagliardi D, Clarkson C, Facchini S, Garg P, Martin-Trujillo A, Gies SJ, Galassi Deforie V, Dalmia A, Hensman Moss DJ, Vandrovcova J, Rocca C, Moutsianas L, Marini-Bettolo C, Walker H, Turner C, Shoai M, Long JD, Fratta P, Langbehn DR, Tabrizi SJ, Caulfield MJ, Cortese A, Escott-Price V, Hardy J, Houlden H, Sharp AJ, Tucci A. Nat Med. 2024 Nov;30(11):3357-3368. doi: 10.1038/s41591-024-03190-5. Epub 2024 Oct 1. PMID: 39354197. Free PMC article.

Abstract

Repeat expansion disorders (REDs) are a devastating group of predominantly neurological diseases. Together they are common, affecting 1 in 3,000 people worldwide with population-specific differences. However, prevalence estimates of REDs are hampered by heterogeneous clinical presentation, variable geographic distributions, and technological limitations leading to under-ascertainment. Here, leveraging whole genome sequencing data from 82,176 individuals from different populations, we found an overall disease allele frequency of REDs of 1 in 283 individuals. Modelling disease prevalence using genetic data, age at onset and survival, we show that the expected number of people with REDs would be two to three times higher than currently reported figures, indicating under-diagnosis and/or incomplete penetrance. While some REDs are population-specific, e.g. Huntington disease-like 2 in Africans, most REDs are represented in all broad genetic ancestries (i.e. Europeans, Africans, Americans, East Asians, and South Asians), challenging the notion that some REDs are found only in specific populations. These results have worldwide implications for local and global health communities in the diagnosis and counselling of REDs.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1.
A) List of RED loci included in the study, with repeat-size thresholds for reduced penetrance and full mutations. B) Technical flowchart. Whole genome sequences (WGS) from the 100K GP and TOPMed datasets were first selected by excluding those associated with neurological diseases. WGS data from the 1K GP3 were also selected on the basis of having the same technical specifications (see Methods). After ancestry inference, repeat sizes for all 22 REDs were computed using ExpansionHunter (EH) v3.2.2. For 16 REDs, overall carrier frequency, disease modelling, and correlation distribution of long normal alleles were computed in the 100K GP and TOPMed projects. Separately, the distribution of repeat sizes across different populations was analysed in the combined 100K GP and TOPMed cohorts and in the 1K GP3 cohort.
Figure 2.
Forest plot of the overall disease allele carrier frequency in the combined 100K GP and TOPMed datasets (N = 82,176; N may vary slightly between loci due to data quality and filtering, see Table S7). The squares show the estimated disease allele carrier frequency, and the bars show the 95% CI values. Details of the statistical models are described in the Methods section. Grey and black boxes show premutation/reduced penetrance and full mutation allele carrier frequencies for each dominant locus, respectively. Grey and black boxes show mono- and bi-allelic carrier frequencies for recessive loci (RFC1 and FXN), respectively.
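The caption above reports a point estimate and a 95% CI for each carrier frequency but defers the statistical model to the Methods, which are not reproduced here. As a minimal sketch only, one common way to obtain such an interval for a rare binomial proportion is the Wilson score interval. The choice of the Wilson method, the function name `wilson_interval`, and the carrier count used in the example are all assumptions for illustration, not values or methods taken from the study.

```python
import math


def wilson_interval(carriers: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (point estimate, lower, upper) for a binomial proportion.

    Uses the Wilson score interval, which behaves better than the plain
    normal approximation when the proportion is close to zero.
    """
    if n <= 0 or not 0 <= carriers <= n:
        raise ValueError("need n > 0 and 0 <= carriers <= n")
    p = carriers / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical example: 30 carriers among 82,176 genomes (the count 30 is invented).
p, lower, upper = wilson_interval(30, 82_176)
print(f"carrier frequency: 1 in {1 / p:,.0f}")
print(f"95% CI: 1 in {1 / upper:,.0f} to 1 in {1 / lower:,.0f}")
```

Reporting the bounds as "1 in N" mirrors how the abstract states frequencies; note that the upper bound of the proportion becomes the smaller "1 in N" figure.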
Figure 3.
Flowchart showing the modelling of disease prevalence by age for C9orf72-ALS, C9orf72-FTD, HD in 40 CAG repeat carriers, SCA2, DM1, SCA1, and SCA6. UK population count by age is multiplied by the combined disease allele frequency of each genetic defect and the age of onset distribution of each corresponding disease, and corrected for median survival. Penetrance is also taken into account for C9orf72-ALS and C9orf72-FTD. Estimated number of people affected by REDs (dark blue area) compared to the reported prevalence from the literature (light blue area). Age bins are 5 years each. For C9orf72-FTD, given the wide range of the reported disease prevalence, both lower and upper limits are plotted in light blue.
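The modelling steps in the caption above (population count by age, multiplied by allele frequency and the age of onset distribution, corrected for median survival and, where relevant, penetrance) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' model: the age of onset is assumed to be normally distributed, the survival correction is assumed to mean that a carrier counts as affected between onset and onset plus median survival, and every input number is hypothetical.

```python
import math


def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


def expected_affected(
    pop_by_age: dict[float, float],
    carrier_freq: float,
    onset_mean: float,
    onset_sd: float,
    median_survival: float,
    penetrance: float = 1.0,
) -> dict[float, float]:
    """Expected number of living affected people per age bin.

    For each bin, the fraction of carriers who are currently affected is
    P(onset <= age) - P(onset <= age - median_survival): those whose disease
    has started, minus those assumed to have died after the median survival.
    """
    affected = {}
    for age, count in pop_by_age.items():
        started = normal_cdf(age, onset_mean, onset_sd)
        died = normal_cdf(age - median_survival, onset_mean, onset_sd)
        affected[age] = count * carrier_freq * penetrance * (started - died)
    return affected


# Hypothetical 5-year bins (keyed by bin midpoint) and hypothetical parameters.
population = {42.5: 4_200_000, 47.5: 4_300_000, 52.5: 4_500_000, 57.5: 4_400_000}
estimate = expected_affected(
    population,
    carrier_freq=1 / 5000,   # invented value
    onset_mean=50.0,         # invented value
    onset_sd=10.0,           # invented value
    median_survival=15.0,    # invented value
    penetrance=0.9,          # invented value
)
for age, n in estimate.items():
    print(f"age bin centred on {age}: about {n:,.0f} affected")
```

Summing the per-bin values gives a total that can be set against a reported prevalence figure, which is the comparison the dark blue and light blue areas in the figure make.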
Figure 4.
Pathogenic RED frequencies in different populations (African = 12,786; American = 5,674; East Asian = 1,266; European = 59,568; South Asian = 2,882). A) Forest plot of pathogenic allele carrier frequency divided by population. Pathogenic alleles are defined as those larger than the premutation cut-off (Table 1). Data are presented as squares showing the estimated pathogenic allele carrier frequency, and bars showing the 95% CI values. B) Bar chart showing the proportion of pathogenic allele carrier frequency repeats by ancestry. Both plots were generated by combining data from 100K GP and TOPMed, a total of N = 82,176 unrelated genomes (N individuals may vary slightly between loci due to data quality and filtering, see Table S17 and Table S18).
Figure 5.
Distribution of repeat lengths in different populations. A) Half-violin plots showing the distribution of alleles in different populations (African = 12,786; American = 5,674; East Asian = 1,266; European = 59,568; South Asian = 2,882) for 10 loci (Methods) from the combined 100K GP and TOPMed cohorts. Box plots highlight the interquartile range and median, and black dots show values outside 1.5 times the interquartile range. Red dots mark the 99.9th percentile for each population and locus. Vertical bars indicate the intermediate and pathogenic allele thresholds (Table S20). B) Scatter plot showing the frequency of intermediate allele carriers (x-axis) against the frequency of pathogenic allele carriers (y-axis). Data points are divided by population (n=5) and gene (n=10), and size represents the total number of intermediate alleles. Correlations were computed using the Spearman method, with two-tailed p-values.
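Panel B above relates intermediate allele carrier frequency to pathogenic allele carrier frequency with a Spearman correlation. A minimal sketch of that statistic, computed as the Pearson correlation of average ranks, is given below. The helper names and the example frequencies are hypothetical, and the two-tailed p-value mentioned in the caption is deliberately left out of this sketch.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (sd_x * sd_y)


# Hypothetical carrier frequencies for five population/gene data points.
intermediate = [0.010, 0.004, 0.020, 0.001, 0.015]
pathogenic = [0.0005, 0.0002, 0.0011, 0.0001, 0.0004]
print(f"Spearman rho = {spearman_rho(intermediate, pathogenic):.3f}")
```

Working on ranks rather than raw frequencies makes the statistic insensitive to the very different scales of the two axes, which suits data where a few loci dominate in absolute frequency.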
Figure 6.
HTT repeat structures show varied prevalence across genetic ancestries and are associated with CAG repeat size. A) Allele structures observed within exon 1 of HTT. The CAG repeat is denoted as “Q1” and marked in gold. The CAACAG unit is referred to as “Q2” and is marked in green. The first proline-encoding “CCGCCA” repeat element is referred to as “P1” and is marked in purple. B) The prevalence of the allele structures is plotted across the studied genetic ancestries in bar plots on the x-axis. The ancestries are defined on the y-axis. The number of alleles in each of the genetic ancestries is denoted as “N=...” at each of the y-axis ticks. C) Boxplots display the distribution of CAG repeat sizes across different repeat structures. Box plots highlight the median (horizontal lines in the centre of each boxplot) and interquartile range (bounds), and black dots show values outside 1.5 times the interquartile range. The repeat structures are separated on the x-axis and the repeat size is shown on the y-axis. The number of alleles with different repeat structures is denoted as “N=...” on the x-axis. A linear model was used to compare the repeat size distribution of the canonical alleles versus that of all atypical structures. Pairwise comparisons used Kruskal-Wallis tests with Dunn’s correction for multiple comparisons; the resulting p-values are displayed above each structure (*** < 0.001; * < 0.05). Q2 versus canonical (p-value = 6.4×10^−32), Q2 versus partialQ2 loss (p-value = 3.5×10^−2), Q2 duplication versus P1 loss (p-value = 5.9×10^−98), Q2 duplication versus Q2 loss (p-value = 8.5×10^−16), Q2 duplication versus Q2-P1 loss (p-value = 6.2×10^−20), canonical versus P1 loss (p-value = 2.4×10^−80), canonical versus Q2 loss (p-value = 2.8×10^−8), canonical versus Q2-P1 loss (p-value = 1.2×10^−12), P1 loss versus Q2 loss (p-value = 2.8×10^−2), P1 loss versus Q2-P1 loss (p-value = 5.6×10^−6).
