Nat Genet. 2025 Feb;57(2):390-401.
doi: 10.1038/s41588-024-02051-8. Epub 2025 Jan 8.

Structural polymorphism and diversity of human segmental duplications

Hyeonsoo Jeong et al. Nat Genet. 2025 Feb.

Abstract

Segmental duplications (SDs) contribute significantly to human disease, evolution and diversity but have been difficult to resolve at the sequence level. We present a population genetics survey of SDs by analyzing 170 human genome assemblies (from 85 samples representing 38 Africans and 47 non-Africans) in which the majority of autosomal SDs are fully resolved using long-read sequence assembly. Excluding the acrocentric short arms and sex chromosomes, we identify 173.2 Mb of duplicated sequence (47.4 Mb not present in the telomere-to-telomere reference) distinguishing fixed from structurally polymorphic events. We find that intrachromosomal SDs are among the most variable, with rare events mapping near their progenitor sequences. African genomes harbor significantly more intrachromosomal SDs and are more likely to have recently duplicated gene families with higher copy numbers than non-African samples. Comparison to a resource of 563 million full-length isoform sequencing reads identifies 201 novel, potentially protein-coding genes corresponding to these copy number polymorphic SDs.

Conflict of interest statement

Competing interests: E.E.E. is a scientific advisory board member of Variant Bio. C.L. is a scientific advisory board member of Nabsys and Genome Insight. The other authors declare no competing interests.

Figures

Fig. 1. Pangenome representation of human SDs.
Haplotype frequency distribution of intrachromosomal SD content from HPRC and HGSVC haplotype genome assemblies (n = 170). SDs are colored by haplotype frequency. SD content on the p-arms of acrocentric chromosomes (chr13, chr14, chr15, chr21 and chr22) was excluded because of assembly errors and potential chromosomal misassignment compared to other autosomal chromosomes. The known SDs of T2T-CHM13 are shown in black next to the ideograms on each chromosome.
Fig. 2. Cumulative sum of SDs by frequency.
Bar plot displaying the cumulative sum of SD content by adding genomes (from left to right) for intrachromosomal and interchromosomal SDs. Four SD frequency categories are considered: ‘fixed’ are SDs present in all 170 human genome assemblies (that is, conserved in all samples); ‘polymorphic (known)’ are SDs in the reference genome (T2T-CHM13) that are not fixed; ‘polymorphic (novel)’ refers to SDs observed in two or more HPRC or HGSVC assemblies yet not present in T2T-CHM13; and ‘private’ is an SD found in one sample. Samples are grouped by non-African (non-AFR) and then African (AFR) genetic ancestry owing to the expected increased diversity among the latter.
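The four frequency categories in the Fig. 2 caption can be read as a simple decision rule. The sketch below is illustrative only: the function name `classify_sd` and its parameters are hypothetical, and it assumes frequency is counted per haplotype assembly, which treats "found in one sample" as "found in one assembly".

```python
def classify_sd(haplotype_count, in_reference, total_haplotypes=170):
    """Assign an SD to one of the Fig. 2 frequency categories.

    haplotype_count: number of assemblies carrying the SD (assumed unit).
    in_reference: True if the SD is also present in T2T-CHM13.
    """
    if not 1 <= haplotype_count <= total_haplotypes:
        raise ValueError("haplotype_count must be between 1 and total_haplotypes")
    if haplotype_count == total_haplotypes:
        return "fixed"                  # present in every assembly
    if in_reference:
        return "polymorphic (known)"    # in the reference but not fixed
    if haplotype_count >= 2:
        return "polymorphic (novel)"    # two or more assemblies, absent from reference
    return "private"                    # a single assembly (simplifying assumption)


print(classify_sd(170, True))   # fixed
print(classify_sd(34, True))    # polymorphic (known)
print(classify_sd(8, False))    # polymorphic (novel)
print(classify_sd(1, False))    # private
```

The order of the checks matters: an SD present in all assemblies is reported as fixed before its reference status is considered.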
Fig. 3. Sequence properties of polymorphic versus rare SDs.
a, Histogram comparing the sequence identity and length of rare and common SDs (see Supplementary Fig. 1 for polymorphic SDs with more subclassified haplotype frequencies). b, Orientation and pairwise dispersion of polymorphic and singleton SDs. Each data point represents a haplotype assembly (n = 170) and its counts of clustered, interspersed (>1 Mb apart) and distant (>50 Mb apart) SDs. Left and right panels summarize the SDs in direct or inverted orientation, and the top and bottom panels contrast polymorphic versus singleton SDs. The box plot ranges represent the interquartile range (first and third quartile), the horizontal line in each box indicates the median and whiskers indicate data points within 1.5× the interquartile range. In each panel, a two-tailed Wilcoxon rank-sum test was performed between clustered SDs versus the interspersed or distant SDs; ns, not significant; **P < 0.01; ****P < 0.0001.
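As a rough illustration of the spacing classes named in the Fig. 3 caption, the sketch below bins an intrachromosomal SD pair by the distance between its two copies. The function name `spacing_category` is hypothetical. The caption gives only the >1 Mb and >50 Mb thresholds, so treating anything at or below 1 Mb as clustered, and treating distant as separate from interspersed, are assumptions.

```python
MB = 1_000_000  # base pairs per megabase


def spacing_category(distance_bp):
    """Bin the distance between two SD copies on the same chromosome."""
    if distance_bp < 0:
        raise ValueError("distance_bp must be non-negative")
    if distance_bp > 50 * MB:
        return "distant"        # more than 50 Mb apart
    if distance_bp > 1 * MB:
        return "interspersed"   # more than 1 Mb apart (assumed to stop at 50 Mb)
    return "clustered"          # assumed: 1 Mb or less


print(spacing_category(200_000))     # clustered
print(spacing_category(5 * MB))      # interspersed
print(spacing_category(80 * MB))     # distant
```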
Fig. 4. Examples of clustered and interspersed (>1 Mb apart) SDs associated with genes.
In each plot, the top represents the T2T-CHM13 genome aligned to the bottom, new genome assemblies. a, Clustered duplication with inverted orientation (65.8 kb; with allele frequency [AF] = 1) found in chr5. b–d, Clustered and tandem duplications (12.6, 10.3 and 42.3 kb; AFs of 1, 2 and 1, respectively) in chr9, chr13 and chr1. e,f, Interspersed duplications of chr12 (98.9 and 2.5 kb; AFs of 34 and 8) showing duplicated regions in left and right panels. The gene track of the T2T-CHM13 genome assembly is shown at the top, followed by SDs predicted by SEDEF; the respective direction is indicated by blue arrowheads. The DupMasker track shows the duplicon structure.
Fig. 5. Variable copy number of duplicated genes.
a,b, Gene families with highly variable (a) and nearly fixed (b) copy numbers are displayed. Gene families are selected and ordered by dispersion index, requiring an average diploid copy number greater than three. Read-depth copy number was estimated with fastCN, using Illumina reads for each sample and comparing it to the T2T-CHM13 genome. c, Estimated copy number of GOLGA6/8 paralogs in each assembled haplotype, based on assembly alignments (white, 0; black, 1; blue, 2). The continental population groups for each haplotype are indicated by color above each column (Africa, gold; East Asia, green; South Asia, purple; Europe, blue; the Americas, red). ASD, autism spectrum disorder; DD, developmental delay; ID, intellectual disability; SCZ, schizophrenia.
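The Fig. 5 caption does not define the dispersion index. A common definition is the variance-to-mean ratio, and the sketch below assumes that definition to show how gene families could be filtered (mean diploid copy number greater than three) and ordered. The family names and copy numbers are invented for illustration, and `dispersion_index` and `rank_families` are hypothetical helpers, not functions from the paper or from fastCN.

```python
from statistics import mean, pvariance


def dispersion_index(copy_numbers):
    """Variance-to-mean ratio of per-sample copy numbers (assumed definition)."""
    m = mean(copy_numbers)
    if m == 0:
        raise ValueError("mean copy number is zero")
    return pvariance(copy_numbers) / m


def rank_families(cn_by_family, min_mean=3):
    """Keep families whose mean copy number exceeds min_mean, most variable first."""
    kept = {
        name: dispersion_index(values)
        for name, values in cn_by_family.items()
        if mean(values) > min_mean
    }
    return sorted(kept, key=kept.get, reverse=True)


# Invented diploid copy numbers for three made-up gene families.
example = {
    "FAM_A": [4, 8, 12, 6],   # highly variable
    "FAM_B": [4, 4, 4, 5],    # nearly fixed
    "FAM_C": [2, 2, 2, 2],    # excluded: mean copy number is not above three
}
print(rank_families(example))  # ['FAM_A', 'FAM_B']
```

Population variance is used here for simplicity; sample variance would change the values slightly but not the ordering in this example.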
Fig. 6. African versus non-African SD copy number variation.
a, Proportion of intrachromosomal SD content between African and non-African populations. African genomes (n = 76) have a higher SD content than non-African genomes (n = 94), and the difference is significant for intrachromosomal SDs. The box plot ranges represent the interquartile range (first and third quartile), the horizontal line in each box is the median and the whiskers indicate the data points within 1.5× the interquartile range. In each panel, a two-tailed Wilcoxon rank-sum test was performed. b, Gene family copy number (CN) variation between populations. Darker colors represent higher counts of each copy number, normalized per gene. Gene families with significant copy number differences between African and non-African populations are shown (Mann–Whitney U-test, Benjamini–Hochberg adjusted P < 0.05), excluding GUSPB3, which did not replicate in the larger cohort. Gene copy number was estimated from the assemblies by whole-genome alignment; 13 out of 16 gene families average higher copy number in individuals of African ancestry (binomial test, P = 0.01). c, Gene copy number evaluated by Illumina read depth. The 22 gene families with the largest distribution shift are shown.
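The binomial test quoted in the Fig. 6 caption (13 of 16 gene families, P = 0.01) can be checked by hand. The caption does not say whether the test was one-sided or what null probability was used, so the sketch below assumes a one-sided test against a null of 0.5. Under that assumption the upper-tail probability is 697/65536, about 0.0106, which rounds to the reported value; a two-sided test would give roughly twice that. The function name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 13 of 16 families with higher average copy number in African-ancestry samples.
p_one_sided = binomial_upper_tail(13, 16)
print(round(p_one_sided, 4))  # 0.0106
```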
Fig. 7. Discovery of novel genes and transcripts in rare and polymorphic SD regions.
a, 2D histogram display of copy number polymorphic gene families for which FLNC reads generated from Iso-Seq map better to the pangenome than to the T2T-CHM13 human genome reference. Darker colors represent higher counts of each copy number, normalized per gene. b–e, Selected haplotypes containing novel gene predictions for LRRC37A (b), MUC20 (c), NBPF1 (d) and CTAGE (e) compared to the T2T-CHM13 reference where there is FLNC transcript support. Alignment color indicates percent identity. f, Comparison of T2T-CHM13 (top) and HG002 maternal haplotype (bottom) depicts a 48 kb polymorphic SD region present in 66 out of 170 haplotypes. Non-human apes all carry a copy of the duplicated sequence. The predicted ZNF recognition site is shown (inset). g, Comparison of the novel ZNF to its best human match (ZNF98, 68% identity) and the most similar existing primate annotation (low-quality protein ZNF724-like in gorilla, 95% identity). ProSite-predicted KRAB-ZFP is shown above the sequence.

Cited by

  • Complex genetic variation in nearly complete human genomes.
    Logsdon GA, Ebert P, Audano PA, Loftus M, Porubsky D, Ebler J, Yilmaz F, Hallast P, Prodanov T, Yoo D, Paisie CA, Harvey WT, Zhao X, Martino GV, Henglin M, Munson KM, Rabbani K, Chin CS, Gu B, Ashraf H, Austine-Orimoloye O, Balachandran P, Bonder MJ, Cheng H, Chong Z, Crabtree J, Gerstein M, Guethlein LA, Hasenfeld P, Hickey G, Hoekzema K, Hunt SE, Jensen M, Jiang Y, Koren S, Kwon Y, Li C, Li H, Li J, Norman PJ, Oshima KK, Paten B, Phillippy AM, Pollock NR, Rausch T, Rautiainen M, Scholz S, Song Y, Söylev A, Sulovari A, Surapaneni L, Tsapalou V, Zhou W, Zhou Y, Zhu Q, Zody MC, Mills RE, Devine SE, Shi X, Talkowski ME, Chaisson MJP, Dilthey AT, Konkel MK, Korbel JO, Lee C, Beck CR, Eichler EE, Marschall T. Logsdon GA, et al. bioRxiv [Preprint]. 2024 Sep 25:2024.09.24.614721. doi: 10.1101/2024.09.24.614721. bioRxiv. 2024. Update in: Nature. 2025 Aug;644(8076):430-441. doi: 10.1038/s41586-025-09140-6. PMID: 39372794 Free PMC article. Updated. Preprint.
  • A global map for introgressed structural variation and selection in humans.
    Hsieh P, Soisangwan N, Gordon DS, Javidh A, Harvey WT, Porubsky D, Hoekzema K, Baker C, Munson KM, Kinipi C, Leavesley M, Brucato N, Cox MP, Ricaut FX, Romero IG, Eichler EE. Hsieh P, et al. bioRxiv [Preprint]. 2025 Jun 24:2025.06.24.661368. doi: 10.1101/2025.06.24.661368. bioRxiv. 2025. PMID: 40667000 Free PMC article. Preprint.
  • Complex genetic variation in nearly complete human genomes.
    Logsdon GA, Ebert P, Audano PA, Loftus M, Porubsky D, Ebler J, Yilmaz F, Hallast P, Prodanov T, Yoo D, Paisie CA, Harvey WT, Zhao X, Martino GV, Henglin M, Munson KM, Rabbani K, Chin CS, Gu B, Ashraf H, Scholz S, Austine-Orimoloye O, Balachandran P, Bonder MJ, Cheng H, Chong Z, Crabtree J, Gerstein M, Guethlein LA, Hasenfeld P, Hickey G, Hoekzema K, Hunt SE, Jensen M, Jiang Y, Koren S, Kwon Y, Li C, Li H, Li J, Norman PJ, Oshima KK, Paten B, Phillippy AM, Pollock NR, Rausch T, Rautiainen M, Song Y, Söylev A, Sulovari A, Surapaneni L, Tsapalou V, Zhou W, Zhou Y, Zhu Q, Zody MC, Mills RE, Devine SE, Shi X, Talkowski ME, Chaisson MJP, Dilthey AT, Konkel MK, Korbel JO, Lee C, Beck CR, Eichler EE, Marschall T. Logsdon GA, et al. Nature. 2025 Aug;644(8076):430-441. doi: 10.1038/s41586-025-09140-6. Epub 2025 Jul 23. Nature. 2025. PMID: 40702183 Free PMC article.
  • Segmental duplication-mediated rearrangements alter the landscape of mouse genomes.
    Francoeur ER, Audano PA, Ferraj A, Balachandran P, Beck CR. Francoeur ER, et al. bioRxiv [Preprint]. 2025 Jul 22:2025.07.18.665526. doi: 10.1101/2025.07.18.665526. bioRxiv. 2025. PMID: 40777336 Free PMC article. Preprint.
  • Chromosomal quality control in hPSCs: A practical guide to SNP array analysis with GenomeStudio.
    Haake J, Steenpass L. Haake J, et al. Front Cell Dev Biol. 2025 Jul 1;13:1599923. doi: 10.3389/fcell.2025.1599923. eCollection 2025. Front Cell Dev Biol. 2025. PMID: 40666289 Free PMC article.
