Nature. 2025 Aug;644(8076):430-441.
doi: 10.1038/s41586-025-09140-6. Epub 2025 Jul 23.

Complex genetic variation in nearly complete human genomes

Glennis A Logsdon #, Peter Ebert #, Peter A Audano #, Mark Loftus #, David Porubsky, Jana Ebler, Feyza Yilmaz, Pille Hallast, Timofey Prodanov, DongAhn Yoo, Carolyn A Paisie, William T Harvey, Xuefang Zhao, Gianni V Martino, Mir Henglin, Katherine M Munson, Keon Rabbani, Chen-Shan Chin, Bida Gu, Hufsah Ashraf, Stephan Scholz, Olanrewaju Austine-Orimoloye, Parithi Balachandran, Marc Jan Bonder, Haoyu Cheng, Zechen Chong, Jonathan Crabtree, Mark Gerstein, Lisbeth A Guethlein, Patrick Hasenfeld, Glenn Hickey, Kendra Hoekzema, Sarah E Hunt, Matthew Jensen, Yunzhe Jiang, Sergey Koren, Youngjun Kwon, Chong Li, Heng Li, Jiaqi Li, Paul J Norman, Keisuke K Oshima, Benedict Paten, Adam M Phillippy, Nicholas R Pollock, Tobias Rausch, Mikko Rautiainen, Yuwei Song, Arda Söylev, Arvis Sulovari, Likhitha Surapaneni, Vasiliki Tsapalou, Weichen Zhou, Ying Zhou, Qihui Zhu, Michael C Zody, Ryan E Mills, Scott E Devine, Xinghua Shi, Michael E Talkowski, Mark J P Chaisson, Alexander T Dilthey, Miriam K Konkel, Jan O Korbel, Charles Lee, Christine R Beck, Evan E Eichler, Tobias Marschall

Glennis A Logsdon et al. Nature. 2025 Aug.

Erratum in

  • Author Correction: Complex genetic variation in nearly complete human genomes.
    Logsdon GA, Ebert P, Audano PA, Loftus M, Porubsky D, Ebler J, Yilmaz F, Hallast P, Prodanov T, Yoo D, Paisie CA, Harvey WT, Zhao X, Martino GV, Henglin M, Munson KM, Rabbani K, Chin CS, Gu B, Ashraf H, Scholz S, Austine-Orimoloye O, Balachandran P, Bonder MJ, Cheng H, Chong Z, Crabtree J, Gerstein M, Guethlein LA, Hasenfeld P, Hickey G, Hoekzema K, Hunt SE, Jensen M, Jiang Y, Koren S, Kwon Y, Li C, Li H, Li J, Norman PJ, Oshima KK, Paten B, Phillippy AM, Pollock NR, Rausch T, Rautiainen M, Song Y, Söylev A, Sulovari A, Surapaneni L, Tsapalou V, Zhou W, Zhou Y, Zhu Q, Zody MC, Mills RE, Devine SE, Shi X, Talkowski ME, Chaisson MJP, Dilthey AT, Konkel MK, Korbel JO, Lee C, Beck CR, Eichler EE, Marschall T. Nature. 2025 Sep;645(8081):E6. doi: 10.1038/s41586-025-09547-1. PMID: 40858940.

Abstract

Diverse sets of complete human genomes are required to construct a pangenome reference and to understand the extent of complex structural variation. Here we sequence 65 diverse human genomes and build 130 haplotype-resolved assemblies (median continuity of 130 Mb), closing 92% of all previous assembly gaps [1,2] and reaching telomere-to-telomere status for 39% of the chromosomes. We highlight complete sequence continuity of complex loci, including the major histocompatibility complex (MHC), SMN1/SMN2, NBPF8 and AMY1/AMY2, and fully resolve 1,852 complex structural variants. In addition, we completely assemble and validate 1,246 human centromeres. We find up to 30-fold variation in α-satellite higher-order repeat array length and characterize the pattern of mobile element insertions into α-satellite higher-order repeat arrays. Although most centromeres predict a single site of kinetochore attachment, epigenetic analysis suggests the presence of two hypomethylated regions for 7% of centromeres. Combining our data with the draft pangenome reference [1] significantly enhances genotyping accuracy from short-read data, enabling whole-genome inference [3] to a median quality value of 45. Using this approach, 26,115 structural variants per individual are detected, substantially increasing the number of structural variants now amenable to downstream disease association studies.
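
For orientation, the median quality value of 45 can be read as a per-base error rate. The sketch below assumes the usual Phred-style definition, QV = -10 log10(error probability), which the abstract does not spell out; the function names are hypothetical and not taken from the paper.

```python
import math

def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value into a per-base error probability."""
    return 10 ** (-qv / 10)

def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability into a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)

# Under this assumption, QV 45 is roughly one error per 31,600 bases.
rate = qv_to_error_rate(45)
print(f"error rate: {rate:.2e}, about 1 error per {round(1 / rate):,} bases")
```

Each increase of 10 in QV corresponds to a tenfold lower error rate under this convention.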

Conflict of interest statement

Competing interests: E.E.E. is a scientific advisory board member of Variant Bio. C. Lee is a scientific advisory board member of Nabsys. S.K. has received travel funds to speak at events hosted by ONT. J.O.K., T.M. and D.P. have previously disclosed a patent application (no. EP19169090) relevant to Strand-seq. The other authors declare no competing interests.

Figures

Fig. 1. LRS, assembly and variant calling of 65 diverse humans.
a, Continental group (inner ring) and population group (outer ring) of the 65 diverse humans analysed in this study. AFR, African; AMR, American; EAS, East Asian; EUR, European; SAS, South Asian. Population groups are labelled according to the 1000 Genomes Project, along with the added Maasai in Kinyawa, Kenya (MKK) and Ashkenazim (ASK) labels. b, Scaffold auN for haplotype 1 (H1) and haplotype 2 (H2) contigs from each genome assembly. Data points are coloured by population group. The dashed lines indicate the median auN per haplotype. The dotted line indicates the unit diagonal. c, Quality value (QV) estimates for each genome assembly derived from variant calls or k-mer statistics (Methods). d, The number of chromosomes assembled from T2T for each genome assembly, including both single contigs and scaffolds (Methods). The median (solid line) and first and third quartiles (dotted lines) are shown. e, The number of T2T chromosomes in a single contig (dark blue, T2T contig) or in a single scaffold (light blue, T2T scaffold). Incomplete chromosomes are labelled as ‘not T2T’ or ‘missing’ if missing entirely. Sex chromosomes not present in the respective haploid assembly are labelled as ‘NA’. f, Cumulative non-redundant SVs across the diverse haplotypes in this study called with respect to the T2T-CHM13 reference genome (three trio children excluded). g, Number of SVs detected for each haplotype relative to the T2T-CHM13 reference genome, coloured by population. Insertions and deletions are balanced when called against the T2T-CHM13 reference genome but imbalanced when called against the GRCh38 reference genome (Extended Data Fig. 1d).
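
Panel b reports contiguity as auN. A minimal sketch of that metric follows, assuming the common definition auN = sum(L_i^2) / sum(L_i) over sequence lengths L_i (the caption does not restate the formula). N50 is included only for comparison, and both helper names are hypothetical.

```python
def aun(lengths):
    """Area under the Nx curve: length-weighted mean of sequence lengths."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("lengths must contain at least one non-zero value")
    return sum(length * length for length in lengths) / total

def n50(lengths):
    """Length of the sequence at which half of the total bases are reached."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must not be empty")

# Toy scaffold lengths (arbitrary units), not data from the study.
toy = [100, 50, 50]
print(aun(toy), n50(toy))
```

Unlike N50, auN changes smoothly whenever any sequence is joined or broken, which is one reason it is used to compare assemblies.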
Fig. 2. An improved genomic resource for challenging loci.
a, Structure of a human Y chromosome, including the centromere (CEN; top), and repeat composition of five contiguously assembled Yq12 heterochromatic regions with their phylogenetic relationships (bottom left), size or number of DYZ1 and DYZ2 repeat array blocks (bottom right), and Alu insertion locations (triangles). ka, thousand years ago. b, Number of Iso-Seq reads that fail to align with 99% or less accuracy (left), and number of gigabases (Gb) of Iso-Seq reads that align with 99% or more accuracy (right) to the T2T-CHM13 reference genome versus the assemblies in this study. c, Expressed isoforms of ZNF718 in NA19317. This individual is heterozygous for a deletion (red box, chr. 4: 127125–133267) that affects the ZNF718 exon–intron structure. Isoforms not previously annotated in RefSeq, GENCODE or CHESS (Methods) are shown (yellow). LTR, long terminal repeat; SINE, short interspersed nuclear element; LINE, long interspersed nuclear element. d, Number of rare (allele frequency < 1%) SVs per sample in the HPRC-genotyped callset (grey), Illumina-based 1kGP-HC SV callset (orange), and combined HPRC and HGSVC-genotyped callset (blue) for both non-African (non-AFR) and African (AFR) individuals (n = 3,202). The first and third quartiles (Q1 and Q3, respectively; black boxes), median (white dots), and minima and maxima (black lines) are shown. e, Estimated k-mer-based QV for 60 haplotypes from the 1kGP-HC-phased set (GRCh38 based), HGSVC-phased genotypes using PanGenie, SHAPEIT5 (PG-SHAPEIT, T2T-CHM13 based) and all HGSVC genome assemblies. ‘Syntenic’ refers to regions of T2T-CHM13 also present in GRCh38. Baseline QV estimated by randomizing samples (red dashed line), first and third quartiles (black boxes), median (orange line), outliers (white dots) and whiskers (quantile 1 − 1.5(quantile 3–quantile 1) and quantile 3 + 1.5(quantile 3–quantile 1)) are shown. 
f, Haplotype availability, Locityper genotyping accuracy and trio concordance across 347 polymorphic loci in terms of variant-based QV. Availability and accuracy are calculated for 61 HGSVC individuals, whereas trio concordance is calculated for 602 trios. Full, HPRC + HGSVC; HPRC, HPRC only; HPRC + HGSVC*, HPRC + HGSVC leave-one-out.
Fig. 3. Structurally variable regions of the MHC locus.
a, Overview of the organization of the MHC locus into class I, class II and class III regions and the genes contained therein. Structurally variable regions are indicated by dashed lines. The coloured stripes show the approximate location of the regions analysed in b–d. b, Gene content and locations of solitary HLA-DRB exon 1 and intron 1 sequences in the HLA-DR region of the MHC locus by the DR group, an established system for classifying haplotypes in the HLA-DR region according to their gene or pseudogene structure and their HLA-DRB1 allele. c, High-resolution repeat maps and locations of gene or pseudogene exons for different DR group haplotypes in the HLA-DR region, highlighting sequence homology between the DR1 and DR4/7/9 and DR2, and between the DR8 and DR3/5/6, haplotype groups, respectively. Also shown is the number of analysed MHC haplotypes per DR group. CR1, chicken repeat 1; ERV, endogenous retrovirus; MIR, mammalian interspersed repeat; snRNA, small nuclear RNA. d, Visualization of common and notable RCCX haplotype structures observed in the HGSVC MHC haplotypes, showing variation in gene and pseudogene content as well as the modular structure of RCCX (STK19 (S), non-functional CYP21A2 (black C), functional CYP21A2 (white C) and C4L/S (long (HERV-K insertion)/short (no HERV-K insertion))). e, Visualization of a PGR-TK analysis of 55 MHC loci and T2T-CHM13 for 111 haplotypes in total. The colours indicate the relative proportion of distinct DR group haplotypes flowing through the visualized elements.
Fig. 4. Complex SVs in human populations.
a, An SD-mediated CSV inverts NBPF8 and deletes NOTCH2NLR and NBPF26. Inverted SD pairs (orange and yellow bands) each mediate a template switch (dashed lines ‘1’ and ‘2’). PAV refines alignment artefacts in large repeats surrounding CSVs to obtain a more accurate representation of these structures. The allele shown is HG00171 haplotype 1. b, Fraction of all assemblies having complete and accurate sequence over the SMN region, stratified by study (HPRC-Yr1 and HGSVC). c, Copy number (full and partial gene alignments) of each multi-copy gene (SMN1/2 in red, SERF1A/B in green, NAIP in gold and GTF2H2/C in blue) across assembled haplotypes (n = 101). d, SMN duplications from 11 diverse human haplotypes assembled from this study, the HPRC (HG02486) and one Pongo pygmaeus haplotype (top) used as an outgroup. e, Summary of SMN1 (yellow) and SMN2 (red) gene copies genotyped across human haplotypes (n = 101). The yellow and red bars show a unique copy number of SMN1 and SMN2, whereas the pie charts show their relative proportions in continental groups. The asterisks show haplotypes with only SMN2 gene copies. f, The structure of the human amylase locus shows amylase genes (coloured arrows) and alignments between haplotypes (99–100% sequence identity). The H3r.4 haplotype represents the most common haplotype, H5.15 and H7.2 are haplotypes previously unresolved at the base-pair level, and H11.1 is a previously unknown haplotype. Amylase gene annotations are displayed above each haplotype structure. The structure of each amylase haplotype, composed of amylase segments, is indicated by the coloured arrows. Sequence similarity between haplotypes ranges from 99% to 100%.
Fig. 5. Variation in the sequence, structure and methylation pattern among 1,246 human centromeres.
a, Length of the active α-satellite HOR array(s) for each complete and accurately assembled centromere from each genome. Each data point indicates an active α-satellite HOR array and is coloured by population group. The median length of all α-satellite HOR arrays is shown as a dashed line. For each chromosome, the median (solid line) and first and third quartiles (dashed lines) are shown. b, Sequence, structure and methylation (methyl.) map of centromeres from CHM13, CHM1 and a subset of 65 diverse human genomes. The α-satellite HORs are coloured by the number of α-satellite monomers within them, and the site of the putative kinetochore, indicated by the CDR, is shown. Mon., monomeric; div., divergent. c, Differences in the α-satellite HOR array organization and methylation patterns between the CHM13 and HG00513 (H1) chromosome 10 centromeres. The CDRs are located on highly identical sequences in both centromeres, despite their differing locations. d, MEIs in the chromosome 2 centromeric α-satellite HOR array. Most MEIs are consistent with duplications of the same element rather than distinct insertions, and all of them reside outside of the CDR. Var., variant.
Extended Data Fig. 1. Statistics of long-read sequencing data and genome assemblies generated in this study as well as variant calls for 65 diverse human genomes.
a) Fold coverage of the Pacific Biosciences (PacBio) high-fidelity (HiFi) and Oxford Nanopore Technologies (ONT) long-read sequencing data generated for each genome in this study. The median (solid line) and first and third quartiles (dotted lines) are shown. b) Read length N50 of the PacBio HiFi and ONT data generated for each genome in this study. The median (solid line) and first and third quartiles (dotted lines) are shown. c) Gene completeness as a percentage of BUSCO single-copy orthologs detected in each haplotype from each genome assembly (Methods). d) The number of SVs identified in one individual by 14 different SV callers, including PAV (Methods). Each bar is divided into four categories as follows: PAV, SVs identified by PAV (black); PAV (trimmed), false SVs from other callers in redundantly aligned sequences that PAV removes (red); Covered, SVs not called by PAV but within callable loci spanned by assembly alignments (dark gray); No assembly, SVs identified in locations not callable by PAV (light gray). Before applying caller-based QC, 99.75% of PAV calls are supported by at least one other call source. The individual evaluated is HG00171. e) Number of SVs called for each haplotype relative to the GRCh38 reference genome, colored by population. Insertions and deletions are imbalanced when called against the GRCh38 reference genome but balanced when called against the T2T-CHM13 reference genome (Fig. 1g). f) Number of SV insertions (left) and deletions (right) called against T2T-CHM13, GRCh38, or both reference genomes relative to their allele frequency. SVs called against both references tend to be rarer because they are less likely to appear in a reference genome. A sharp peak for high allele frequency (~1.0) for insertions is detected relative to the GRCh38 reference genome but not the T2T-CHM13 reference genome.
Extended Data Fig. 2. Classification and distribution of changes in SD content in the 65 genomes.
a) Number of segmentally duplicated bases assembled in different regions of the genome for each individual in this study, excluding sex chromosomes. The dashed line indicates the number of segmentally duplicated bases in the T2T-CHM13 genome. b) Segmental duplication (SD) accumulation curve. Starting with T2T-CHM13, the SDs (excluding those located in acrocentric regions and chrY) of 63 individuals (excluding NA19650 and NA19434) were projected onto T2T-CHM13 genome space in the continental group order of: EUR, AMR, EAS, SAS and AFR. For each bar, the SDs that are singleton, doubleton, polymorphic (>2) and shared (>90%) are indicated. The first bar is classified as “shared”, as the assembly is only being compared to itself. c) Schematic depicting the four categories of non-reference SDs: 1) new (i.e., unique in the reference), 2) expanded copy number, 3) content or composition changed, and 4) expanded and content changed SDs with respect to the SDs in the reference genome, T2T-CHM13. d) Quantification in terms of Mbp and predicted protein-coding genes across the four categories of new SDs compared to T2T-CHM13. The left panel shows the Mbp by category, while flagging those that are singleton (i.e., duplicated in T2T-CHM13 but not in other genomes). The right panel quantifies the number of complete (100% coverage) and partial overlaps (>50% coverage) with protein-coding genes for the respective chromosomes.
Extended Data Fig. 3. Effects of SVs on gene expression, chromosome conformation, and complex traits.
a) The percentage of Iso-Seq isoforms identified for each individual classified as previously identified in RefSeq (present in at least two individuals; blue), novel (present in at least two individuals; orange), individual-specific previously identified isoforms (red), or individual-specific novel (teal). b) Manhattan plot of the allele frequencies for 256 SVs disrupting protein-coding exons of 136 genes with expression present in Iso-Seq. Circled in red is the 6,142 bp polymorphic deletion in ZNF718. c) Comparison of the average unique isoforms in Iso-Seq phased to wild-type and variant haplotypes for 1,471 single SV-containing protein-coding genes. The color represents the type of SV [deletion (DEL): blue, insertion (INS): orange] and the shape indicates where the SV occurs in relation to the canonical transcript [circle: coding sequence (CDS), square: untranslated region (UTR), triangle: intron]. d) Proportion of genes located within 50 kbp of SV regions that show differential expression (DE; RNA-seq) among individuals who carry the SVs (red line), compared with the distribution of DE gene proportions nearby simulated SV regions (1,000 permutations). e) Enrichments and depletions of SVs within GENCODE v45 protein-coding, long noncoding RNA (lncRNA), and pseudogene elements, subdivided into various biotypes. *empirical p < 0.05 from 1,000 permutations with Benjamini-Hochberg correction. ns, nonsignificant. Error bars indicate ±1 s.d. centered on the mean. p-values are listed in Supplementary Table 43. f) Enrichments and depletions of SVs within classes of ENCODE candidate cis-regulatory elements (cCREs). *empirical p < 0.05 from 1,000 permutations with Benjamini-Hochberg correction. ns, nonsignificant. Error bars indicate ±1 s.d. centered on the mean. p-values are listed in Supplementary Table 59. 
g) A differentially insulated region in individuals with chr1-248444872-INS-63 SV, located nearby the DE gene OR2T5, suggests an SV-mediated novel chromatin domain could lead to increased gene expression. n = 7 individuals with the SV and 5 without the SV. Box plots indicate median and first and third quartiles, with whiskers extending to 1.5 times the interquartile range. Two-sided Wilcoxon rank-sum test with Benjamini-Hochberg correction. h) Number of SVs per chromosome that are in high (r2 > 0.8) or perfect (r2 = 1) linkage disequilibrium (LD) with GWAS SNPs significantly associated with diseases and human traits.
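
Panels e and f report empirical p-values from 1,000 permutations with Benjamini-Hochberg correction. The sketch below shows one common way to compute both. The one-sided "greater or equal" test and the +1 pseudocount are assumptions made for illustration, not details given in the caption, and the function names are hypothetical.

```python
def empirical_p(observed, permuted):
    """One-sided empirical p-value: share of permuted statistics >= observed.

    The +1 in numerator and denominator is an assumed pseudocount that keeps
    the estimate above zero when no permutation reaches the observed value.
    """
    hits = sum(1 for value in permuted if value >= observed)
    return (hits + 1) / (len(permuted) + 1)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Toy p-values, not values from Supplementary Tables 43 or 59.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

A depletion test would count permuted values less than or equal to the observed one instead.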
Extended Data Fig. 4. Genotyping from short-read sequencing data.
a) Completeness statistics for haplotypes produced from the 1kGP-HC phased set (GRCh38-based) and by genome inference with Pangenie followed by phasing (T2T-CHM13–based). To allow for comparison between the GRCh38- and T2T-CHM13-based callsets, we additionally restricted our analysis to “syntenic” regions of T2T-CHM13, i.e., excluding regions unique to T2T-CHM13. For both phased sets, completeness was computed on a subset of n = 30 individuals. The median is marked in yellow, and the lower and upper limits of each box represent lower and upper quartiles (Q1 and Q3). Lower and upper whiskers are defined as Q1 − 1.5(Q3–Q1) and Q3 + 1.5(Q3–Q1). b) Locityper genotyping accuracy for 10 target loci with the highest average variant-based QV improvement. c) Locityper genotyping results for HLA genes on 61 Illumina short-read HGSVC datasets using three reference panels: HPRC (90 haplotypes), leave-one-out HPRC + HGSVC (HPRC + HGSVC*, 214 haplotypes), and HPRC + HGSVC (full, 216 haplotypes). Accuracy is evaluated as the number of correctly identified allele fields in the corresponding gene nomenclature.
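
The whisker limits quoted in panel a, Q1 − 1.5(Q3–Q1) and Q3 + 1.5(Q3–Q1), can be computed as sketched below. The quartile interpolation method is an assumption (Python's "inclusive" method is used here; plotting libraries may differ slightly), and the helper names are hypothetical.

```python
from statistics import quantiles

def whisker_limits(values):
    """Return (lower, upper) whisker limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers(values):
    """Values falling outside the whisker limits."""
    low, high = whisker_limits(values)
    return [value for value in values if value < low or value > high]

# Toy data, not completeness values from the study.
toy = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
print(whisker_limits(toy), outliers(toy))
```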
Extended Data Fig. 5. Assembly of 1,246 human centromeres across 65 diverse human genomes show genetic and epigenetic variation.
a) Number (left y-axis) and percentage (right y-axis) of centromeres that are completely and accurately assembled among 65 diverse human genomes, colored by population group. Mean, dashed line. b,c) Examples of di-kinetochores, defined as two CDRs located >80 kbp apart from each other, on the b) HG02953 chromosome 6 centromere and c) HG01573 chromosome 15 centromere. UL ONT reads span both CDRs in each case, indicating that the CDRs occur on the same chromosome in the cell population. d) Differences in the α-satellite HOR array organization and methylation patterns between the CHM13 and NA18989 (H1) chromosome 19 centromeres. The NA18989 (H1) chromosome 19 centromere has two CDRs, indicating the potential presence of a di-kinetochore.
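
The di-kinetochore criterion in panels b and c (two CDRs located >80 kbp apart) can be expressed as a simple interval check. In this sketch, distance is taken as the gap between the end of one CDR and the start of the other, which is an assumption because the caption does not say how distance is measured. The coordinates and function names are hypothetical.

```python
from itertools import combinations

def gap_bp(cdr_a, cdr_b):
    """Gap in bp between two (start, end) intervals; 0 if they touch or overlap."""
    (_, first_end), (second_start, _) = sorted([cdr_a, cdr_b])
    return max(0, second_start - first_end)

def is_di_kinetochore(cdrs, min_gap_bp=80_000):
    """True if any two CDRs on the same centromere are more than min_gap_bp apart."""
    return any(gap_bp(a, b) > min_gap_bp for a, b in combinations(cdrs, 2))

# Hypothetical CDR coordinates on one centromere, not taken from the assemblies.
print(is_di_kinetochore([(1_000_000, 1_100_000), (1_250_000, 1_320_000)]))
```

A real analysis would also require read support that both CDRs lie on the same molecule, as the caption notes for the UL ONT reads.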

Update of

  • Complex genetic variation in nearly complete human genomes.
    Logsdon GA, Ebert P, Audano PA, Loftus M, Porubsky D, Ebler J, Yilmaz F, Hallast P, Prodanov T, Yoo D, Paisie CA, Harvey WT, Zhao X, Martino GV, Henglin M, Munson KM, Rabbani K, Chin CS, Gu B, Ashraf H, Austine-Orimoloye O, Balachandran P, Bonder MJ, Cheng H, Chong Z, Crabtree J, Gerstein M, Guethlein LA, Hasenfeld P, Hickey G, Hoekzema K, Hunt SE, Jensen M, Jiang Y, Koren S, Kwon Y, Li C, Li H, Li J, Norman PJ, Oshima KK, Paten B, Phillippy AM, Pollock NR, Rausch T, Rautiainen M, Scholz S, Song Y, Söylev A, Sulovari A, Surapaneni L, Tsapalou V, Zhou W, Zhou Y, Zhu Q, Zody MC, Mills RE, Devine SE, Shi X, Talkowski ME, Chaisson MJP, Dilthey AT, Konkel MK, Korbel JO, Lee C, Beck CR, Eichler EE, Marschall T. bioRxiv [Preprint]. 2024 Sep 25:2024.09.24.614721. doi: 10.1101/2024.09.24.614721. Update in: Nature. 2025 Aug;644(8076):430-441. doi: 10.1038/s41586-025-09140-6. PMID: 39372794.

References

    1. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
    2. Porubsky, D. et al. Gaps and complex structurally variant loci in phased genome assemblies. Genome Res. 33, 496–510 (2023).
    3. Ebler, J. et al. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat. Genet. 54, 518–525 (2022).
    4. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
    5. Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. 10.1038/s41587-020-0711-0 (2020).
