Nature. 2025 Aug;644(8076):430-441.

doi: 10.1038/s41586-025-09140-6. Epub 2025 Jul 23.

Complex genetic variation in nearly complete human genomes

Glennis A Logsdon^#^{1,2}, Peter Ebert^#^{3,4}, Peter A Audano^#⁵, Mark Loftus^#^{6,7,5}, David Porubsky¹, Jana Ebler^{4,8}, Feyza Yilmaz⁵, Pille Hallast⁵, Timofey Prodanov^{4,8}, DongAhn Yoo¹, Carolyn A Paisie⁵, William T Harvey¹, Xuefang Zhao^{9,10,11}, Gianni V Martino^{6,7,12}, Mir Henglin^{4,8}, Katherine M Munson¹, Keon Rabbani¹³, Chen-Shan Chin¹⁴, Bida Gu¹³, Hufsah Ashraf^{4,8}, Stephan Scholz^{4,15}, Olanrewaju Austine-Orimoloye¹⁶, Parithi Balachandran⁵, Marc Jan Bonder^{17,18,19}, Haoyu Cheng²⁰, Zechen Chong²¹, Jonathan Crabtree²², Mark Gerstein^{23,24}, Lisbeth A Guethlein²⁵, Patrick Hasenfeld²⁶, Glenn Hickey²⁷, Kendra Hoekzema¹, Sarah E Hunt¹⁶, Matthew Jensen^{23,24}, Yunzhe Jiang^{23,24}, Sergey Koren²⁸, Youngjun Kwon¹, Chong Li^{29,30}, Heng Li^{31,32}, Jiaqi Li^{23,24}, Paul J Norman^{33,34}, Keisuke K Oshima², Benedict Paten²⁷, Adam M Phillippy²⁸, Nicholas R Pollock³³, Tobias Rausch²⁶, Mikko Rautiainen³⁵, Yuwei Song²¹, Arda Söylev^{4,8}, Arvis Sulovari¹, Likhitha Surapaneni¹⁶, Vasiliki Tsapalou²⁶, Weichen Zhou³⁶, Ying Zhou³¹, Qihui Zhu^{5,37}, Michael C Zody³⁸, Ryan E Mills³⁶, Scott E Devine²², Xinghua Shi^{29,30}, Michael E Talkowski^{9,10,11}, Mark J P Chaisson¹³, Alexander T Dilthey^{4,15}, Miriam K Konkel^{39,40}, Jan O Korbel⁴¹, Charles Lee⁴², Christine R Beck^{43,44}, Evan E Eichler^{45,46}, Tobias Marschall^{47,48}

Affiliations

¹ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
² Department of Genetics, Epigenetics Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
³ Core Unit Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University, Düsseldorf, Germany.
⁴ Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany.
⁵ The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA.
⁶ Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA.
⁷ Center for Human Genetics, Clemson University, Greenwood, SC, USA.
⁸ Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University, Düsseldorf, Germany.
⁹ Program in Medical and Population Genetics and Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
¹⁰ Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
¹¹ Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.
¹² Medical University of South Carolina, College of Graduate Studies, Charleston, SC, USA.
¹³ Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
¹⁴ Pathos AI Inc., Chicago, IL, USA.
¹⁵ Institute of Medical Microbiology and Hospital Hygiene, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany.
¹⁶ European Molecular Biology Laboratory, Wellcome Genome Campus, European Bioinformatics Institute, Cambridge, UK.
¹⁷ Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands.
¹⁸ Oncode Institute, Utrecht, The Netherlands.
¹⁹ Division of Computational Genomics and Systems Genetics, German Cancer Research Center, Heidelberg, Germany.
²⁰ Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, USA.
²¹ Department of Biomedical Informatics and Data Science, Heersink School of Medicine, University of Alabama, Birmingham, AL, USA.
²² Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA.
²³ Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA.
²⁴ Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA.
²⁵ Department of Structural Biology, School of Medicine, Stanford University, Stanford, CA, USA.
²⁶ Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.
²⁷ UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.
²⁸ Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
²⁹ Department of Computer and Information Sciences, College of Science and Technology, Temple University, Philadelphia, PA, USA.
³⁰ Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA.
³¹ Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
³² Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
³³ Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA.
³⁴ Department of Immunology and Microbiology, University of Colorado School of Medicine, Aurora, CO, USA.
³⁵ Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland.
³⁶ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
³⁷ Stanford Health Care, Palo Alto, CA, USA.
³⁸ New York Genome Center, New York, NY, USA.
³⁹ Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA. mkonkel@clemson.edu.
⁴⁰ Center for Human Genetics, Clemson University, Greenwood, SC, USA. mkonkel@clemson.edu.
⁴¹ Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany. jan.korbel@embl.org.
⁴² The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA. Charles.Lee@jax.org.
⁴³ The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA. Christine.Beck@jax.org.
⁴⁴ The University of Connecticut Health Center, Farmington, CT, USA. Christine.Beck@jax.org.
⁴⁵ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA. ee3@uw.edu.
⁴⁶ Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA. ee3@uw.edu.
⁴⁷ Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany. tobias.marschall@hhu.de.
⁴⁸ Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University, Düsseldorf, Germany. tobias.marschall@hhu.de.

^# Contributed equally.

PMID: 40702183
PMCID: PMC12350169
DOI: 10.1038/s41586-025-09140-6

Glennis A Logsdon et al. Nature. 2025 Aug.

Erratum in

Author Correction: Complex genetic variation in nearly complete human genomes.
Logsdon GA, Ebert P, Audano PA, Loftus M, Porubsky D, Ebler J, Yilmaz F, Hallast P, Prodanov T, Yoo D, Paisie CA, Harvey WT, Zhao X, Martino GV, Henglin M, Munson KM, Rabbani K, Chin CS, Gu B, Ashraf H, Scholz S, Austine-Orimoloye O, Balachandran P, Bonder MJ, Cheng H, Chong Z, Crabtree J, Gerstein M, Guethlein LA, Hasenfeld P, Hickey G, Hoekzema K, Hunt SE, Jensen M, Jiang Y, Koren S, Kwon Y, Li C, Li H, Li J, Norman PJ, Oshima KK, Paten B, Phillippy AM, Pollock NR, Rausch T, Rautiainen M, Song Y, Söylev A, Sulovari A, Surapaneni L, Tsapalou V, Zhou W, Zhou Y, Zhu Q, Zody MC, Mills RE, Devine SE, Shi X, Talkowski ME, Chaisson MJP, Dilthey AT, Konkel MK, Korbel JO, Lee C, Beck CR, Eichler EE, Marschall T. Nature. 2025 Sep;645(8081):E6. doi: 10.1038/s41586-025-09547-1. PMID: 40858940. Free PMC article. No abstract available.

Abstract

Diverse sets of complete human genomes are required to construct a pangenome reference and to understand the extent of complex structural variation. Here we sequence 65 diverse human genomes and build 130 haplotype-resolved assemblies (median continuity of 130 Mb), closing 92% of all previous assembly gaps¹,² and reaching telomere-to-telomere status for 39% of the chromosomes. We highlight complete sequence continuity of complex loci, including the major histocompatibility complex (MHC), SMN1/SMN2, NBPF8 and AMY1/AMY2, and fully resolve 1,852 complex structural variants. In addition, we completely assemble and validate 1,246 human centromeres. We find up to 30-fold variation in α-satellite higher-order repeat array length and characterize the pattern of mobile element insertions into α-satellite higher-order repeat arrays. Although most centromeres predict a single site of kinetochore attachment, epigenetic analysis suggests the presence of two hypomethylated regions for 7% of centromeres. Combining our data with the draft pangenome reference¹ significantly enhances genotyping accuracy from short-read data, enabling whole-genome inference³ to a median quality value of 45. Using this approach, 26,115 structural variants per individual are detected, substantially increasing the number of structural variants now amenable to downstream disease association studies.

Conflict of interest statement

Competing interests: E.E.E. is a scientific advisory board member of Variant Bio. C. Lee is a scientific advisory board member of Nabsys. S.K. has received travel funds to speak at events hosted by ONT. J.O.K., T.M. and D.P. have previously disclosed a patent application (no. EP19169090) relevant to Strand-seq. The other authors declare no competing interests.

Figures

**Fig. 1. LRS, assembly and variant calling of 65 diverse humans.**
a, Continental group (inner ring) and population group (outer ring) of the 65 diverse humans analysed in this study. AFR, African; AMR, American; EAS, East Asian; EUR, European; SAS, South Asian. Population groups are labelled according to the 1000 Genomes Project, along with the added Maasai in Kinyawa, Kenya (MKK) and Ashkenazim (ASK) labels. b, Scaffold auN for haplotype 1 (H1) and haplotype 2 (H2) contigs from each genome assembly. Data points are coloured by population group. The dashed lines indicate the median auN per haplotype. The dotted line indicates the unit diagonal. c, Quality value (QV) estimates for each genome assembly derived from variant calls or k-mer statistics (Methods). d, The number of chromosomes assembled from T2T for each genome assembly, including both single contigs and scaffolds (Methods). The median (solid line) and first and third quartiles (dotted lines) are shown. e, The number of T2T chromosomes in a single contig (dark blue, T2T contig) or in a single scaffold (light blue, T2T scaffold). Incomplete chromosomes are labelled as ‘not T2T’ or ‘missing’ if missing entirely. Sex chromosomes not present in the respective haploid assembly are labelled as ‘NA’. f, Cumulative non-redundant SVs across the diverse haplotypes in this study called with respect to the T2T-CHM13 reference genome (three trio children excluded). g, Number of SVs detected for each haplotype relative to the T2T-CHM13 reference genome, coloured by population. Insertions and deletions are balanced when called against the T2T-CHM13 reference genome but imbalanced when called against the GRCh38 reference genome (Extended Data Fig. 1d).

**Fig. 2. An improved genomic resource for challenging loci.**
a, Structure of a human Y chromosome, including the centromere (CEN; top), and repeat composition of five contiguously assembled Yq12 heterochromatic regions with their phylogenetic relationships (bottom left), size or number of *DYZ1* and *DYZ2* repeat array blocks (bottom right), and *Alu* insertion locations (triangles). ka, thousand years ago. b, Number of Iso-Seq reads that fail to align with 99% or less accuracy (left), and number of gigabases (Gb) of Iso-Seq reads that align with 99% or more accuracy (right) to the T2T-CHM13 reference genome versus the assemblies in this study. c, Expressed isoforms of *ZNF718* in NA19317. This individual is heterozygous for a deletion (red box, chr. 4: 127125–133267) that affects the *ZNF718* exon–intron structure. Isoforms not previously annotated in RefSeq, GENCODE or CHESS (Methods) are shown (yellow). LTR, long terminal repeat; SINE, short interspersed nuclear element; LINE, long interspersed nuclear element. d, Number of rare (allele frequency < 1%) SVs per sample in the HPRC-genotyped callset (grey), Illumina-based 1kGP-HC SV callset (orange), and combined HPRC and HGSVC-genotyped callset (blue) for both non-African (non-AFR) and African (AFR) individuals (n = 3,202). The first and third quartiles (Q1 and Q3, respectively; black boxes), median (white dots), and minima and maxima (black lines) are shown. e, Estimated k-mer-based QV for 60 haplotypes from the 1kGP-HC-phased set (GRCh38 based), HGSVC-phased genotypes using PanGenie, SHAPEIT5 (PG-SHAPEIT, T2T-CHM13 based) and all HGSVC genome assemblies. ‘Syntenic’ refers to regions of T2T-CHM13 also present in GRCh38. Baseline QV estimated by randomizing samples (red dashed line), first and third quartiles (black boxes), median (orange line), outliers (white dots) and whiskers (quantile 1 − 1.5(quantile 3–quantile 1) and quantile 3 + 1.5(quantile 3–quantile 1)) are shown. f, Haplotype availability, Locityper genotyping accuracy and trio concordance across 347 polymorphic loci in terms of variant-based QV. Availability and accuracy are calculated for 61 HGSVC individuals, whereas trio concordance is calculated for 602 trios. Full, HPRC + HGSVC; HPRC, HPRC only; HPRC + HGSVC*, HPRC + HGSVC leave-one-out.

**Fig. 3. Structurally variable regions of the MHC locus.**
a, Overview of the organization of the MHC locus into class I, class II and class III regions and the genes contained therein. Structurally variable regions are indicated by dashed lines. The coloured stripes show the approximate location of the regions analysed in b–d. b, Gene content and locations of solitary *HLA-DRB* exon 1 and intron 1 sequences in the HLA-DR region of the MHC locus by the DR group, an established system for classifying haplotypes in the HLA-DR region according to their gene or pseudogene structure and their *HLA-DRB1* allele. c, High-resolution repeat maps and locations of gene or pseudogene exons for different DR group haplotypes in the HLA-DR region, highlighting sequence homology between the DR1 and DR4/7/9 and DR2, and between the DR8 and DR3/5/6, haplotype groups, respectively. Also shown is the number of analysed MHC haplotypes per DR group. CR1, chicken repeat 1; ERV, endogenous retrovirus; MIR, mammalian interspersed repeat; snRNA, small nuclear RNA. d, Visualization of common and notable RCCX haplotype structures observed in the HGSVC MHC haplotypes, showing variation in gene and pseudogene content as well as the modular structure of RCCX (*STK19* (S), non-functional *CYP21A2* (black C), functional *CYP21A2* (white C) and *C4L*/S (long (HERV-K insertion)/short (no HERV-K insertion))). e, Visualization of a PGR-TK analysis of 55 MHC loci and T2T-CHM13 for 111 haplotypes in total. The colours indicate the relative proportion of distinct DR group haplotypes flowing through the visualized elements.

**Fig. 4. Complex SVs in human populations.**
a, An SD-mediated CSV inverts *NBPF8* and deletes *NOTCH2NLR* and *NBPF26*. Inverted SD pairs (orange and yellow bands) each mediate a template switch (dashed lines ‘1’ and ‘2’). PAV refines alignment artefacts in large repeats surrounding CSVs to obtain a more accurate representation of these structures. The allele shown is HG00171 haplotype 1. b, Fraction of all assemblies having complete and accurate sequence over the SMN region, stratified by study (HPRC-Yr1 and HGSVC). c, Copy number (full and partial gene alignments) of each multi-copy gene (*SMN1/2* in red, *SERF1A/B* in green, *NAIP* in gold and *GTF2H2/C* in blue) across assembled haplotypes (n = 101). d, SMN duplications from 11 diverse human haplotypes assembled from this study, the HPRC (HG02486) and one *Pongo pygmaeus* haplotype (top) used as an outgroup. e, Summary of *SMN1* (yellow) and *SMN2* (red) gene copies genotyped across human haplotypes (n = 101). The yellow and red bars show a unique copy number of *SMN1* and *SMN2*, whereas the pie charts show their relative proportions in continental groups. The asterisks show haplotypes with only *SMN2* gene copies. f, The structure of the human amylase locus shows amylase genes (coloured arrows) and alignments between haplotypes (99–100% sequence identity). The H3^r.4 haplotype represents the most common haplotype, H5.15 and H7.2 are haplotypes previously unresolved at the base-pair level, and H11.1 is a previously unknown haplotype. Amylase gene annotations are displayed above each haplotype structure. The structure of each amylase haplotype, composed of amylase segments, is indicated by the coloured arrows. Sequence similarity between haplotypes ranges from 99% to 100%.

**Fig. 5. Variation in the sequence, structure and methylation pattern among 1,246 human centromeres.**
a, Length of the active α-satellite HOR array (arrays) for each complete and accurately assembled centromere from each genome. Each data point indicates an active α-satellite HOR array and is coloured by population group. The median length of all α-satellite HOR arrays is shown as a dashed line. For each chromosome, the median (solid line) and first and third quartiles (dashed lines) are shown. b, Sequence, structure and methylation (methyl.) map of centromeres from CHM13, CHM1 and a subset of 65 diverse human genomes. The α-satellite HORs are coloured by the number of α-satellite monomers within them, and the site of the putative kinetochore, indicated by the CDR, is shown. Mon., monomeric; div., divergent. c, Differences in the α-satellite HOR array organization and methylation patterns between the CHM13 and HG00513 (H1) chromosome 10 centromeres. The CDRs are located on highly identical sequences in both centromeres, despite their differing locations. d, MEIs in the chromosome 2 centromeric α-satellite HOR array. Most MEIs are consistent with duplications of the same element rather than distinct insertions, and all of them reside outside of the CDR. Var., variant.

**Extended Data Fig. 1. Statistics of long-read sequencing data and genome assemblies generated in this study as well as variant calls for 65 diverse human genomes.**
a) Fold coverage of the Pacific Biosciences (PacBio) high-fidelity (HiFi) and Oxford Nanopore Technologies (ONT) long-read sequencing data generated for each genome in this study. The median (solid line) and first and third quartiles (dotted lines) are shown. b) Read length N50 of the PacBio HiFi and ONT data generated for each genome in this study. The median (solid line) and first and third quartiles (dotted lines) are shown. c) Gene completeness as a percentage of BUSCO single-copy orthologs detected in each haplotype from each genome assembly (Methods). d) The number of SVs identified in one individual by 14 different SV callers, including PAV (Methods). Each bar is divided into four categories as follows: PAV, SVs identified by PAV (black); PAV (trimmed), false SVs from other callers in redundantly aligned sequences that PAV removes (red); Covered, SVs not called by PAV but within callable loci spanned by assembly alignments (dark gray); No assembly, SVs identified in locations not callable by PAV (light gray). Before applying caller-based QC, 99.75% of PAV calls are supported by at least one other call source. The individual evaluated is HG00171. e) Number of SVs called for each haplotype relative to the GRCh38 reference genome, colored by population. Insertions and deletions are imbalanced when called against the GRCh38 reference genome but balanced when called against the T2T-CHM13 reference genome (Fig. 1g). f) Number of SV insertions (left) and deletions (right) called against T2T-CHM13, GRCh38, or both reference genomes relative to their allele frequency. SVs called against both references tend to be rarer because they are less likely to appear in a reference genome. A sharp peak for high allele frequency (~1.0) for insertions is detected relative to the GRCh38 reference genome but not the T2T-CHM13 reference genome.

**Extended Data Fig. 2. Classification and distribution of changes in SD content in the 65 genomes.**
a) Number of segmentally duplicated bases assembled in different regions of the genome for each individual in this study, excluding sex chromosomes. The dashed line indicates the number of segmentally duplicated bases in the T2T-CHM13 genome. b) Segmental duplication (SD) accumulation curve. Starting with T2T-CHM13, the SDs (excluding those located in acrocentric regions and chrY) of 63 individuals (excluding NA19650 and NA19434) were projected onto T2T-CHM13 genome space in the continental group order of: EUR, AMR, EAS, SAS and AFR. For each bar, the SDs that are singleton, doubleton, polymorphic (>2) and shared (>90%) are indicated. The first bar is classified as “shared”, as the assembly is only being compared to itself. c) Schematic depicting the four categories of non-reference SDs: 1) new (i.e., unique in the reference), 2) expanded copy number, 3) content or composition changed, and 4) expanded and content changed SDs with respect to the SDs in the reference genome, T2T-CHM13. d) Quantification in terms of Mbp and predicted protein-coding genes across the four categories of new SDs compared to T2T-CHM13. The left panel shows the Mbp by category, while flagging those that are singleton (i.e., duplicated in T2T-CHM13 but not in other genomes). The right panel quantifies the number of complete (100% coverage) and partial overlaps (>50% coverage) with protein-coding genes for the respective chromosomes.

**Extended Data Fig. 3. Effects of SVs on gene expression, chromosome conformation, and complex traits.**
a) The percentage of Iso-Seq isoforms identified for each individual classified as previously identified in RefSeq (present in at least two individuals; blue), novel (present in at least two individuals; orange), individual-specific previously identified isoforms (red), or individual-specific novel (teal). b) Manhattan plot of the allele frequencies for 256 SVs disrupting protein-coding exons of 136 genes with expression present in Iso-Seq. Circled in red is the 6,142 bp polymorphic deletion in *ZNF718*. c) Comparison of the average unique isoforms in Iso-Seq phased to wild-type and variant haplotypes for 1,471 single SV-containing protein-coding genes. The color represents the type of SV [deletion (DEL): blue, insertion (INS): orange] and the shape indicates where the SV occurs in relation to the canonical transcript [circle: coding sequence (CDS), square: untranslated region (UTR), triangle: intron]. d) Proportion of genes located within 50 kbp of SV regions that show differential expression (DE; RNA-seq) among individuals who carry the SVs (red line), compared with the distribution of DE gene proportions nearby simulated SV regions (1,000 permutations). e) Enrichments and depletions of SVs within GENCODE v45 protein-coding, long noncoding RNA (lncRNA), and pseudogene elements, subdivided into various biotypes. *empirical p < 0.05 from 1,000 permutations with Benjamini-Hochberg correction. ns, nonsignificant. Error bars indicate ±1 s.d. centered on the mean. p-values are listed in Supplementary Table 43. f) Enrichments and depletions of SVs within classes of ENCODE candidate cis-regulatory elements (cCREs). *empirical p < 0.05 from 1,000 permutations with Benjamini-Hochberg correction. ns, nonsignificant. Error bars indicate ±1 s.d. centered on the mean. p-values are listed in Supplementary Table 59. g) A differentially insulated region in individuals with chr1-248444872-INS-63 SV, located nearby the DE gene *OR2T5*, suggests an SV-mediated novel chromatin domain could lead to increased gene expression. n = 7 individuals with the SV and 5 without the SV. Box plots indicate median and first and third quartiles, with whiskers extending to 1.5 times the interquartile range. Two-sided Wilcoxon rank-sum test with Benjamini-Hochberg correction. h) Number of SVs per chromosome that are in high (r² > 0.8) or perfect (r² = 1) linkage disequilibrium (LD) with GWAS SNPs significantly associated with diseases and human traits.

**Extended Data Fig. 4. Genotyping from short-read sequencing data.**
a) Completeness statistics for haplotypes produced from the 1kGP-HC phased set (GRCh38-based) and by genome inference with PanGenie followed by phasing (T2T-CHM13–based). To allow for comparison between the GRCh38- and T2T-CHM13-based callsets, we additionally restricted our analysis to “syntenic” regions of T2T-CHM13, i.e., excluding regions unique to T2T-CHM13. For both phased sets, completeness was computed on a subset of n = 30 individuals. The median is marked in yellow, and the lower and upper limits of each box represent lower and upper quartiles (Q1 and Q3). Lower and upper whiskers are defined as Q1 − 1.5(Q3–Q1) and Q3 + 1.5(Q3–Q1). b) Locityper genotyping accuracy for 10 target loci with the highest average variant-based QV improvement. c) Locityper genotyping results for HLA genes on 61 Illumina short-read HGSVC datasets using three reference panels: HPRC (90 haplotypes), leave-one-out HPRC + HGSVC (HPRC + HGSVC*, 214 haplotypes), and HPRC + HGSVC (full, 216 haplotypes). Accuracy is evaluated as the number of correctly identified allele fields in the corresponding gene nomenclature.

**Extended Data Fig. 5. Assembly of 1,246 human centromeres across 65 diverse human genomes show genetic and epigenetic variation.**
a) Number (left y-axis) and percentage (right y-axis) of centromeres that are completely and accurately assembled among 65 diverse human genomes, colored by population group. Mean, dashed line. b,c) Examples of di-kinetochores, defined as two CDRs located >80 kbp apart from each other, on the b) HG02953 chromosome 6 centromere and c) HG01573 chromosome 15 centromere. UL ONT reads span both CDRs in each case, indicating that the CDRs occur on the same chromosome in the cell population. d) Differences in the α-satellite HOR array organization and methylation patterns between the CHM13 and NA18989 (H1) chromosome 19 centromeres. The NA18989 (H1) chromosome 19 centromere has two CDRs, indicating the potential presence of a di-kinetochore.

Update of

Complex genetic variation in nearly complete human genomes.
Logsdon GA, Ebert P, Audano PA, Loftus M, Porubsky D, Ebler J, Yilmaz F, Hallast P, Prodanov T, Yoo D, Paisie CA, Harvey WT, Zhao X, Martino GV, Henglin M, Munson KM, Rabbani K, Chin CS, Gu B, Ashraf H, Austine-Orimoloye O, Balachandran P, Bonder MJ, Cheng H, Chong Z, Crabtree J, Gerstein M, Guethlein LA, Hasenfeld P, Hickey G, Hoekzema K, Hunt SE, Jensen M, Jiang Y, Koren S, Kwon Y, Li C, Li H, Li J, Norman PJ, Oshima KK, Paten B, Phillippy AM, Pollock NR, Rausch T, Rautiainen M, Scholz S, Song Y, Söylev A, Sulovari A, Surapaneni L, Tsapalou V, Zhou W, Zhou Y, Zhu Q, Zody MC, Mills RE, Devine SE, Shi X, Talkowski ME, Chaisson MJP, Dilthey AT, Konkel MK, Korbel JO, Lee C, Beck CR, Eichler EE, Marschall T. bioRxiv [Preprint]. 2024 Sep 25:2024.09.24.614721. doi: 10.1101/2024.09.24.614721. Update in: Nature. 2025 Aug;644(8076):430-441. doi: 10.1038/s41586-025-09140-6. PMID: 39372794. Free PMC article. Updated. Preprint.

References

1. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
2. Porubsky, D. et al. Gaps and complex structurally variant loci in phased genome assemblies. Genome Res. 33, 496–510 (2023).
3. Ebler, J. et al. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat. Genet. 54, 518–525 (2022).
4. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
5. Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. doi:10.1038/s41587-020-0711-0 (2020).
