This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Sep 25:2024.09.24.614721.

doi: 10.1101/2024.09.24.614721.

Complex genetic variation in nearly complete human genomes

Glennis A Logsdon¹,², Peter Ebert³,⁴, Peter A Audano⁵, Mark Loftus⁶,⁷, David Porubsky², Jana Ebler⁸,⁴, Feyza Yilmaz⁵, Pille Hallast⁵, Timofey Prodanov⁸,⁴, DongAhn Yoo², Carolyn A Paisie⁵, William T Harvey², Xuefang Zhao⁹,¹⁰, Gianni V Martino⁶,⁷,¹¹, Mir Henglin⁸,⁴, Katherine M Munson², Keon Rabbani¹², Chen-Shan Chin¹³, Bida Gu¹², Hufsah Ashraf⁸,⁴, Olanrewaju Austine-Orimoloye¹⁴, Parithi Balachandran⁵, Marc Jan Bonder¹⁵,¹⁶, Haoyu Cheng¹⁷, Zechen Chong¹⁸, Jonathan Crabtree¹⁹, Mark Gerstein²⁰,²¹, Lisbeth A Guethlein²², Patrick Hasenfeld²³, Glenn Hickey²⁴, Kendra Hoekzema², Sarah E Hunt¹⁴, Matthew Jensen²⁰,²¹, Yunzhe Jiang²⁰,²¹, Sergey Koren²⁵, Youngjun Kwon², Chong Li²⁶,²⁷, Heng Li²⁸,²⁹, Jiaqi Li²⁰,²¹, Paul J Norman³⁰,³¹, Keisuke K Oshima¹, Benedict Paten²⁴, Adam M Phillippy²⁵, Nicholas R Pollock³⁰, Tobias Rausch²³, Mikko Rautiainen³², Stephan Scholz³³, Yuwei Song¹⁸, Arda Söylev⁸,⁴, Arvis Sulovari², Likhitha Surapaneni¹⁴, Vasiliki Tsapalou²³, Weichen Zhou³⁴, Ying Zhou²⁸,²⁹, Qihui Zhu⁵,³⁵, Michael C Zody³⁶, Ryan E Mills³⁴, Scott E Devine¹⁹, Xinghua Shi²⁶,²⁷, Mike E Talkowski⁹,¹⁰,³⁷, Mark J P Chaisson¹², Alexander T Dilthey⁴,³³, Miriam K Konkel⁶,⁷, Jan O Korbel²³, Charles Lee⁵, Christine R Beck⁵,³⁸, Evan E Eichler²,³⁹, Tobias Marschall⁸,⁴

Affiliations

¹ Perelman School of Medicine, University of Pennsylvania, Department of Genetics, Epigenetics Institute, Philadelphia, PA, USA.
² Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
³ Core Unit Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University, Düsseldorf, Germany.
⁴ Center for Digital Medicine, Heinrich Heine University, Düsseldorf, Germany.
⁵ The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA.
⁶ Clemson University, Department of Genetics & Biochemistry, Clemson, SC, USA.
⁷ Center for Human Genetics, Clemson University, Greenwood, SC, USA.
⁸ Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University, Düsseldorf, Germany.
⁹ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
¹⁰ Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
¹¹ Medical University of South Carolina, College of Graduate Studies, Charleston, SC, USA.
¹² Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
¹³ Foundation of Biological Data Sciences, Belmont, CA, USA.
¹⁴ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom.
¹⁵ Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, the Netherlands; Oncode Institute, Utrecht, The Netherlands.
¹⁶ Division of Computational Genomics and Systems Genetics, German Cancer Research Center, Heidelberg, Germany.
¹⁷ Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, USA.
¹⁸ Department of Biomedical Informatics and Data Science, Heersink School of Medicine, University of Alabama, Birmingham, AL, USA.
¹⁹ Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA.
²⁰ Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA.
²¹ Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA.
²² Department of Structural Biology, School of Medicine, Stanford University, Stanford, CA, USA.
²³ European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany.
²⁴ UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.
²⁵ Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
²⁶ Temple University, Department of Computer and Information Sciences, College of Science and Technology, Philadelphia, PA, USA.
²⁷ Temple University, Institute for Genomics and Evolutionary Medicine, Philadelphia, PA, USA.
²⁸ Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
²⁹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
³⁰ Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA.
³¹ Department of Immunology and Microbiology, University of Colorado School of Medicine, Aurora, CO, USA.
³² Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland.
³³ Institute of Medical Microbiology and Hospital Hygiene, Medical Faculty, Heinrich Heine University, Düsseldorf, Germany.
³⁴ Department of Computational Medicine & Bioinformatics, University of Michigan, MI, USA.
³⁵ Stanford Health Care, Palo Alto, CA, USA.
³⁶ New York Genome Center, New York, NY, USA.
³⁷ Department of Neurology, Harvard Medical School, Boston, MA, USA.
³⁸ The University of Connecticut Health Center, Farmington, CT, USA.
³⁹ Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.

PMID: 39372794
PMCID: PMC11451754
DOI: 10.1101/2024.09.24.614721

Glennis A Logsdon et al. bioRxiv. 2024.

Update in

Complex genetic variation in nearly complete human genomes.
Logsdon GA, Ebert P, Audano PA, Loftus M, Porubsky D, Ebler J, Yilmaz F, Hallast P, Prodanov T, Yoo D, Paisie CA, Harvey WT, Zhao X, Martino GV, Henglin M, Munson KM, Rabbani K, Chin CS, Gu B, Ashraf H, Scholz S, Austine-Orimoloye O, Balachandran P, Bonder MJ, Cheng H, Chong Z, Crabtree J, Gerstein M, Guethlein LA, Hasenfeld P, Hickey G, Hoekzema K, Hunt SE, Jensen M, Jiang Y, Koren S, Kwon Y, Li C, Li H, Li J, Norman PJ, Oshima KK, Paten B, Phillippy AM, Pollock NR, Rausch T, Rautiainen M, Song Y, Söylev A, Sulovari A, Surapaneni L, Tsapalou V, Zhou W, Zhou Y, Zhu Q, Zody MC, Mills RE, Devine SE, Shi X, Talkowski ME, Chaisson MJP, Dilthey AT, Konkel MK, Korbel JO, Lee C, Beck CR, Eichler EE, Marschall T. Nature. 2025 Aug;644(8076):430-441. doi: 10.1038/s41586-025-09140-6. Epub 2025 Jul 23. PMID: 40702183. Free PMC article.

Abstract

Diverse sets of complete human genomes are required to construct a pangenome reference and to understand the extent of complex structural variation. Here, we sequence 65 diverse human genomes and build 130 haplotype-resolved assemblies (130 Mbp median continuity), closing 92% of all previous assembly gaps¹,² and reaching telomere-to-telomere (T2T) status for 39% of the chromosomes. We highlight complete sequence continuity of complex loci, including the major histocompatibility complex (MHC), SMN1/SMN2, NBPF8, and AMY1/AMY2, and fully resolve 1,852 complex structural variants (SVs). In addition, we completely assemble and validate 1,246 human centromeres. We find up to 30-fold variation in α-satellite higher-order repeat (HOR) array length and characterize the pattern of mobile element insertions into α-satellite HOR arrays. While most centromeres predict a single site of kinetochore attachment, epigenetic analysis suggests the presence of two hypomethylated regions for 7% of centromeres. Combining our data with the draft pangenome reference¹ significantly enhances genotyping accuracy from short-read data, enabling whole-genome inference³ to a median quality value (QV) of 45. Using this approach, 26,115 SVs per sample are detected, substantially increasing the number of SVs now amenable to downstream disease association studies.
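To put the reported median QV of 45 in context, the following is a minimal sketch that assumes QV follows the standard Phred scale, QV = −10·log10(per-base error probability). The abstract does not restate the formula, so treat this as an assumption; the function names are hypothetical and are not taken from the study's pipeline.

```python
import math


def qv_from_error_rate(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value.

    Assumes the standard Phred definition; the study's exact QV
    estimation procedure is described in its Methods, not here.
    """
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)


def error_rate_from_qv(qv: float) -> float:
    """Invert the Phred scale: QV -> per-base error probability."""
    return 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    rate = error_rate_from_qv(45.0)
    # Under the Phred assumption, QV 45 is roughly one error per ~31,600 bases.
    print(f"QV 45 -> error rate {rate:.3e}, about 1 error per {1 / rate:,.0f} bases")
```

Under this assumption, each 10-point increase in QV corresponds to a tenfold lower error rate.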

Conflict of interest statement

Competing Interests: E.E.E. is a scientific advisory board member of Variant Bio, Inc. C. Lee is a scientific advisory board member of Nabsys and Genome Insight. S.K. has received travel funds to speak at events hosted by Oxford Nanopore Technologies. The following authors have previously disclosed a patent application (No. EP19169090) relevant to Strand-seq: J.O.K., T.M., and D.P. The other authors declare no competing interests.

Figures

**Extended Data Figure 1. Statistics of long-read sequencing data and genome assemblies generated in this study as well as variant calls for 65 diverse human genomes.**
a) Fold coverage of the Pacific Biosciences (PacBio) high-fidelity (HiFi) and Oxford Nanopore Technologies (ONT) long-read sequencing data generated for each genome in this study. The median (solid line) and first and third quartiles (dotted lines) are shown. b) Read length N50 of the PacBio HiFi and ONT data generated for each genome in this study. The median (solid line) and first and third quartiles (dotted lines) are shown. c) Gene completeness as a percentage of BUSCO single-copy orthologs detected in each haplotype from each genome assembly (Methods). d) The number of structural variants (SVs) detected by the Phased Assembly Variant (PAV) caller. Before applying caller-based QC, 99.75% of PAV calls are supported by at least one other call source. PAV, variant supported by PAV; PAV (trimmed), variant was removed when PAV trimmed repetitive bases mapped multiple times; Covered, region covered by an assembly, but no comparable SV found by PAV; No Assembly, SV occurs in a region where an assembly sequence was not aligned. e) Number of SVs called for each haplotype relative to the GRCh38 reference genome, colored by population. Insertions and deletions are imbalanced when called against the GRCh38 reference genome but balanced when called against the T2T-CHM13 reference genome (Fig. 1g). f) Number of SV insertions (left) and deletions (right) called against the T2T-CHM13 reference genome, GRCh38 reference genome, or both relative to their allele frequency. SVs called against both references tend to be more rare because they are less likely to appear in a reference genome. A sharp peak for high allele frequency (~1.0) for insertions is detected relative to the GRCh38 reference genome but not the T2T-CHM13 reference genome.

**Extended Data Figure 2. Classification and distribution of changes in SD content in the 65 genomes.**
a) Schematic depicting the four categories of non-reference SDs: 1) new (i.e., unique in the reference), 2) expanded copy number, 3) content or composition changed, and 4) expanded and content changed SDs with respect to the SDs in the reference genome, T2T-CHM13. b) Quantification in terms of Mbp and predicted protein-coding genes across the four categories of new SDs compared to T2T-CHM13. The left panel shows the Mbp by category, while flagging those that are singleton (i.e., duplicated in T2T-CHM13 but not in other genomes). The right panel quantifies the number of complete (100% coverage) and partial overlaps (>50% coverage) with protein-coding genes for the respective chromosomes.

**Extended Data Figure 3. Effects of SVs on gene expression, chromosome conformation, and complex traits.**
a) The percentage of Iso-Seq isoforms identified for each sample classified as novel (present in at least two samples; orange), previously identified in RefSeq (present in at least two samples; blue), sample-specific novel (teal), or sample-specific previously identified isoforms (red). b) Manhattan plot of the allele frequencies for 256 SVs disrupting protein-coding exons of 136 genes with expression present in Iso-Seq. Circled in red is the 6,142 bp polymorphic deletion in *ZNF718*. c) Comparison of the average unique isoforms in Iso-Seq phased to wild-type and variant haplotypes for 1,471 single SV-containing protein-coding genes. The color represents the type of SV (deletion: blue, insertion: orange) and the shape indicates where the SV occurs in relation to the canonical transcript (circle: coding sequence [CDS], square: UTR, triangle: intron). d) Proportion of genes located within 50 kbp of SV regions that show differential expression (DE) (RNA-seq) among individuals who carry the SVs (red line), compared with the distribution of DE gene proportions near simulated SV regions (1,000 permutations). e) Enrichments and depletions of SVs within GENCODE v45 protein-coding, long noncoding RNA (lncRNA), and pseudogene elements, subdivided into various biotypes. *empirical p<0.05 with Benjamini-Hochberg correction. ns, nonsignificant. Error bars indicate s.d. f) Enrichments and depletions of SVs within classes of ENCODE candidate cis-regulatory elements (cCREs). *empirical p<0.05 with Benjamini-Hochberg correction. ns, nonsignificant. Error bars indicate s.d. g) A differentially insulated region (DIR) in individuals with the chr1-248444872-INS-63 SV, located near the DE gene *OR2T5*, suggests an SV-mediated novel chromatin domain could lead to increased gene expression. Box plots indicate first and third quartile, with whiskers extending to 1.5 times the interquartile range.
h) Number of SVs per chromosome that are in high (r²>0.8) or perfect (r²=1) linkage disequilibrium (LD) with GWAS SNPs significantly associated with diseases and human traits.

**Extended Data Figure 4. Locityper genotyping accuracy across 33 genes/pseudogenes, located at the MHC locus.**
Genotyping was performed for 61 Illumina short-read HGSVC datasets using three reference panels: HPRC (90 haplotypes), leave-one-out HPRC + HGSVC (LOO, 214 haplotypes), and HPRC + HGSVC (full, 216 haplotypes). Accuracy is evaluated as the number of correctly identified allele fields in the corresponding gene nomenclature.

**Extended Data Figure 5. Assembly of 1,246 human centromeres across 65 diverse human genomes shows genetic and epigenetic variation.**
a) Number of completely and accurately assembled centromeres across 65 diverse human genomes, colored by population group. The dashed line indicates the mean. b,c) Examples of di-kinetochores, defined as two CDRs located >80 kbp apart from each other, on the b) HG02953 chromosome 6 centromere and c) HG01573 chromosome 15 centromere. Ultra-long ONT reads span both CDRs in each case, indicating that the CDRs occur on the same chromosome in the cell population. d) Differences in the α-satellite HOR array organization and methylation patterns between the CHM13 and NA18989 (H1) chromosome 19 centromeres. The NA18989 (H1) chromosome 19 centromere has two CDRs, indicating the potential presence of a di-kinetochore. e) Mobile element insertions (MEIs) in the chromosome 2 centromeric α-satellite HOR array. Most MEIs are consistent with duplications of the same element rather than distinct insertions, and all of them reside outside of the CDR.

**Figure 1. Long-read sequencing, assembly, and variant calling of 65 diverse human samples.**
a) Continental group (inner ring) and population group (outer ring) of the 65 diverse human samples analyzed in this study. b) Scaffold auN for haplotype 1 (H1) and haplotype 2 (H2) contigs from each genome assembly. Data points are color-coded by population and sex. Dashed lines indicate the median auN per haplotype. The dotted line indicates the unit diagonal. c) QV estimates for each genome assembly derived from variant calls or k-mer statistics (Methods). d) The number of chromosomes assembled from telomere-to-telomere (T2T) for each genome assembly, including both single contigs and scaffolds (Methods). The median (solid line) and first and third quartiles (dotted lines) are shown. e) The number of T2T chromosomes in a single contig (dark blue, T2T contig) or in a single scaffold (light blue, T2T scaffold). Incomplete chromosomes are labeled as “Not T2T” or “Missing” if missing entirely. Sex chromosomes not present in the respective haploid assembly are labeled as “N/A”. f) Cumulative nonredundant structural variants (SVs) across the diverse haplotypes in this study called with respect to the T2T-CHM13 reference genome (three trio children excluded). g) Number of SVs detected for each haplotype relative to the T2T-CHM13 reference genome, colored by population. Insertions and deletions are balanced when called against the T2T-CHM13 reference genome but imbalanced when called against the GRCh38 reference genome (Extended Data Fig. 1d).
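Panel b reports scaffold auN, a contiguity metric. As a rough illustration, the sketch below uses the commonly cited definition of auN (the sum of squared sequence lengths divided by the total length, equivalently the area under the Nx curve) and compares it with the more familiar N50. This definition is an assumption about how the metric was computed here; the study's Methods are authoritative, and the function names and toy lengths are hypothetical.

```python
from typing import Iterable


def aun(lengths: Iterable[int]) -> float:
    """Length-weighted mean sequence length: sum(L_i^2) / sum(L_i)."""
    lengths = list(lengths)
    total = sum(lengths)
    if total <= 0:
        raise ValueError("need at least one sequence with positive length")
    return sum(length * length for length in lengths) / total


def n50(lengths: Iterable[int]) -> int:
    """Length of the sequence at which the running total, taken from the
    longest sequence downward, first reaches half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("need at least one sequence")


if __name__ == "__main__":
    toy_scaffolds = [100, 50, 30, 20]  # hypothetical lengths, arbitrary units
    print("auN:", aun(toy_scaffolds))
    print("N50:", n50(toy_scaffolds))
```

Unlike N50, which depends on a single sequence in the ranked list, auN changes whenever any sequence is joined or broken, which is one reason it may be preferred for comparing assemblies.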

**Figure 2. An improved genomic resource for challenging loci.**
a) Number of segmentally duplicated bases assembled in different regions of the genome for each sample in this study, excluding sex chromosomes. The dashed line indicates the number of segmentally duplicated bases in the T2T-CHM13 genome. b) Segmental duplication (SD) accumulation curve. Starting with T2T-CHM13, the SDs (excluding those located in acrocentric regions and chrY) of 63 samples (excluding NA19650 and NA19434) were projected onto T2T-CHM13 genome space in the continental group order of: EUR, AMR, EAS, SAS and AFR. For each bar, the SDs that are singleton, doubleton, polymorphic (>2) and shared (>90%) are indicated. c) Structure of a human Y chromosome on the basis of T2T-CHM13 chromosome Y reference sequence, including the centromere (CEN; top). On the bottom, repeat composition of four contiguously assembled Yq12 heterochromatic regions with their phylogenetic relationships shown on the left. The size of the region and the number of *DYZ1* and *DYZ2* repeat array blocks are shown on the right. Locations of four inserted and subsequently amplified *Alu* elements on Yq12 are shown as triangles. d) Comparison of total Iso-Seq reads that failed to align at ≥99% accuracy for T2T-CHM13 vs. the assemblies in this study (left), and comparison of total bases aligned to T2T-CHM13 vs. the assemblies in this study among reads that aligned to both at ≥99% accuracy (right). e) Expressed isoforms of *ZNF718* identified in NA19317. This individual is heterozygous for a deletion that impacts the exon-intron structure of *ZNF718* (deleting exons 2 and 3 and part of the alternate first exon 1b). Repeat classes are annotated by color at the bottom. The wild-type allele harbors a single, previously unreported isoform consisting of a canonical first exon and second exon that is typically reported as alternate first exon 1b (yellow, wild-type). 
The presence of the 6,142 bp long deletion on chr4:127,125–133,267 is associated with four isoforms not previously annotated in RefSeq, GENCODE, or CHESS (variant, yellow). All four novel isoforms begin at the canonical transcription start site, contain part of exon 1b, and lack canonical exons 2 and 3.

**Figure 3. Genotyping from short-read sequencing data.**
a) Number of rare SVs, defined as those with an allele frequency of <1%, in each callset. We compared the HPRC genotyped callset (gray), the Illumina-based 1kGP-HC SV callset (orange), the combined HPRC and HGSVC genotyped callset (blue) for both non-African (non-AFR) and African (AFR) samples (n=3,202). The boxes inside the violins represent the first and third quartiles of the data, white dots represent the medians, and black lines mark minima and maxima of the data. b) Estimated QV for a subset of 60 haplotypes (Supplementary Methods) from the 1kGP-HC phased set (GRCh38-based), HGSVC phased genotypes (T2T-CHM13-based), and all HGSVC genome assemblies. To allow comparison between the GRCh38- and T2T-CHM13-based sets, we additionally restricted our QV analysis to “syntenic” regions of T2T-CHM13, i.e., excluding regions unique to T2T-CHM13. The red dotted line corresponds to the baseline QV that we estimated by randomizing sample labels (i.e., using PanGenie-based consensus haplotypes and reads from different samples). The median is marked in yellow and the lower and upper limits of each box represent lower and upper quartiles (Q1 and Q3). Lower and upper whiskers are defined as Q1 − 1.5(Q3–Q1) and Q3 + 1.5(Q3–Q1), and dots mark the outliers. c) Completeness statistics for haplotypes produced from the 1kGP-HC phased set (GRCh38-based) and the HGSVC phased genotypes (T2T-CHM13–based). To allow for comparison between the GRCh38- and T2T-CHM13-based callsets, we additionally restricted our analysis to “syntenic” regions of T2T-CHM13, i.e., excluding regions unique to T2T-CHM13. For both phased sets, completeness was computed on a subset of 30 samples. d) Haplotype availability, Locityper genotyping accuracy, and trio concordance across 347 polymorphic loci. Availability and accuracy are calculated for 61 HGSVC samples, while trio concordance is calculated for 602 trios. 
Results are grouped by the reference panel [HPRC-only, HPRC + HGSVC leave-one-out (LOO), and HPRC + HGSVC]. e) Locityper genotyping accuracy for 10 target loci with the highest average QV improvement.
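The whisker limits given for panel b, Q1 − 1.5(Q3 − Q1) and Q3 + 1.5(Q3 − Q1), can be sketched as follows. The quartile interpolation method (`method="inclusive"`) is an assumption, since the caption does not say how quartiles were computed, and the function names and sample values are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def tukey_fences(values: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (Q1, Q3, lower limit, upper limit) using the 1.5 * IQR rule."""
    # Assumption: inclusive quartile interpolation; plotting libraries may differ.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values: Sequence[float]) -> List[float]:
    """Values falling outside the whisker limits, drawn as dots in a box plot."""
    _q1, _q3, lower, upper = tukey_fences(values)
    return [v for v in values if v < lower or v > upper]


if __name__ == "__main__":
    toy_qv = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical values
    print(tukey_fences(toy_qv))
    print(outliers(toy_qv))
```

Note that many plotting tools draw each whisker to the most extreme data point inside these limits rather than to the limit itself.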

**Figure 4. Structurally variable regions of the MHC locus.**
a) Overview of the organization of the MHC locus into class I, class II, and class III regions and the genes contained therein. Structurally variable regions are indicated by dashed lines. Colored stripes show the approximate location of the regions analyzed in panels b-d. b) Gene content and locations of solitary *HLA-DRB* exon 1 and intron 1 sequences in the HLA-DR region of the MHC locus by DR group, an established system for classifying haplotypes in the HLA-DR region according to their gene/pseudogene structure and their *HLA-DRB1* allele. Also shown is the number of analyzed MHC haplotypes per DR group. c) High-resolution repeat maps and locations of gene/pseudogene exons for different DR group haplotypes in the HLA-DR region, highlighting sequence homology between the DR1 and DR4/7/9 and DR2, and between the DR8 and DR3/5/6, haplotype groups, respectively. d) Visualization of common and notable RCCX haplotype structures observed in the HGSVC MHC haplotypes, showing variation in gene and pseudogene content as well as the modular structure of RCCX (S, *STK19*; black C, nonfunctional *CYP21A2*; white C, functional *CYP21A2*; *C4L*/S, long [HERV-K insertion]/short [no HERV-K insertion]). e) Visualization of a PGR-TK analysis of 55 MHC samples and T2T-CHM13 for 111 haplotypes in total. Colors indicate the relative proportion of distinct DR group haplotypes flowing through the visualized elements.

**Figure 5. Complex SVs in human populations.**
a) An SD-mediated CSV inverts *NBPF8* and deletes two genes. Inverted SD pairs (orange and yellow bands) each mediate a template switch (dashed lines “1” and “2”). The resulting CSV inverts *NBPF8* and deletes *NOTCH2NLR* and *NBPF26*. The single recombined copy of each SD is aligned to both reference copies, obscuring the structure of the complex event by eliminating one deletion and changing the size of the inversion and the larger deletion. PAV recognizes these artifacts and refines alignments to obtain a more accurate representation of complex structures. The complex allele shown is HG00171 haplotype1-0000011. b) Fraction of all assemblies having complete and accurate sequence over the SMN region, stratified by study (HGSVC, HPRC-yr1). c) Copy number (full and partial gene alignments) of each multi-copy gene (*SMN1/2*, red; *SERF1A/B*, green; *NAIP*, gold; *GTF2H2/C*, blue) across all human haplotypes (n=101). d) Visualization of DupMasker duplicons defined in 11 diverse human haplotypes spanning the SMN region. Panel depicts data from this study, the HPRC (HG02486), and one *Pongo pygmaeus* haplotype (top) used as an outgroup. e) Summary of *SMN1* (yellow) and *SMN2* (red) gene copies genotyped across human haplotypes (n=101). Yellow and red bars show a unique copy number of *SMN1* and *SMN2* while pie charts show proportions of continental groups carrying a given haplotype. Haplotypes that carry only the *SMN2* gene copy are highlighted by the asterisks. f) The amylase locus of the human genome is depicted. The H3r.4 haplotype represents the most common haplotype, H5.15 and H7.2 are haplotypes previously unresolved at the base-pair level, and H11.1 is a novel, previously undetected haplotype. Amylase gene annotations are displayed above each haplotype structure. The structure of each amylase haplotype, composed of amylase segments, is indicated by colored arrows. Sequence similarity between haplotypes ranges from 99% to 100%.
The alignments highlight differences between the amylase haplotypes.

**Figure 6. Variation in the sequence, structure, and methylation pattern among 1,246 human centromeres.**
a) Length of the α-satellite higher-order repeat (HOR) array(s) for each complete and accurately assembled centromere from each genome. Each data point indicates an active α-satellite HOR array and is colored by population. The median length of all α-satellite HOR arrays is shown as a dashed line. For each chromosome, the median (solid line) and first and third quartiles (dashed lines) are shown. b) Sequence, structure, and methylation map of centromeres from the CHM13, CHM1, and a subset of 65 diverse human genomes. The α-satellite HORs are colored by the number of α-satellite monomers within them, and the site of the putative kinetochore, known as the “centromere dip region” or “CDR”, is shown. c) Differences in the α-satellite HOR array organization and methylation patterns between the CHM13 and HG00513 (H1) chromosome 10 centromeres. The CDRs are located on highly identical sequences in both centromeres, despite their differing locations. d) Mobile element insertions (MEIs) in the chromosome 2 centromeric α-satellite HOR array. Most MEIs are consistent with duplications of the same element rather than distinct insertions, and all of them reside outside of the CDR.

References

1. Liao W-W, Asri M, Ebler J, Doerr D, Haukness M, Hickey G, Lu S, Lucas JK, Monlong J, Abel HJ, Buonaiuto S, Chang XH, Cheng H, Chu J, Colonna V, Eizenga JM, Feng X, Fischer C, Fulton RS, Garg S, Groza C, Guarracino A, Harvey WT, Heumos S, Howe K, Jain M, Lu T-Y, Markello C, Martin FJ, Mitchell MW, Munson KM, Mwaniki MN, Novak AM, Olsen HE, Pesout T, Porubsky D, Prins P, Sibbesen JA, Sirén J, Tomlinson C, Villani F, Vollger MR, Antonacci-Fulton LL, Baid G, Baker CA, Belyaeva A, Billis K, Carroll A, Chang P-C, Cody S, Cook DE, Cook-Deegan RM, Cornejo OE, Diekhans M, Ebert P, Fairley S, Fedrigo O, Felsenfeld AL, Formenti G, Frankish A, Gao Y, Garrison NA, Giron CG, Green RE, Haggerty L, Hoekzema K, Hourlier T, Ji HP, Kenny EE, Koenig BA, Kolesnikov A, Korbel JO, Kordosky J, Koren S, Lee H, Lewis AP, Magalhães H, Marco-Sola S, Marijon P, McCartney A, McDaniel J, Mountcastle J, Nattestad M, Nurk S, Olson ND, Popejoy AB, Puiu D, Rautiainen M, Regier AA, Rhie A, Sacco S, Sanders AD, Schneider VA, Schultz BI, Shafin K, Smith MW, Sofia HJ, Abou Tayoun AN, et al. A draft human pangenome reference. Nature. 2023;617:312–324.
2. Porubsky D, Vollger MR, Harvey WT, Rozanski AN, Ebert P, Hickey G, Hasenfeld P, Sanders AD, Stober C, Human Pangenome Reference Consortium, Korbel JO, Paten B, Marschall T, Eichler EE. Gaps and complex structurally variant loci in phased genome assemblies. Genome Res. 2023;33:496–510.
3. Ebler J, Ebert P, Clarke WE, Rausch T, Audano PA, Houwaart T, Mao Y, Korbel JO, Eichler EE, Zody MC, Dilthey AT, Marschall T. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat Genet. 2022;54:518–525.
4. Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18:170–175.
5. Jarvis ED, Formenti G, Rhie A, Guarracino A, Yang C, Wood J, Tracey A, Thibaud-Nissen F, Vollger MR, Porubsky D, Cheng H, Asri M, Logsdon GA, Carnevali P, Chaisson MJP, Chin C-S, Cody S, Collins J, Ebert P, Escalona M, Fedrigo O, Fulton RS, Fulton LL, Garg S, Gerton JL, Ghurye J, Granat A, Green RE, Harvey W, Hasenfeld P, Hastie A, Haukness M, Jaeger EB, Jain M, Kirsche M, Kolmogorov M, Korbel JO, Koren S, Korlach J, Lee J, Li D, Lindsay T, Lucas J, Luo F, Marschall T, Mitchell MW, McDaniel J, Nie F, Olsen HE, Olson ND, Pesout T, Potapova T, Puiu D, Regier A, Ruan J, Salzberg SL, Sanders AD, Schatz MC, Schmitt A, Schneider VA, Selvaraj S, Shafin K, Shumate A, Stitziel NO, Stober C, Torrance J, Wagner J, Wang J, Wenger A, Xiao C, Zimin AV, Zhang G, Wang T, Li H, Garrison E, Haussler D, Hall I, Zook JM, Eichler EE, Phillippy AM, Paten B, Howe K, Miga KH, Human Pangenome Reference Consortium. Semi-automated assembly of high-quality diploid human reference genomes. Nature. 2022;611:519–531.
