Nature. 2021 May;593(7857):101-107.
doi: 10.1038/s41586-021-03420-7. Epub 2021 Apr 7.

The structure, function and evolution of a complete human chromosome 8

Glennis A Logsdon et al. Nature. 2021 May.

Abstract

The complete assembly of each human chromosome is essential for understanding human biology and evolution (refs. 1, 2). Here we use complementary long-read sequencing technologies to complete the linear assembly of human chromosome 8. Our assembly resolves the sequence of five previously long-standing gaps, including a 2.08-Mb centromeric α-satellite array, a 644-kb copy number polymorphism in the β-defensin gene cluster that is important for disease risk, and an 863-kb variable number tandem repeat at chromosome 8q21.2 that can function as a neocentromere. We show that the centromeric α-satellite array is generally methylated except for a 73-kb hypomethylated region of diverse higher-order α-satellites enriched with CENP-A nucleosomes, consistent with the location of the kinetochore. In addition, we confirm the overall organization and methylation pattern of the centromere in a diploid human genome. Using a dual long-read sequencing approach, we complete high-quality draft assemblies of the orthologous centromere from chromosome 8 in chimpanzee, orangutan and macaque to reconstruct its evolutionary history. Comparative and phylogenetic analyses show that the higher-order α-satellite structure evolved in the great ape ancestor with a layered symmetry, in which more ancient higher-order repeats locate peripherally to monomeric α-satellites. We estimate that the mutation rate of centromeric satellite DNA is accelerated by more than 2.2-fold compared to the unique portions of the genome, and this acceleration extends into the flanking sequence.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Telomere-to-telomere assembly of human chromosome 8.
a, Gaps in the GRCh38 chromosome 8 reference sequence. b, Targeted assembly method to resolve complex repeat regions in the human genome. Ultra-long ONT reads (grey) are barcoded with SUNKs (coloured bars) and assembled into a sequence scaffold. Regions within the scaffold sharing high sequence identity with PacBio HiFi contigs (dark grey) are replaced, improving the base accuracy to greater than 99.99%. The PacBio HiFi assembly is integrated into an assembly of CHM13 chromosome 8 (ref. ) and validated. c, Sequence, structure, methylation status and genetic composition of the CHM13 β-defensin locus. The locus contains three segmental duplications (dups) at chr8:7098892–7643091, chr8:11528114–12220905 and chr8:12233870–12878079. A 4,110,038-bp inversion (chr8:7500325–11610363) separates the first and second duplications. Iso-Seq data reveal that the third duplication (light blue) contains 12 new protein-coding genes, five of which are DEFB genes (Extended Data Fig. 3g). d, Copy number of the DEFB genes (chr8:7783837−7929198 in GRCh38) throughout the human population, determined from a collection of 1,105 high-coverage genomes (Methods). Data are median ± s.d.
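Panel b relies on SUNKs (singly unique nucleotide k-mers), that is, k-mers that occur exactly once in a reference sequence and can therefore act as positional barcodes on long reads. A minimal sketch of the idea follows, assuming a toy sequence, a small k and forward-strand matching only; the function name `find_sunks` is hypothetical and this is not the authors' pipeline.

```python
from collections import Counter


def find_sunks(sequence: str, k: int) -> dict[str, int]:
    """Return {k-mer: start position} for k-mers seen exactly once.

    Assumption: only the forward strand of a single sequence is scanned.
    A genome-scale analysis would count k-mers across the whole assembly.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return {kmer: pos for pos, kmer in enumerate(kmers) if counts[kmer] == 1}


# Toy example: "ACGT" repeats, so it cannot serve as a barcode.
toy = "ACGTACGTTTGACGTA"
sunks = find_sunks(toy, k=4)
```

In this toy case the repeated k-mers (`ACGT`, `CGTA`) are excluded, while k-mers such as `GTAC` are retained together with their positions.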
Fig. 2. Sequence, structure and epigenetic map of the chromosome 8 centromeric region.
a, Schematic showing the composition of the CHM13 chromosome 8 centromere. The centromeric region consists of a 2.08-Mb D8Z2 α-satellite HOR array flanked by regions of monomeric and/or divergent α-satellite interspersed with retrotransposons, β-satellite and γ-satellite. The predicted restriction digest pattern is shown. The D8Z2 α-satellite HOR array is heavily methylated except for a 73-kb hypomethylated region, which is contained within a 632-kb CENP-A chromatin domain (Extended Data Fig. 9, Supplementary Fig. 8). A pairwise sequence identity heat map indicates that the centromere is composed of five distinct evolutionary layers (dashed arrows). b, Pulsed-field gel Southern blot of CHM13 DNA confirms the structure and organization of the chromosome 8 centromeric HOR array. Left, ethidium bromide (EtBr) staining; right, 32P-labelled chromosome 8 α-satellite-specific probe. n = 2. See Supplementary Fig. 9a, b for gel source data. c, Representative images of a CHM13 chromatin fibre showing CENP-A enrichment in an unmethylated region. n = 3. Scale bar, 1 μm.
Fig. 3. Sequence and structure of the chimpanzee, orangutan, and macaque chromosome 8 centromeres.
a–d, Structure and sequence identity of the chimpanzee (H1) (a), chimpanzee (H2) (b), orangutan (c) and macaque (d) chromosome 8 centromeres. Each centromere has a mirrored organization consisting of four or five distinct evolutionary layers. The size of each centromeric region is consistent with microscopic analyses, showing increasingly bright DAPI staining with increasing centromere size. See Supplementary Figs. 10 and 11 for sequence identity heat maps plotted on the same colour scale. H1, haplotype 1; H2, haplotype 2. Scale bar, 1 μm.
Fig. 4. Evolution of the chromosome 8 centromere.
a, Phylogenetic tree of human, chimpanzee, orangutan and macaque α-satellites from the chromosome 8 centromeric regions (Supplementary Fig. 6a, b). b, Plot showing the sequence divergence between CHM13 and nonhuman primates in the regions flanking the chromosome 8 α-satellite HOR array. See Supplementary Fig. 6d for a model of centromere evolution.
Extended Data Fig. 1. Sequence, structure and epigenetic map of the neocentromeric chromosome 8q21.2 VNTR.
a, Schematic showing the composition of the CHM13 8q21.2 VNTR. This VNTR consists of 67 full and 7 partial 12.192-kb repeats that span 863 kb in total. The predicted restriction digest pattern is indicated. Each repeat is methylated within a 3-kb region and hypomethylated within the rest of the sequence. Mapping of CENP-A ChIP–seq data from the chromosome 8 neodicentric cell line known as MS4221 (Methods) reveals that approximately 98% of CENP-A chromatin is located within the hypomethylated portion of the repeat. A pairwise sequence identity heat map across the region indicates a mirrored symmetry within a single layer, consistent with the evolutionarily young status of the tandem repeat. b, Pulsed-field gel Southern blot of CHM13 DNA digested with BmgBI confirms the size and organization of the chromosome 8q21.2 VNTR. Left, ethidium bromide staining; right, 32P-labelled chromosome 8q21.2-specific probe. For gel source data, see Supplementary Fig. 1c, d. c, Copy number of the 8q21 repeat (chr8:85792897−85805090 in GRCh38) throughout the human population. CHM13 is estimated to have 144 total copies of the 8q21 repeat, or 72 copies per haplotype, whereas GRCh38 only has 26 copies (red data points). Median ± s.d. is shown.
Extended Data Fig. 2. CHM13 chromosome 8 telomeres.
a, Schematic showing the first and last megabase of the CHM13 chromosome 8 assembly. A dot plot of the terminal 5 kb shows high sequence identity among the last approximately 2.5 kb of the chromosome, consistent with the presence of a high-identity telomeric repeating unit. b, c, Number of TTAGGG telomeric repeats in the last 5 kb of the p-arm (b) and q-arm (c) in chromosome 8. The p-arm has a gradual transition to pure TTAGGG repeats over nearly 1 kb, whereas the q-arm has a very sharp transition to pure TTAGGG repeats that occurs over nearly 300 bp.
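The repeat counts in panels b and c can be approximated by counting the telomeric motif in fixed windows along the terminal sequence. The sketch below is illustrative only: the window size and the function name `telomere_repeat_counts` are assumptions, and the input is assumed to be oriented so that the repeat reads TTAGGG.

```python
def telomere_repeat_counts(seq: str, window: int = 100,
                           motif: str = "TTAGGG") -> list[int]:
    """Count occurrences of `motif` in consecutive, non-overlapping windows.

    Motif copies that straddle a window boundary are not counted, so the
    result slightly underestimates the true number of repeats.
    """
    seq = seq.upper()
    return [seq[start:start + window].count(motif)
            for start in range(0, len(seq), window)]


# Toy sequence: 24 bp without the motif, then four pure TTAGGG repeats.
toy = "ACGT" * 6 + "TTAGGG" * 4
counts = telomere_repeat_counts(toy, window=24)
```

A gradual transition to pure repeats, as described for the p-arm, would appear as counts rising over several windows, whereas a sharp transition, as described for the q-arm, would appear as an abrupt jump.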
Extended Data Fig. 3. Genes with improved alignment to the CHM13 chromosome 8 assembly relative to GRCh38.
a, Ideogram of chromosome 8 showing protein-coding genes with improved transcript alignments to the CHM13 chromosome 8 assembly relative to GRCh38 (hg38). Each gene is labelled with its name, count of improved transcripts from the CHM13 cell line, count of improved transcripts from other tissues, the average percent improvement of non-CHM13 cell line alignments, and the number of tissue sources with improved transcript mappings. b, c, Differential percentage sequence identity of transcripts aligning to CHM13 or GRCh38 for CHM13 cell line transcripts (b) and non-CHM13 cell line transcripts (c). d–f, Multiple-sequence alignments for WDYHV1 (d), MCPH1 (e) and PCMTD1 (f), all of which have at least 0.1% greater sequence identity of >20 full-length Iso-Seq transcripts to the CHM13 chromosome 8 assembly than to GRCh38 (Methods). For each gene, the GRCh38 annotation is compared to the same annotation lifted over to the CHM13 chromosome 8 assembly, and the substitutions are confirmed by translated predicted open reading frames from Iso-Seq transcripts. Matching amino acids are shaded in grey, those matching only the Iso-Seq data are in red, and those different from the Iso-Seq data are in blue. Each substitution in CHM13 relative to GRCh38 has an allele frequency of 0.36 in gnomAD (v3). g, Location of DEFA and DEFB genes in the CHM13 chromosome 8 β-defensin locus. Segmental duplication regions were identified by SEDEF, and new paralogues are shown in red. Duplication cassettes are marked with arrows indicating orientation for each copy.
Extended Data Fig. 4. Comparison of the CHM13 and GRCh38 β-defensin loci.
Miropeats comparison of the CHM13 and GRCh38 β-defensin loci identifies a 4.11-Mb inverted region (dashed grey line) bracketed by proximal and distal segmental duplications (dup; black and blue arrows) in CHM13. CHM13 also has an additional segmental duplication (blue arrow) relative to the GRCh38. In total, the CHM13 haplotype adds 611.9 kb of new sequence, of which 602.6 kb is located within segmental duplications and 9.3 kb is located at the distal edge of the inverted region. Coloured segments track blocks of homology between CHM13 and GRCh38.
Extended Data Fig. 5. Validation of the CHM13 β-defensin locus, and copy number of the DEFA gene family.
a, Coverage of CHM13 ONT and PacBio HiFi data along the CHM13 β-defensin locus (top two panels). The ONT and PacBio data have largely uniform coverage, indicating it is free of large structural errors. The dip in HiFi coverage near position 10.46 Mb is due to a G/A bias in HiFi chemistry. The alignment of 47 CHM13 BACs (bottom) reveals that those regions have an estimated quality value score >25 (>99.7% accurate). b, Copy number of DEFA (chr8:6976264−6995380 in GRCh38 (hg38)) throughout the human population. Median ± s.d. is shown.
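The caption pairs a quality value (QV) above 25 with an accuracy above 99.7%. Assuming QV follows the standard Phred convention, QV = -10·log10(error rate), the conversion can be sketched as follows; the function names are illustrative.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1.0 - 10 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert per-base accuracy (0 < accuracy < 1) to a Phred-scaled QV."""
    return -10.0 * math.log10(1.0 - accuracy)


acc_qv25 = qv_to_accuracy(25)      # accuracy implied by QV 25
qv_9999 = accuracy_to_qv(0.9999)   # QV implied by 99.99% accuracy
```

Under this convention QV 25 corresponds to roughly 99.68% accuracy, which rounds to the 99.7% quoted here, and the 99.99% base accuracy quoted for the HiFi-corrected assembly corresponds to about QV 40.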
Extended Data Fig. 6. Validation of the CHM13 chromosome 8 centromeric region.
a, Coverage of CHM13 ONT and PacBio HiFi data along the CHM13 chromosome 8 centromeric region (top two panels) is largely uniform, indicating a lack of large structural errors. Analysis with TandemMapper and TandemQUAST, which are tools that assess repeat structure via mapped reads (third panel) and misassembly breakpoints (fourth panel; red), indicates that the chromosome 8 D8Z2 α-satellite HOR array lacks large-scale assembly errors. Five different FISH probes targeting regions in the chromosome 8 centromeric region (bottom) are used to confirm the organization of the α-satellite DNA (b, c). b, c, Representative images of metaphase chromosome spreads hybridized with FISH probes targeting regions within the chromosome 8 centromere (a). Insets show both chromosome 8s with the predicted organization of the centromeric region. d, Droplet digital PCR of the chromosome 8 D8Z2 α-satellite array indicates that there are 1,344 ± 142 D8Z2 HORs present on chromosome 8, consistent with the predictions from an in silico restriction digest and StringDecomposer analysis (Methods). Mean ± s.d. is shown. Scale bar, 5 μm. Insets, 2.5× magnification.
Extended Data Fig. 7. Sequence, structure and epigenetic map of human diploid HG00733 chromosome 8 centromeres.
a, b, Repeat structure, α-satellite organization, methylation status and sequence identity heat map of the maternal (a) and paternal (b) chromosome 8 centromeric regions from a diploid human genome (HG00733; Supplementary Table 2) shows structural and epigenetic similarity to the CHM13 chromosome 8 centromeric region (Fig. 2a). c–e, Dot plot comparisons between the CHM13 and maternal (c), CHM13 and paternal (d), and maternal and paternal (e) chromosome 8 centromeric regions in the HG00733 genome show more than 99% sequence identity overall, with high concordance in the unique and monomeric α-satellite regions of the centromeres (dark red line) that devolves into lower sequence identity in the α-satellite HOR array, consistent with rapid evolution of this region.
Extended Data Fig. 8. Composition, organization and entropy of the CHM13 D8Z2 α-satellite HOR array.
a, HOR composition and organization of the chromosome 8 α-satellite array as determined via StringDecomposer. The predominant HOR subtypes (4-, 7-, 8- and 11-monomer HORs) are shown, whereas those occurring less than 15 times are not (see Methods for absolute quantification). The entropy of the D8Z2 HOR array is plotted in the bottom panel and reveals that the hypomethylated and CENP-A-enriched regions have the highest consistent entropy in the entire array. b, Organization of α-satellite monomers within each HOR. The initial monomer of the 4- and 7-monomer HORs is a hybrid of the A and E monomers, with the first 87 bp the A monomer and the subsequent 84 bp the E monomer. c, Abundance of the predominant HOR types within the D8Z2 HOR array as determined via StringDecomposer.
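One plausible way to produce an entropy track like the one in panel a is to compute the Shannon entropy of HOR subtype labels within a sliding window along the array. The sketch below makes several assumptions that are not taken from the paper's Methods: the array is encoded as a list of HOR sizes (for example 4, 7, 8, 11), and the window and step sizes are arbitrary.

```python
import math
from collections import Counter


def shannon_entropy(labels) -> float:
    """Shannon entropy (in bits) of the label frequencies in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sliding_entropy(labels, window: int = 10, step: int = 1) -> list[float]:
    """Entropy of each full window of `window` consecutive HOR labels."""
    return [shannon_entropy(labels[i:i + window])
            for i in range(0, len(labels) - window + 1, step)]


# Toy array: a uniform stretch of 11-monomer HORs followed by a mixed stretch.
toy_array = [11] * 5 + [4, 7, 8, 11]
track = sliding_entropy(toy_array, window=4)
```

Windows containing a single HOR subtype score zero, while windows mixing several subtypes score higher, which is the qualitative pattern the caption describes for the hypomethylated, CENP-A-enriched region.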
Extended Data Fig. 9. Location of CENP-A chromatin within the CHM13 D8Z2 α-satellite HOR array.
a, b, Plot of the ratio of CENP-A ChIP to bulk nucleosome reads mapped via BWA-MEM (a), or the number of k-mer-mapped CENP-A ChIPs (black) or bulk nucleosome (dark grey) reads (b) (Methods). Shown are two independent replicates of CENP-A ChIP–seq performed on CHM13 cells (top two panels), as well as single replicates of CENP-A ChIP–seq performed on human diploid neocentromeric cell lines (bottom two panels; Methods). Although the neocentromeric cell lines have a neocentromere located on either chromosome 13 (IMS13q) or 8 (MS4221), they both have at least one karyotypically normal chromosome 8 from which centromeric chromatin can be mapped. We limited our analysis to diploid cell lines rather than aneuploid ones to avoid potentially confounding results stemming from multiple chromosome 8 copies that vary in structure, such as those observed in HeLa cells.
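The enrichment ratio in panel a can be illustrated by binning mapped read positions and dividing ChIP counts by bulk nucleosome counts per bin. This is a simplified sketch, not the authors' procedure: the bin size, the pseudocount used to avoid division by zero, and the function name `binned_ratio` are all assumptions, and no library-size normalization is applied.

```python
def binned_ratio(chip_positions, bulk_positions, region_length: int,
                 bin_size: int, pseudocount: float = 1.0) -> list[float]:
    """Per-bin ratio of ChIP read counts to bulk nucleosome read counts.

    Positions are 0-based read start coordinates within the region.
    """
    n_bins = -(-region_length // bin_size)  # ceiling division
    chip = [0] * n_bins
    bulk = [0] * n_bins
    for pos in chip_positions:
        chip[pos // bin_size] += 1
    for pos in bulk_positions:
        bulk[pos // bin_size] += 1
    return [(c + pseudocount) / (b + pseudocount)
            for c, b in zip(chip, bulk)]


# Toy data: the second 10-bp bin is enriched for ChIP reads.
ratios = binned_ratio([5, 15, 16, 17], [3, 4, 18],
                      region_length=20, bin_size=10)
```

Bins with a ratio well above the background would mark candidate CENP-A-enriched intervals, analogous to the domain described within the D8Z2 HOR array.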
Extended Data Fig. 10. Validation of the CHM13 8q21.2 VNTR.
a, Coverage of CHM13 ONT and PacBio HiFi data along the 8q21.2 VNTR (top two panels) is largely uniform, indicating a lack of large structural errors. Two FISH probes targeting the 12.192-kb repeat in the 8q21.2 VNTR are used to estimate the number of repeats in the CHM13 genome (b, c). b, Representative FISH images of a CHM13 stretched chromatin fibre. Although the FISH probes were designed against the entire VNTR array, stringent washing during FISH produces a punctate probe signal pattern, which may be due to stronger hybridization of the probe to a specific region in the 12.192-kb repeat (perhaps based on GC content or a lack of secondary structures). This punctate pattern can be used to estimate the repeat copy number in the VNTR, thereby serving as a source of validation. c, Plot of the signal intensity on the CHM13 chromatin fibre shown in b. Quantification of peaks across three independent experiments reveals an average of 63 ± 7.55 peaks and 67 ± 5.20 peaks (mean ± s.d.) from the green and red probes, respectively, which is consistent with the number of repeat units in the 8q21.2 assembly (67 full and 7 partial repeats). Scale bar, 5 μm.

References

    1. International Human Genome Project Consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
    2. Venter JC, et al. The sequence of the human genome. Science. 2001;291:1304–1351. doi: 10.1126/science.1058040.
    3. Alkan C, et al. Genome-wide characterization of centromeric satellites from multiple mammalian genomes. Genome Res. 2011;21:137–145. doi: 10.1101/gr.111278.110.
    4. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–945. doi: 10.1038/nature03001.
    5. Nurk S, et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. gr.263566.120 (2020).
