[Preprint]. 2024 Aug 5:2024.08.05.606142.
doi: 10.1101/2024.08.05.606142.

A familial, telomere-to-telomere reference for human de novo mutation and recombination from a four-generation pedigree

David Porubsky et al. bioRxiv.

Abstract

Using five complementary short- and long-read sequencing technologies, we phased and assembled >95% of each diploid human genome in a four-generation, 28-member family (CEPH 1463) allowing us to systematically assess de novo mutations (DNMs) and recombination. From this family, we estimate an average of 192 DNMs per generation, including 75.5 de novo single-nucleotide variants (SNVs), 7.4 non-tandem repeat indels, 79.6 de novo indels or structural variants (SVs) originating from tandem repeats, 7.7 centromeric de novo SVs and SNVs, and 12.4 de novo Y chromosome events per generation. STRs and VNTRs are the most mutable with 32 loci exhibiting recurrent mutation through the generations. We accurately assemble 288 centromeres and six Y chromosomes across the generations, documenting de novo SVs, and demonstrate that the DNM rate varies by an order of magnitude depending on repeat content, length, and sequence identity. We show a strong paternal bias (75-81%) for all forms of germline DNM, yet we estimate that 17% of de novo SNVs are postzygotic in origin with no paternal bias. We place all this variation in the context of a high-resolution recombination map (~3.5 kbp breakpoint resolution). We observe a strong maternal recombination bias (1.36 maternal:paternal ratio) with a consistent reduction in the number of crossovers with increasing paternal (r=0.85) and maternal (r=0.65) age. However, we observe no correlation between meiotic crossover locations and de novo SVs, arguing against non-allelic homologous recombination as a predominant mechanism. The use of multiple orthogonal technologies, near-telomere-to-telomere phased genome assemblies, and a multi-generation family to assess transmission has created the most comprehensive, publicly available "truth set" of all classes of genomic variants. The resource can be used to test and benchmark new algorithms and technologies to understand the most fundamental processes underlying human genetic variation.
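As a quick arithmetic check on the per-class estimates quoted above, the listed classes can be tallied. This is only a sketch: the dictionary keys are shorthand labels chosen here, not the authors' terminology. The five listed classes sum to 182.6 events per generation, which is below the quoted average of 192. The abstract does not itemize the difference, so the shares printed below are relative to the listed classes only.

```python
# Per-generation estimates copied from the abstract; the key names are
# shorthand labels introduced for this illustration.
listed_per_generation = {
    "de novo SNVs": 75.5,
    "non-tandem-repeat indels": 7.4,
    "tandem-repeat indels/SVs": 79.6,
    "centromeric SVs and SNVs": 7.7,
    "Y chromosome events": 12.4,
}
STATED_AVERAGE = 192  # average DNMs per generation quoted in the abstract

listed_total = sum(listed_per_generation.values())

# Share of each class relative to the listed classes (not to the stated 192).
shares = {name: value / listed_total
          for name, value in listed_per_generation.items()}

for name, share in shares.items():
    print(f"{name:28s} {share:6.1%}")
print(f"sum of listed classes: {listed_total:.1f} (stated average: {STATED_AVERAGE})")
```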

Conflict of interest statement

E.E.E. is a scientific advisory board (SAB) member of Variant Bio, Inc. C.Lee is an SAB member of Nabsys and Genome Insight. D.P. has previously disclosed a patent application (no. EP19169090) relevant to Strand-seq. Z.K., C.N., E.D., C.F., C.Lambert, T.M., W.J.R., and M.A.E. are employees and shareholders of PacBio. Z.K. is a private shareholder in Phase Genomics. The other authors declare no competing interests.

Figures

Extended Data Figure 1. Long-read sequencing and assembly contiguity.
a) Scatterplot of sequence read depth and read length N50 for ONT (blue) and PacBio (PB; magenta) with median coverage (dashed line) and different generations indicated (point shape). b) Scatterplot of the assembly contiguity measured in AuN values for Verkko (brown), hifiasm (UL) (light blue), and hifiasm (light gray) assemblies of G1-G4. Note: G4 samples were assembled using PacBio HiFi data (hifiasm) only. c) Top: Total number of Verkko contigs whose maximum aligned bases are within +/−5% of the total T2T-CHM13 chromosome length. *Due to substantial size differences between the T2T-CHM13 Y (haplogroup J1a-L816) and the Y chromosome of this pedigree (haplogroup R1b1a-Z302), three contigs are shown that span the entire male-specific Y region without breaks (i.e., excluding the pseudoautosomal regions). Bottom: Each dot represents a single Verkko contig with the highest number of aligned bases in a given chromosome. d) Chromosomes containing complete telomeres and being spanned by a single contig are annotated as solid squares. In instances where the p- and q-arms are not continuously assembled and for acrocentric chromosomes, we plot diagonally divided and color-coded triangles. e) Evaluation of centromere completeness across G1-G3 assemblies and across all chromosomes. We mark centromeres assembled by Verkko (brown), hifiasm (UL) (light blue), or both (green).
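The caption above refers to two contiguity statistics, read length N50 and assembly AuN. A minimal sketch of their usual definitions follows. The function names and the toy lengths are hypothetical, and the source does not say which tool the authors used to compute either value.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def aun(lengths):
    """Area under the Nx curve: sum(L_i^2) / sum(L_i)."""
    if not lengths:
        raise ValueError("need at least one length")
    return sum(length * length for length in lengths) / sum(lengths)


# Toy contig lengths (hypothetical, arbitrary units).
toy_contigs = [50, 30, 20]
print(n50(toy_contigs), aun(toy_contigs))
```

Unlike N50, which changes only when the half-coverage point crosses a contig boundary, AuN responds to every change in the length distribution. That makes it a smoother measure for comparing assemblies.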
Extended Data Figure 2. Recombination breakpoint map of CEPH 1463.
a) Depiction of intergenerational (G1->G4) inheritance of a 1 Mbp assembled contig. Alignments transmitted between generations that are >99.99% identical (red) are contrasted with non-transmitted with lower sequence identity (gray). b) T2T recombination between child and parental haplotypes for chromosome 8. Alignments between parental and a child’s haplotypes are binned into 500 kbp long bins and colored based on the percentage of matched bases. Inherited maternal (shades of red) and paternal (shades of blue) segments are marked on top. Dashed arrows show zoom-in of the two recombination breakpoints that differ in size of the region of homology at the recombination breakpoint. Black tick marks show positions of mismatches between parental and child haplotypes. c) Summary of recombination breakpoints detected in inherited maternal (red) and paternal (blue) homologs with respect to T2T-CHM13. d) Distribution of distances of maternal (red) and paternal (blue) recombination breakpoints to chromosome ends. e) Correlation between the number of recombination breaks (y-axis) and parental age (x-axis) shown separately for maternal (red) and paternal (blue) recombination breakpoints.
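Panel e relates the number of recombination breakpoints to parental age. A Pearson coefficient is one common way to quantify such a relationship, although the source does not state which correlation statistic underlies the reported r values. The sketch below is written under that assumption. The ages and crossover counts are invented for illustration and are not data from the pedigree.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical parental ages (years) and crossover counts per transmission.
ages = [20, 25, 30, 35]
crossovers = [50, 48, 46, 44]
r = pearson_r(ages, crossovers)
print(f"r = {r:.2f}")
```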
Extended Data Figure 3. Number of germline and postzygotic SNVs transmitted to children.
a) The fraction of a parent’s germline SNVs (green, DNMs) and postzygotic SNVs (purple, PZMs) transferred to each child. b) The mean allele balance (AB) of DNMs and PZMs across HiFi, Illumina, and ONT data plotted against the fraction of children who inherited a variant reveals that about half the PZMs with AB < 0.25 get transmitted to at least one child. c) On average, DNMs are transmitted to 50% of children, while PZMs are transmitted to less than 25% of children. d) Number of DNMs and PZMs transmitted to each child in the pedigree.
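Panel b uses the mean allele balance (AB) across sequencing platforms. A minimal sketch of that quantity follows, assuming the common definition AB = alternate-allele reads / (reference + alternate reads). The source does not spell out its definition. The read counts are invented, and the function names are hypothetical.

```python
def allele_balance(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele (assumed definition)."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / depth


def mean_allele_balance(counts_by_platform):
    """Unweighted mean AB over platforms; values are (ref_reads, alt_reads)."""
    values = [allele_balance(ref, alt) for ref, alt in counts_by_platform.values()]
    return sum(values) / len(values)


# Invented read counts for two variants, keyed by platform.
germline_like = {"HiFi": (15, 15), "Illumina": (20, 20), "ONT": (10, 10)}
low_ab_variant = {"HiFi": (27, 3), "Illumina": (36, 4), "ONT": (18, 2)}

print(mean_allele_balance(germline_like))   # heterozygous-like balance
print(mean_allele_balance(low_ab_variant))  # balance well below 0.25
```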
Extended Data Figure 4. Changes in centromere sequence, structure, and DNA methylation patterns across generations.
a) Deletion of an 18-monomer α-satellite HOR within the chromosome 6 centromere of G2-NA12878 is inherited in G3-NA12887, shortening the length of the α-satellite HOR array by ~3 kbp. b) Sequence identity heatmap of the chromosome 6 centromere in G1-NA128991 shows the high (~100%) sequence identity of α-satellite HORs along the entire centromeric array and at the site of the de novo deletion. c,d) Deletions of α-satellite HORs in regions outside of the centromere dip region (CDR) in the c) chromosome 4 and d) chromosome 11 centromeres do not affect the position of the CDR. e,f) Deletions and insertions of α-satellite HORs within the CDR in the e) chromosome 19 and f) chromosome 21 centromeres alter the distribution of the CDR.
Figure 1. Sequencing the CEPH 1463 pedigree with five technologies.
Twenty-eight members of the four-generation CEPH pedigree (1463) were sequenced using five orthogonal next-generation and long-read sequencing platforms: HiFi sequencing, Illumina, and Element sequencing for generations 2–4 (G2-G4) were performed on peripheral blood, while UL-ONT and Strand-seq were generated on available lymphoblastoid cell lines (G1-G3). The pedigree dataset has been expanded, for the first time, to include the fourth generation and G3 spouses (NA12879 and NA12886).
Figure 2. Summary of de novo mutation (DNM) rates.
a) The number of de novo germline/postzygotic mutations (PZMs) and indels (<50 bp) for the parents (G2) and 8 children in CEPH 1463. Tandem repeat de novo mutations (TR DNMs) (<50 bp) are shown for G3 only because they have greater parental sequencing depth and we can assess transmission (Methods). Crosshatch bars are the number of SNVs confirmed as transmitting to the next generation. b) Germline SNVs have a mean allele balance near 0.50 across sequencing platforms, while the mean postzygotic allele balance is less than 0.25. c) A strong paternal age effect is observed for germline de novo SNVs but not for PZMs. d) Estimated SNV DNM rate by region of the genome shows a significant excess of DNM for large repeat regions, including centromeres and segmental duplications. Assembly-based DNM calls on the centromeres and Y chromosome show an excess of DNM in the satellite DNA.
Figure 3. Tandem repeat de novo mutations (TR DNMs) show motif size dependent mutation rates, paternal bias, and are highly recurrent at specific loci.
a) TR DNM rates (mutations per haplotype, per locus, per generation) as a function of TR motif size in the T2T-CHM13 reference genome. Complex TR loci that comprise more than one unique motif were excluded. Error bars denote 95% Poisson confidence intervals around the mean mutation rate estimate. Mutation rates include all calls that pass TRGT-denovo filtering criteria but are not adjusted for Element validation. b) Inferred parent-of-origin for confidently phased TR DNMs in G3. Crosshatches indicate transmission to at least one G4 child, where available. c) Pedigree overview of a recurrent VNTR locus at chr8:2376919–2377075 (T2T-CHM13) with motif composition GAGGCGCCAGGAGAGAGCGCT(n)ACGGG(n). Allele coloring indicates inheritance patterns as determined by inheritance vectors, gray representing unavailable data. Symbols denote inheritance type relative to the inherited parental allele: “+” for de novo expansion, “−” for de novo contraction, and “=” for regular inheritance, shown only for the mutating alleles, and numbers indicate allele lengths in bp. d) Read-level evidence for the recurrent DNM in (c), represented as vertical lines, obtained from individual sequencing reads, shown per sample. Where available, both HiFi (top) and ONT (bottom) sequencing reads are displayed. Coloring is consistent with inheritance patterns in (c); outlined boxes with +/− markers highlight DNMs.
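Panel a reports 95% Poisson confidence intervals around mutation rates. The caption does not say which interval construction was used. The sketch below assumes the exact (Garwood) interval and solves for its bounds by bisection on the Poisson CDF, using only the standard library. The function names, the example count, and the exposure are all hypothetical.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam); intended for modest counts."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def _root_of_decreasing(f, lo, hi, iterations=200):
    """Bisection for a decreasing f with f(lo) > 0 >= f(hi)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_poisson_interval(count, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for the Poisson mean given a count."""
    if count < 0:
        raise ValueError("count must be non-negative")
    hi_start = count + 10 * math.sqrt(count + 1) + 10  # generous upper bracket
    upper = _root_of_decreasing(
        lambda lam: poisson_cdf(count, lam) - alpha / 2, 0.0, hi_start)
    if count == 0:
        return 0.0, upper
    lower = _root_of_decreasing(
        lambda lam: poisson_cdf(count - 1, lam) - (1 - alpha / 2), 0.0, float(count))
    return lower, upper


def rate_interval(count, exposure, alpha=0.05):
    """Scale the interval on the count to a rate per unit of exposure."""
    lower, upper = exact_poisson_interval(count, alpha)
    return lower / exposure, upper / exposure


# Hypothetical example: 10 mutations over 1e5 haplotype-locus transmissions.
print(rate_interval(10, 1e5))
```

Here `exposure` stands for the number of haplotype-by-locus transmissions over which mutations were counted. Dividing both bounds by it gives a rate in the caption's units of mutations per haplotype, per locus, per generation.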
Figure 4. De novo SVs among centromeres transmitted across generations.
a) Plot summarizing the number of correctly assembled centromeres (dark gray) as well as those transmitted to the next generation (light gray). Transmitted centromeres that carry a de novo deletion, insertion, or both are colored (see legend). b) Lengths of the de novo SVs within α-satellite HOR arrays and flanking regions. c) An example of a de novo deletion in the chromosome 6 α-satellite HOR array in G2-NA12878 that was inherited in G3-NA12887. Red arrows over each haplotype show the α-satellite HOR structure, while gray blocks between haplotypes show syntenic regions. The deleted region is highlighted by a red outline. d) An example of a de novo insertion and deletion in the chromosome 19 α-satellite HOR array of G3-NA12887. e-f) Zoom-in of the α-satellite HOR structure of the inserted (blue outline) and deleted (red outline) α-satellite HORs from (d). Again, colored arrows on top of each haplotype show the α-satellite HOR structure. g) Example of two de novo deletions in the chromosome 21 centromere of G2-NA12877. The deletions reside within a hypomethylated region of the centromeric α-satellite HOR array, known as the “centromere dip region” (CDR), which is thought to be the site of kinetochore assembly. The deletion of three α-satellite HORs within the CDR results in a shift of the CDR by ~260 kbp in G2-NA12877.
Figure 5. Chromosome Y and an example of a de novo mobile element.
a) Pedigree of the nine males carrying the R1b1a-Z302 Y chromosomes (left) and pairwise comparison of Y assemblies: closely related Y from HG00731 (R1b1a-Z225) and the most contiguous R1b1a-Z302 Y assemblies from three generations. Y-chromosomal sequence classes are shown with pairwise sequence identity between samples in 100 kbp bins, with QC-passed SVs identified in the pedigree males shown. b) Summary of chrY DNMs. Top - Y structure of G1-NA12889. Below the Y structure - all identified DNMs across G1-G3 Y assemblies. Bottom - breakout by mutation class and by sample. In light gray are DNMs that show evidence of transmission from G2 to G3-G4, and from G3-NA12886 to his male descendants in G4. c) De novo SVA insertion in G3-NA12887. d) HiFi read support for the de novo SVA insertion in G3-NA12887.

