[Preprint]. 2023 Mar 7:2023.03.07.531415.
doi: 10.1101/2023.03.07.531415.

Structurally divergent and recurrently mutated regions of primate genomes

Yafei Mao et al. bioRxiv.

Update in

  • Structurally divergent and recurrently mutated regions of primate genomes.
    Mao Y, Harvey WT, Porubsky D, Munson KM, Hoekzema K, Lewis AP, Audano PA, Rozanski A, Yang X, Zhang S, Yoo D, Gordon DS, Fair T, Wei X, Logsdon GA, Haukness M, Dishuck PC, Jeong H, Del Rosario R, Bauer VL, Fattor WT, Wilkerson GK, Mao Y, Shi Y, Sun Q, Lu Q, Paten B, Bakken TE, Pollen AA, Feng G, Sawyer SL, Warren WC, Carbone L, Eichler EE. Cell. 2024 Mar 14;187(6):1547-1562.e13. doi: 10.1016/j.cell.2024.01.052. Epub 2024 Feb 29. PMID: 38428424. Free PMC article.

Abstract

To better understand the pattern of primate genome structural variation, we used multiple long-read sequencing technologies to sequence and assemble the genomes of eight nonhuman primate species, including New World monkeys (owl monkey and marmoset), an Old World monkey (macaque), Asian apes (orangutan and gibbon), and African ape lineages (gorilla, bonobo, and chimpanzee). Compared to the human genome, we identified 1,338,997 lineage-specific fixed structural variants (SVs) disrupting 1,561 protein-coding genes and 136,932 regulatory elements, including the most complete set of human-specific fixed differences. Based on analysis of these primate lineages, we estimate that 819.47 Mbp or ~27% of the genome has been affected by SVs across 50 million years of primate evolution. We identify 1,607 structurally divergent regions (SDRs) wherein recurrent structural variation creates SV hotspots where genes are recurrently lost (e.g., CARDs, ABCD7, OLAH), new lineage-specific genes are generated (e.g., CKAP2, NEK5), and gene families have become targets of rapid chromosomal diversification and positive selection (e.g., RGPDs). High-fidelity long-read sequencing has made these dynamic regions of the genome accessible for sequence-level analyses within and between primate species for the first time.
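As a quick consistency check on the figures above, the affected length and the quoted fraction together imply a genome-size denominator. The sketch below only divides one number from the abstract by the other; the function name is hypothetical, and because the abstract gives the fraction as approximate (~27%), the implied size is approximate too.

```python
def implied_genome_size_mbp(affected_mbp: float, affected_fraction: float) -> float:
    """Genome size (Mbp) implied by an affected length and the fraction it represents."""
    if not 0 < affected_fraction <= 1:
        raise ValueError("affected_fraction must be in (0, 1]")
    return affected_mbp / affected_fraction


# Values quoted in the abstract; 0.27 is the rounded "~27%".
AFFECTED_MBP = 819.47
AFFECTED_FRACTION = 0.27

implied_size = implied_genome_size_mbp(AFFECTED_MBP, AFFECTED_FRACTION)
print(f"Implied genome size: about {implied_size:.0f} Mbp")
```

The result is on the order of 3 Gbp, in line with the size of a human genome assembly, which suggests the percentage was computed against a whole-genome denominator.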

Conflict of interest statement

Competing interests: E.E.E. is a scientific advisory board (SAB) member of Variant Bio, Inc. The other authors declare no competing interests.

Figures

Figure 1. Primate phylogeny and SNV divergence between NHPs and humans.
(a) A primate time-calibrated phylogeny was constructed from a multiple sequence alignment (MSA) of 81.63 Mbp of autosomal sequence from nine genomes. The estimated species divergence time (above node) with 95% confidence interval (CI, horizontal blue bar) was calculated using BEAST2. All nodes have 100% posterior probability support, and the gene tree concordance factor (gCF) is indicated (below node). The inset (gray) depicts a maximum likelihood phylogram generated using IQ-TREE2, which reveals a significantly shorter branch length in owl monkey with respect to marmoset. (b) SNV divergence calculated by mapping HiFi sequence reads to human GRCh38 separately for autosomes and the X chromosome (excluding pseudoautosomal regions). Approximately 85% of the genome was aligned for Old World monkey and apes and ~60% for New World monkey. The owl monkey shows significantly less divergence from human than the marmoset does (Wilcoxon rank sum test). An analysis using 20 kbp nonoverlapping segments from the assembly gives almost identical results (Supplementary Figure 4). (c) The percentage of gene trees showing each alternate tree topology is indicated (percentages are drawn from a total of 302,575 gene trees): 159,546 (52.7%) support the primate topology depicted in panel a.
Figure 2. Primate genome structural variation.
(a) The numbers of fixed structural variants (SVs), including deletions (red) and insertions (blue), are shown for each branch of the primate tree (number of events above the line and number of Mbp below). The number of “disrupted” protein-coding genes based on human RefSeq models is also indicated (black oval) with the total number of events (first number) and the subset specific to each lineage (second number). (b) The number of fixed SVs correlates with the accumulation of SNVs in each lineage (comparison to GRCh38) for both deletions (red) and insertions (blue). (c) An ape-specific fixed L1 insertion (shown with a red dashed line box) present in the human genome but not in the macaque genome (Miropeats alignment) serves as an exapted exon of the short isoform of astrotactin 2, ASTN2, in human. The coding sequences of the exon are shown in the bottom panel. The red triangles represent 1 bp insertions resulting in a frameshift in gorilla, orangutan, and gibbon. The red box represents the stop codon. (d) A 42.7 kbp lineage-specific deletion in the gibbon genome (red dashed line) deletes TAAR2 and seven enhancers (shown in orange) compared to the human genome (GRCh38) (Miropeats comparison). (e) A human-specific 90 bp deletion in NAT16 (NM_001369694) removes 30 amino acids in humans compared to all NHPs.
Figure 3. Structurally divergent regions (SDRs) of the primate genome.
(a) A schematic of human chromosomes (T2T-CHM13) depicts SDR hotspots where recurrent rearrangements occur in excess. Heat map indicates significance based on simulation model (dark (p=0) to light red (p=0.05)). Centromeres are depicted in purple. Enumerated regions identify specific gene families or regions of biomedical interest (1: UPRT, 2: RGPDs, 3: USP41, 4: ZNFs, 5: IL3RA_2, 6: CARDs, 7: OLAH, and 8: MHC). (b) Recurrent deletion of the caspase recruitment domain (CARD) gene family. SafFire plot (https://github.com/mrvollger/SafFire) shows a ~58 kbp deletion of CARD18 (orange) in the Pan lineage, multiple deletions (~190 kbp total) in gibbon of CARD16 (blue), CARD17 (red) and CARD18, and multiple deletions (~150 kbp), including CARD17 (red), in marmoset. (c) SafFire plot of SDR mapping to genes OLAH, MEIG1, and ABCD7 in human shows a large ~250 kbp insertion of segmental duplications (SDs; colored arrowheads) in chimpanzee within the intergenic region between MEIG1 and OLAH. OLAH is deleted in gorilla by an independent lineage-specific deletion (~30 kbp). Multiple independent insertion events in macaque add ~190 kbp of sequence, including a duplication of OLAH in macaque. Full-length transcript sequencing of macaque using Iso-Seq supports the formation of five novel transcripts, including four OLAH-ABCD fusion events and a derived ABCD7 (macaque gene models below). (d) The chimpanzee-specific 250 kbp SD from chromosome 12 creates a novel multi-exonic gene model supported by Iso-Seq transcript sequencing in chimpanzee (upper panel) with an unmethylated promoter (Supplementary Figure 36). The insertion simultaneously deletes one of two directly orientated (DO) SDs in chimpanzee. (e) In humans, the DO repeats associate with the breakpoints of recurrent deletions and duplications of the spermiogenesis gene MEIG1. Two females carrying a deletion and a duplication (as measured by sequence read depth) are depicted from a population sample of 19,584 genomes (CCDG, https://ccdg.rutgers.edu/). The carrier frequencies for microdeletion and microduplication in control samples are 0.026% and 0.189%, respectively.
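The carrier frequencies in panel (e) can be turned back into approximate carrier counts. The sketch below assumes the percentages were computed over all 19,584 genomes; the caption says "control samples" without giving that subset's size, so both the denominator and the helper name are assumptions made for illustration.

```python
def approx_carriers(frequency_percent: float, sample_size: int) -> int:
    """Nearest whole number of carriers implied by a percentage frequency."""
    return round(frequency_percent / 100 * sample_size)


# Assumption: the full 19,584-genome sample is the denominator.
SAMPLE_SIZE = 19_584

deletion_carriers = approx_carriers(0.026, SAMPLE_SIZE)
duplication_carriers = approx_carriers(0.189, SAMPLE_SIZE)

print(f"Microdeletion carriers: ~{deletion_carriers}")
print(f"Microduplication carriers: ~{duplication_carriers}")
```

Under that assumption, the frequencies correspond to only a handful of deletion carriers and a few dozen duplication carriers, so the rearrangements are rare in absolute terms even though they recur.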
Figure 4. Marmoset-specific genes in an SDR.
(a) SafFire plot comparing the organization of a gene-rich region of ~1.1 Mbp in human (middle), owl monkey (top), and marmoset (bottom) genomes. Human and marmoset differ mainly by a large 250 kbp inversion (orange) associated with the addition of 150 kbp of SD at the boundary of the inversion in humans (colored arrowheads). The corresponding region in marmoset has expanded by ~400 kbp due to inversion and marmoset-specific SDs creating marmoset-specific paralogs (red arrows) of CCDC70, TMEM272, DHRS12, UTP14C, THSD1, VPS36, NEK5 and CKAP2. (b) Iso-Seq full-length non-chimeric transcript sequencing from 10 marmoset primary tissues confirms transcription of 8/10 of the paralogous copies and the maintenance of an open-reading frame in at least six of these marmoset-specific gene candidates.
Figure 5. Evolution, selection, and disease susceptibility of the RGPD gene family.
(a) Schematic depicting RGPD genes (red dots) compared to their progenitor gene RANBP2 (orange dot) in human, chimpanzee, gorilla, orangutan, and gibbon. Shared ancestral copies among the lineages are indicated (vertical arrows) in contrast to lineage-specific duplications (black) or gene conversion events (blue dashed arced arrow). The majority of copies have expanded in a lineage-specific fashion in each ape lineage. (b) A maximum likelihood tree based on a 58.98 kbp MSA of 40 RGPD great ape copies outgrouped with a sole gibbon copy. Nodes are dated with BEAST2 with the mean age of divergence shown above the node (95% CI blue bar) for human (H), bonobo (B), chimpanzee (C), gorilla (G), orangutan (O), and gibbon (Gib) copies. The analysis confirms lineage-specific expansion with all nodes receiving 100% posterior probability. (c) A comparison of ~7 Mbp on chromosome 2 among ape genomes showing that large breakpoints in synteny (colored rectangles) often correspond to sites of RGPD SD insertions (blue arrows). (d) Human genetic diversity (pi) calculated in 20 kbp windows (slide 10 kbp) from 94 haplotype-resolved human genomes (HPRC) for a 700 kbp region of chromosome 2. A segment mapping to the human-specific gene RGPD1 shows the lowest genetic diversity on chromosome 2 (top panel, red arrow) in haplotypes of both African (red) and non-African (blue) descent. The data suggest that the RGPD1 region may have been under recent selection in the ancestral human population. (e) AlphaFold predictions of the N-terminal protein structure of RANBP2 (blue), hRGPD1 (pink), and hRGPD2 (green) predict that differences in amino acid composition alter the secondary structure of two alpha helices (α1 and α2) in the human-specific RGPD1 copy. The X-ray crystal protein structure of hRANBP2 (Nup358, PDB: 4GA0) confirms that the α1 and α2 interface is maintained as a result of critical hydrophobic amino acids located in the N-terminus. Specific amino acid changes in hRGPD1, but not in the ancestral hRGPD2 or RANBP2, break the hydrophobic interface between α1 and α2, predicting the emergence of a human-specific protein structure. (f) SafFire plot (top panel) comparing the chimpanzee and human genomes highlights the formation of a 350 kbp human-specific duplication creating RGPD6 (red shading). (g) Analysis of 94 human haplotypes shows that the RGPD6 locus is largely fixed among all humans but that the organization of the flanking SDs differs significantly. We identify 11 distinct structural haplotypes in the human population, predicting both disease-susceptibility and protective haplotypes for nonallelic homologous recombination (NAHR). NAHR between inverted repeats (large black arrows) predisposes to recurrent inversion of the region, while NAHR between directly orientated repeats (red arrows) deletes the NPHP1 allele, creating the pathogenic allele associated with juvenile nephronophthisis and milder forms of Joubert syndrome. This predisposition to disease thus arose as a result of the human-specific duplication of the RGPD gene family.
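Panel (d) reports nucleotide diversity (pi) in sliding windows. The following is a minimal sketch of that kind of calculation, not the authors' pipeline. It assumes the haplotypes are already aligned, equal-length strings with no gaps or missing data, and the function names and toy input are hypothetical. The demo uses a 4 bp window with a 2 bp step purely to keep the example small; the caption's setting would be `window=20_000, step=10_000`.

```python
from collections import Counter


def site_diversity(column):
    """Fraction of haplotype pairs that differ at one aligned site."""
    n = len(column)
    total_pairs = n * (n - 1) // 2
    # Pairs carrying the same allele, summed over every allele seen at the site.
    same_pairs = sum(c * (c - 1) // 2 for c in Counter(column).values())
    return (total_pairs - same_pairs) / total_pairs


def windowed_pi(haplotypes, window, step):
    """Return (window_start, pi) pairs, with pi averaged per site in each window."""
    if len(haplotypes) < 2:
        raise ValueError("need at least two haplotypes")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must be aligned to the same length")
    results = []
    for start in range(0, length - window + 1, step):
        total = sum(
            site_diversity([h[i] for h in haplotypes])
            for i in range(start, start + window)
        )
        results.append((start, total / window))
    return results


# Toy alignment: four haplotypes, one variable site at the last position.
toy_haplotypes = ["AAAAAAAA", "AAAAAAAT", "AAAAAAAT", "AAAAAAAA"]
toy_pi = windowed_pi(toy_haplotypes, window=4, step=2)
for start, pi in toy_pi:
    print(start, round(pi, 4))
```

A window with unusually low pi relative to its neighbours, as described for the RGPD1 segment, is the kind of signal this scan is meant to expose. Real analyses also have to handle gaps, missing calls, and unequal sample sizes per site, which this sketch ignores.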
