Cell. 2024 Mar 14;187(6):1547-1562.e13.
doi: 10.1016/j.cell.2024.01.052. Epub 2024 Feb 29.

Structurally divergent and recurrently mutated regions of primate genomes

Yafei Mao et al. Cell.

Abstract

Using multiple long-read sequencing technologies, we sequenced and assembled the genomes of chimpanzee, bonobo, gorilla, orangutan, gibbon, macaque, owl monkey, and marmoset. We identified 1,338,997 lineage-specific fixed structural variants (SVs) disrupting 1,561 protein-coding genes and 136,932 regulatory elements, including the most complete set of human-specific fixed differences. We estimate that 819.47 Mbp or ∼27% of the genome has been affected by SVs across primate evolution. We identify 1,607 structurally divergent regions wherein recurrent structural variation contributes to creating SV hotspots where genes are recurrently lost (e.g., CARD, C4, and OLAH gene families) and additional lineage-specific genes are generated (e.g., CKAP2, VPS36, ACBD7, and NEK5 paralogs), becoming targets of rapid chromosomal diversification and positive selection (e.g., RGPD gene family). High-fidelity long-read sequencing has made these dynamic regions of the genome accessible for sequence-level analyses within and between primate species.
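As a quick consistency check on the two figures quoted above (819.47 Mbp and ∼27%), the sketch below back-calculates the genome size they imply. It assumes that "∼27%" is rounded to the nearest percent; the abstract does not state which reference length was used as the denominator, so the result is only an implied range, not the authors' value.

```python
# Figures quoted in the abstract.
affected_mbp = 819.47    # sequence affected by SVs, in Mbp
stated_fraction = 0.27   # "~27% of the genome"

# Genome size implied if the fraction were exactly 27%.
implied_genome_mbp = affected_mbp / stated_fraction

# Assumption: "~27%" means anything from 26.5% to 27.5%,
# which brackets the implied denominator.
implied_low_mbp = affected_mbp / 0.275
implied_high_mbp = affected_mbp / 0.265

print(f"implied genome size: {implied_genome_mbp:.0f} Mbp "
      f"(range {implied_low_mbp:.0f} to {implied_high_mbp:.0f} Mbp)")
```

The implied denominator of roughly 3.0 to 3.1 Gbp is in line with the size of a human reference assembly, which suggests the percentage is expressed relative to a human-scale genome.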

Keywords: NPHP1 and Joubert syndrome; RGPD gene family; adaptive evolution; comparative genomics; duplicated genes; evolutionary medicine; human diseases; long-read sequencing; primate evolution.

Conflict of interest statement

Declaration of interests: E.E.E. is a scientific advisory board (SAB) member of Variant Bio, Inc.

Figures

Figure 1. Primate phylogeny and single-nucleotide variant (SNV) divergence between nonhuman primates (NHPs) and humans.
(a) A primate time-calibrated phylogeny was constructed from a multiple sequence alignment (MSA) of 81.63 Mbp of autosomal sequence from nine genomes. The estimated species divergence time (above node) with 95% confidence interval (CI, horizontal blue bar) was calculated using BEAST2. All nodes have 100% posterior probability support, and the gene tree concordance factor is indicated (below node). The inset (gray) depicts a maximum likelihood phylogram, which reveals a significantly shorter branch length in owl monkey than in marmoset. (b) SNV divergence calculated by mapping HiFi sequence reads to human GRCh38 separately for autosomes and the X chromosome (excluding pseudoautosomal regions). Owl monkey shows significantly less divergence from human than marmoset does (Wilcoxon rank sum test). (c) The percentage of trees showing an alternate tree topology is indicated (percentages are drawn from a total of 302,575 gene trees): 159,546 (52.7%) support the primate topology depicted in panel a.
Figure 2. Primate genome structural variation.
(a) The number of fixed structural variants (SVs), including deletions (red) and insertions (blue), is shown for each branch of the primate tree (number of events above the line and number of Mbp below). The number of “disrupted” protein-coding genes based on human RefSeq models is also indicated (black oval) with the total number of events (first number) and the subset specific to each lineage (second number). The numbers assigned to the ancestral branches refer to the ancestral lineage-specific SVs (e.g., the Pan lineage, shared by bonobo and chimpanzee). (b) The number of fixed SVs correlates with the accumulation of SNVs in each lineage (comparison to GRCh38) for both deletions (red) and insertions (blue). (c) An ape-specific fixed L1 insertion (red dashed box), present in the human genome but not in the macaque genome, serves as an exapted exon of the short isoform of astrotactin 2, ASTN2, in human. (d) A 42.7 kbp lineage-specific deletion in the gibbon genome (red dashed line) deletes TAAR2 and seven enhancers (shown in orange) compared to human (GRCh38). (e) A human-specific 90 bp deletion in NAT16 (NM_001369694) removes 30 amino acids in humans compared to all NHPs.
Figure 3. Structurally divergent regions (SDRs) of the primate genome.
(a) A schematic of human chromosomes (T2T-CHM13) depicts SDR hotspots where recurrent rearrangements occur in excess. Heatmap indicates significance based on simulation model (dark (p=0) to light red (p=0.05)). Centromeres are depicted in purple. Enumerated regions identify specific gene families or regions of biomedical interest (1: UPRT, 2: RGPDs, 3: USP41, 4: ZNFs, 5: IL3RA_2, 6: CARDs, 7: OLAH, and 8: MHC). (b) Recurrent deletion of the caspase recruitment domain (CARD) gene family. (c) SafFire plot of SDRs mapping to genes OLAH, MEIG1, and ACBD7 in human shows a large ~250 kbp insertion of segmental duplications (SDs; colored arrowheads) in chimpanzee within the intergenic region between MEIG1 and OLAH. Full-length transcript sequencing of macaque using Iso-Seq supports the formation of five alternate transcripts, including four OLAH-ACBD7 fusion events and a derived ACBD7 (macaque gene models below). (d) The chimpanzee-specific 250 kbp SD from chromosome 12 creates a multi-exonic gene model supported by Iso-Seq transcript sequencing in chimpanzee (upper panel) with an unmethylated promoter (Data S1). (e) In humans, the directly oriented (DO) repeats associate with the breakpoints of recurrent deletions and duplications of the spermiogenesis gene MEIG1. Two females carrying a deletion and a duplication are depicted from a population sample of 19,584 genomes (CCDG, https://ccdg.rutgers.edu/). (f) C4 haplotypes in primates. Nine human haplotypes from 94 HPRC haplotypes and other NHP haplotypes from this study are shown in the left panel. The percentage of nine human haplotypes is shown in the right panel. The mean ages of nodes are shown, and the horizontal bars represent 95% CI of node ages. (h) Human gene C4 under positive selection. The probability of sites under positive selection is shown on top based on the PAML branch-site model: >90% probability (orange) and >80% probability (blue). The corresponding amino acid alignment in primates is shown in the bottom panel. The amino acids under positive selection with at least 50% probability are indicated (colored dots and red boxes).
Figure 4. Marmoset-specific genes in an SDR.
(a) SafFire plot comparing the organization of a gene-rich region of ~1.1 Mbp in human (middle), owl monkey (top), and marmoset (bottom) genomes. (b) Iso-Seq full-length non-chimeric transcript sequencing from 10 marmoset primary tissues confirms transcription of 8/10 of the paralogous copies and the maintenance of an open-reading frame in at least six of these marmoset-specific gene candidates. Cell types identified from human (c) and marmoset (d) cortex single-nuclei RNA-seq. The tSNE plots show the cell type identified by specific markers (Ex: excitatory, In: inhibitory, Oligo: oligodendrocytes, Astro: astrocytes, Micro: microglia, OPC: oligodendrocyte precursor cells). The tSNE plots for human VPS36 (e) and marmoset VPS36_ori (f) show the increased proportion of neuroglial cells expressing VPS36 paralogs in marmosets compared to humans. (g) The proportion of neuroglial cells expressing VPS36 in human and expressing VPS36, VPS36_L1, and VPS36_L2 in marmoset.
Figure 5. Evolution, selection, and disease susceptibility of the RGPD gene family.
(a) Schematic depicting RGPD genes (red dots) compared to their progenitor gene RANBP2 (orange dot) in apes. Shared ancestral copies among the lineages are indicated (vertical arrows) in contrast to lineage-specific duplications (black) or gene conversion events (blue dashed arced arrow). (b) A maximum likelihood tree based on a 58.98 kbp MSA of 40 RGPD great ape copies outgrouped with a sole gibbon copy. The mean ages of divergence are shown above the node (95% CI blue bar) for human (H), bonobo (B), chimpanzee (C), gorilla (G), orangutan (O), and gibbon (Gib) copies. (c) A comparison of ~7 Mbp on chromosome 2 among ape genomes showing that large breakpoints in synteny (colored rectangles) often correspond to sites of RGPD SD insertions (blue arrows). (d) Human genetic diversity (pi) calculated from 94 haplotype-resolved human genomes (HPRC) for a 700 kbp region of chromosome 2. A segment mapping to the human-specific gene RGPD1 shows the lowest genetic diversity on chromosome 2 (top panel, red arrow) in haplotypes of both African (red) and non-African (blue) descent. (e) AlphaFold predictions of the N-terminal protein structure of RANBP2 (blue), hRGPD1 (pink), and hRGPD2 (green) predict that differences in amino acid composition alter the secondary structure of two alpha helices (α1 and α2) in the human-specific RGPD1 copy. (f) SafFire plot (top panel) comparing the chimpanzee and human genomes highlights the formation of a 350 kbp human-specific duplication creating RGPD6 (red shading). (g) Analysis of 102 human haplotypes shows that the RGPD6 locus is largely fixed among all humans but that the organization of the flanking SDs differs significantly. We identify 11 distinct structural haplotypes in the human population, predicting both disease-susceptibility and protective haplotypes for nonallelic homologous recombination (NAHR).
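Figure 5d reports genetic diversity (pi) across aligned human haplotypes. The record above does not describe how pi was computed, so the following is only a minimal sketch of the textbook definition (mean number of pairwise differences per site), not the authors' pipeline. The function name `nucleotide_diversity` and the toy sequences are hypothetical, and the sketch assumes gap-free, equal-length aligned haplotypes.

```python
from itertools import combinations


def nucleotide_diversity(haplotypes):
    """Mean pairwise differences per site among aligned haplotypes.

    Assumes all sequences are aligned, equal in length, and free of
    gaps or missing data (a simplification for illustration).
    """
    if len(haplotypes) < 2:
        raise ValueError("need at least two haplotypes")
    length = len(haplotypes[0])
    if length == 0 or any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must be non-empty and equal in length")

    total_differences = 0
    n_pairs = 0
    for first, second in combinations(haplotypes, 2):
        # Count sites at which this pair of haplotypes differs.
        total_differences += sum(a != b for a, b in zip(first, second))
        n_pairs += 1

    return total_differences / (n_pairs * length)


# Toy data (made up): three haplotypes over eight aligned sites.
toy_haplotypes = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
toy_pi = nucleotide_diversity(toy_haplotypes)
print(toy_pi)
```

In a windowed scan such as the one the caption describes, a value like this would be computed per window along the region, so that a segment with unusually low pi stands out against its neighbors.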
