Science. 2020 Dec 18;370(6523):eabc6617.
doi: 10.1126/science.abc6617.

Sequence diversity analyses of an improved rhesus macaque genome enhance its biomedical utility

Wesley C Warren, R Alan Harris, Marina Haukness, Ian T Fiddes, Shwetha C Murali, Jason Fernandes, Philip C Dishuck, Jessica M Storer, Muthuswamy Raveendran, LaDeana W Hillier, David Porubsky, Yafei Mao, David Gordon, Mitchell R Vollger, Alexandra P Lewis, Katherine M Munson, Elizabeth DeVogelaere, Joel Armstrong, Mark Diekhans, Jerilyn A Walker, Chad Tomlinson, Tina A Graves-Lindsay, Milinn Kremitzki, Sofie R Salama, Peter A Audano, Merly Escalona, Nicholas W Maurer, Francesca Antonacci, Ludovica Mercuri, Flavia A M Maggiolini, Claudia Rita Catacchio, Jason G Underwood, David H O'Connor, Ashley D Sanders, Jan O Korbel, Betsy Ferguson, H Michael Kubisch, Louis Picker, Ned H Kalin, Douglas Rosene, Jon Levine, David H Abbott, Stanton B Gray, Mar M Sanchez, Zsofia A Kovacs-Balint, Joseph W Kemnitz, Sara M Thomasy, Jeffrey A Roberts, Erin L Kinnally, John P Capitanio, J H Pate Skene, Michael Platt, Shelley A Cole, Richard E Green, Mario Ventura, Roger W Wiseman, Benedict Paten, Mark A Batzer, Jeffrey Rogers, Evan E Eichler

Wesley C Warren et al. Science. 2020.

Abstract

The rhesus macaque (Macaca mulatta) is the most widely studied nonhuman primate (NHP) in biomedical research. We present an updated reference genome assembly (Mmul_10, contig N50 = 46 Mbp) that increases the sequence contiguity 120-fold and annotate it using 6.5 million full-length transcripts, thus improving our understanding of gene content, isoform diversity, and repeat organization. With the improved assembly of segmental duplications, we discovered new lineage-specific genes and expanded gene families that are potentially informative in studies of evolution and disease susceptibility. Whole-genome sequencing (WGS) data from 853 rhesus macaques identified 85.7 million single-nucleotide variants (SNVs) and 10.5 million indel variants, including potentially damaging variants in genes associated with human autism and developmental delay, providing a framework for developing noninvasive NHP models of human disease.
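The contig N50 quoted above is a length-weighted median: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of all assembled bases. The sketch below shows that standard calculation on a toy list of contig lengths. The function name and the input values are hypothetical and purely illustrative; this is not the authors' pipeline or data.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids rounding issues with total / 2.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (arbitrary units), not Mmul_10 data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(toy_contigs))
```

Because the statistic is weighted by length, a few very long contigs dominate it, which is why closing gaps between contigs can raise N50 sharply.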

Conflict of interest statement

Competing interests: The authors declare no competing financial interests.

Figures

Figure 1. Rhesus macaque genome assembly quality and contiguity.
(A) The number of gaps and contig N50 lengths are compared among mammalian genomes (color-coded by sequencing technology). The contiguity of macaque (Mmul_10) is comparable to that of the human (GRCh38) and mouse (GRCm38.p6) reference genomes. (B) The number of gaps (red ticks) is compared against a synteny plot of Chinese (rheMacS) and Indian (Mmul_10) macaque chromosome 3 assemblies. (C) Comparison of potential orientation misassemblies based on Strand-seq analysis (18). Mmul_10 shows far fewer inversions (yellow; n = 13) than an earlier macaque assembly, Mmul_8.0.1 (blue; n = 82), predicting 34-fold fewer misoriented bases; 99.7% of gaps (white rectangles) in the earlier assembly are now closed.
Figure 2. Novel genes and gene models.
(A) A novel gene model with homology to the cytochrome P450 protein family is predicted by the AugustusPB mode of the Comparative Annotation Toolkit (CAT). The gene structure and protein domain architecture of three isoforms are shown (top). The predictions are supported by Iso-Seq reads from five tissues (middle). Orthologous novel genes are also predicted in the marmoset, orangutan, and gorilla assemblies; a protein alignment (bottom) of those genes along with the human CYP2C18 protein is shown. (B) Two macaque isoforms of ELN (tropoelastin) are predicted by the AugustusPB mode of CAT and are supported by macaque Iso-Seq data, but differ from the human gene by two exons. The gene structure and functional domains for the last seven exons of this gene are shown (top), along with a comparison to a human transcript model. These two protein-encoding exons are also observed in marmoset, owl monkey, and mouse, but not in apes, as a result of an ape-specific deletion (bottom) that changed the gene structure of tropoelastin.
Figure 3. Macaque ZNF669 gene family expansion.
(A) A 68 kbp region of collapsed assembly corresponds to the ZNF669 gene family, as indicated by the excess read depth and the increased number of paralogous sequence variants (PSVs, red dots) that diverge from the consensus sequence (black dots). The highly identical copies were thus unresolved in Mmul_10 and are predicted to be present in about 50 copies in macaque (left). Segmental Duplication Assembler (SDA) partitioned the long reads into 19 distinct paralog clusters (colored and numbered) based on shared PSVs and assembled these clusters into 18 contigs. Vertices represent individual PSVs and edges represent long-read sequences that contain both of the connected PSVs (right). SDA partitioned and assembled the remaining ZNF669 collapses into 35 additional contigs. The outlined PSV cluster corresponds to contig 2 in panel B. (B) Mapping of FLNC transcripts shows that they align better to SDA-resolved contigs than to the original assembly. (C) Annotation of these genes shows that these three contigs encode a highly expanded ZNF669 gene family, with FLNC data supporting complete open reading frames that differ by only a few amino acids. (D) FISH with BAC CH250–540H16 as a probe corresponding to a ZNF669 locus demonstrates interchromosomal duplications (red) on an interphase nucleus (right) and metaphase chromosomes (left), labeled by chromosome (human syntenic chromosome in parentheses).
Figure 4. Full-length LINE1 analyses.
A LINE1 subfamily network analysis comparing (A) an earlier macaque assembly, Mmul_8.0.1 (61 subfamilies), to (B) the new assembly, Mmul_10 (58 subfamilies) (18). Related subfamilies are connected by lines and clustered by color: L1RS37 (purple), L1PA7/8 (blue), L1PA6 (green), L1RS36 (pink), L1RS2 (red), L1RS10/16/21 (orange), and L1RS25 (teal). The size of each node corresponds to the relative number of LINE1 elements. There is an increase in annotated younger elements (orange), although the number of subfamilies has decreased in the L1RS36 cluster as a result of reassignment based on a higher-quality assembly. (C) The plot depicts the number of full-length LINE1 elements (y-axis) that have been assigned to a new chromosome in Mmul_10 (key) when compared to Mmul_8.0.1 (x-axis). (D) A similar analysis depicting the number of full-length LINE1 elements previously unplaced (n = 92) but now assigned to a chromosomal location in Mmul_10.
Figure 5. Evolution of L1RS elements.
(A) (Left) All full-length L1RS elements (>6000 nt, top schematic) were grouped by family and mapped to a consensus version of L1PA5 (the ancestral LINE-1 element from which they derive), with the first 700 nt (red) of the 5’ UTR analyzed further. Site 1 (brown) shows a coverage drop that is found in the majority of L1RS16 and younger families. Coverage drops at Site 2 (blue) and Site 3 (yellow) occur in the L1RS21 family at nearly the same time. (Right) Percentage of individual instances that do not map to the L1PA5 consensus for each L1RS family. Coverage drops are not found in old L1RS elements but are found in nearly all young elements, suggesting a fitness advantage for the changes at each site. (B) (Left) All full-length elements (>6000 nt) of the youngest L1RS families in four Old World monkey (OWM) genomes (L1RS10 in Rrox_v1/rhiRox1 [golden snub-nosed monkey] and L1RS2 in Panu_3.0/papAnu4 [baboon], Macaca_fascicularis_5.0/macFas5 [crab-eating macaque], and Mmul_10/rheMac10) were aligned to the L1PA5 consensus to generate coverage plots. The youngest human L1 (L1HS) was also aligned to L1PA5 as an outgroup. Drops in coverage (Site 1, Site 2, and Site 3) were seen in OWM, although golden snub-nosed monkeys (Rrox_v1/rhiRox1) display patterns distinct from those of other OWM, suggesting convergent but distinct changes in the 5’ UTR, possibly to escape repressive elements. (Right) An evolutionary model for shared and convergent changes in L1RS elements. Site 1 changes are shared among all OWM, while Site 2 and Site 3 changes in Rrox_v1/rhiRox1 are similar but not identical to those in other OWM. Coverage drops at Sites 1 and 3 are also observed in human, while Site 2 changes are OWM specific. (C) Schematic of Site 1, 2, and 3 (brown, blue, yellow) changes on the L1 5’ UTR in representative lineages: human, golden snub-nosed monkey, and rhesus. Rhesus macaque and golden snub-nosed monkey have identical coverage drops at Sites 1 and 2 that arose in the OWM common ancestor; golden snub-nosed monkeys also show larger changes (larger bars) spanning these sites that most likely occurred after the Colobinae divergence, as they are not observed in rhesus. Humans show a unique coverage drop at Site 1 that is larger than in rhesus but smaller than the large golden snub-nosed monkey-specific changes. All three species show unique changes at Site 3, resulting in elements of differing length.
Figure 6. Rhesus macaque population structure and developing macaque models of disease.
(A) A 3D principal component analysis (PCA) based on SNVs from sequencing 853 macaque genomes, filtered to exclude variants with missing call rates > 0.05 or minor allele frequency (MAF) < 0.1, shows clear separation of Chinese genomes (red) along PC1 and a gradient for Cayo macaques (green) with respect to other Indian macaques along PC2. (B) A PCA excluding Chinese and Cayo populations, comparing 771 macaques from different NPRCs. The Cattell–Nelson–Gorsuch (CNG) scree test retained the top three principal components in both PCAs, and the percent variance explained calculations are based on those three components. (C) Allele frequency distribution of likely gene-disruptive (LGD) variants, including splice acceptor, splice donor, stop gained, stop loss, and start loss variants (red), and missense variants (blue) compared to synonymous changes (green). (D) Genes implicated in human neurodevelopmental disorders (NDDs) showing naturally occurring putatively damaging variants in macaque orthologs. A schematic of damaging missense variants (blue; CADD >= 25) for the NDD genes MBD5, ARID1B, and SHANK3. For each variant, we indicate the amino acid change | CADD score | allele count. All potentially deleterious mutations are low frequency.
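The two thresholds named in panel A (missing call rate of 0.05 and allele frequency of 0.1) can be illustrated with a minimal per-variant filter. This is a sketch under stated assumptions rather than the authors' pipeline: genotypes are assumed to be encoded as the count of alternate alleles per diploid sample (0, 1, or 2) with `None` for a missing call, the allele frequency threshold is read as minor allele frequency, and the function name and defaults are hypothetical.

```python
def passes_filters(genotypes, max_missing=0.05, min_maf=0.1):
    """Decide whether one biallelic variant is kept for PCA.

    genotypes: per-sample alternate-allele counts (0, 1, 2), with None
               for a missing call (an assumed encoding, for illustration).
    A variant is dropped if its missing call rate exceeds max_missing
    or its minor allele frequency falls below min_maf.
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    if not called or missing_rate > max_missing:
        return False
    # Each diploid sample contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf


# Hypothetical toy variants across 20 samples.
common = [0] * 10 + [1] * 10      # alt frequency 0.25: kept
rare = [0] * 19 + [1]             # alt frequency 0.025: dropped
sparse = [None] * 2 + [1] * 18    # 10% missing calls: dropped
print(passes_filters(common), passes_filters(rare), passes_filters(sparse))
```

Taking the smaller of the alternate and reference allele frequencies makes the filter symmetric, so a variant is treated the same whichever allele happens to be in the reference assembly.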
