[Preprint]. 2023 May 2:2023.05.01.538953.
doi: 10.1101/2023.05.01.538953.

The landscape of tolerated genetic variation in humans and primates

Hong Gao  1 Tobias Hamp  1 Jeffrey Ede  1 Joshua G Schraiber  1 Jeremy McRae  1 Moriel Singer-Berk  2 Yanshen Yang  1 Anastasia Dietrich  1 Petko Fiziev  1 Lukas Kuderna  1   3 Laksshman Sundaram  1 Yibing Wu  1 Aashish Adhikari  1 Yair Field  1 Chen Chen  1 Serafim Batzoglou  1 Francois Aguet  1 Gabrielle Lemire  2   4 Rebecca Reimers  4 Daniel Balick  5 Mareike C Janiak  6 Martin Kuhlwilm  3   7   8 Joseph D Orkin  3   9 Shivakumara Manu  10   11 Alejandro Valenzuela  3 Juraj Bergman  12   13 Marjolaine Rousselle  12 Felipe Ennes Silva  14   15 Lidia Agueda  16 Julie Blanc  16 Marta Gut  16 Dorien de Vries  6 Ian Goodhead  6 R Alan Harris  17 Muthuswamy Raveendran  17 Axel Jensen  18 Idriss S Chuma  19 Julie Horvath  20   21   22   23   24 Christina Hvilsom  25 David Juan  3 Peter Frandsen  25 Fabiano R de Melo  26 Fabricio Bertuol  27 Hazel Byrne  28 Iracilda Sampaio  29 Izeni Farias  27 João Valsecchi do Amaral  30   31   32 Mariluce Messias  33   34 Maria N F da Silva  35 Mihir Trivedi  11 Rogerio Rossi  36 Tomas Hrbek  27   37 Nicole Andriaholinirina  38 Clément J Rabarivola  38 Alphonse Zaramody  38 Clifford J Jolly  39 Jane Phillips-Conroy  40 Gregory Wilkerson  41 Christian Abee  42 Joe H Simmons  41 Eduardo Fernandez-Duque  42   43 Sree Kanthaswamy  44 Fekadu Shiferaw  45 Dongdong Wu  46 Long Zhou  47 Yong Shao  46 Guojie Zhang  47   48   49   50   51 Julius D Keyyu  52 Sascha Knauf  53 Minh D Le  54 Esther Lizano  3   55 Stefan Merker  56 Arcadi Navarro  3   57   58   59 Thomas Bataillon  12 Tilo Nadler  60 Chiea Chuen Khor  61 Jessica Lee  62 Patrick Tan  61   63   64 Weng Khong Lim  63   64   65 Andrew C Kitchener  66   67 Dietmar Zinner  68   69 Ivo Gut  16   70 Amanda Melin  71   72 Katerina Guschanski  18   73 Mikkel Heide Schierup  12 Robin M D Beck  6 Govindhaswamy Umapathy  10   11 Christian Roos  74 Jean P Boubli  6 Monkol Lek  75 Shamil Sunyaev  76   5 Anne O'Donnell  2   4   77 Heidi Rehm  2   78 Jinbo Xu  1   79 Jeffrey Rogers  17 Tomas Marques-Bonet  3   16   55   57 Kyle Kai-How Farh  1

Hong Gao et al. bioRxiv. 2023.

Update in

  • The landscape of tolerated genetic variation in humans and primates.
    Gao H, Hamp T, Ede J, Schraiber JG, McRae J, Singer-Berk M, Yang Y, Dietrich ASD, Fiziev PP, Kuderna LFK, Sundaram L, Wu Y, Adhikari A, Field Y, Chen C, Batzoglou S, Aguet F, Lemire G, Reimers R, Balick D, Janiak MC, Kuhlwilm M, Orkin JD, Manu S, Valenzuela A, Bergman J, Rousselle M, Silva FE, Agueda L, Blanc J, Gut M, de Vries D, Goodhead I, Harris RA, Raveendran M, Jensen A, Chuma IS, Horvath JE, Hvilsom C, Juan D, Frandsen P, de Melo FR, Bertuol F, Byrne H, Sampaio I, Farias I, do Amaral JV, Messias M, da Silva MNF, Trivedi M, Rossi R, Hrbek T, Andriaholinirina N, Rabarivola CJ, Zaramody A, Jolly CJ, Phillips-Conroy J, Wilkerson G, Abee C, Simmons JH, Fernandez-Duque E, Kanthaswamy S, Shiferaw F, Wu D, Zhou L, Shao Y, Zhang G, Keyyu JD, Knauf S, Le MD, Lizano E, Merker S, Navarro A, Bataillon T, Nadler T, Khor CC, Lee J, Tan P, Lim WK, Kitchener AC, Zinner D, Gut I, Melin A, Guschanski K, Schierup MH, Beck RMD, Umapathy G, Roos C, Boubli JP, Lek M, Sunyaev S, O'Donnell-Luria A, Rehm HL, Xu J, Rogers J, Marques-Bonet T, Farh KK. Science. 2023 Jun 2;380(6648):eabn8153. doi: 10.1126/science.abn8197. Epub 2023 Jun 2. PMID: 37262156. Free PMC article.

Abstract

Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole genome sequencing data for 809 individuals from 233 primate species, and identified 4.3 million common protein-altering variants with orthologs in human. We show that these variants can be inferred to have non-deleterious effects in human based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases.

One sentence summary: Deep learning classifier trained on 4.3 million common primate missense variants predicts variant pathogenicity in humans.

Conflict of interest statement

Competing interests: Employees of Illumina, Inc. are indicated in the list of author affiliations. Serafim Batzoglou is currently affiliated with Seer, Inc. Heidi Rehm receives funding to support rare disease research and tool development from Illumina, Inc. and Microsoft, Inc. Patents related to this work are (1) title: Deep convolutional neural networks to predict variant pathogenicity using three-dimensional (3D) protein structures, filing No.: US 17/232,056, authors: Tobias Hamp, Kai-How Farh, Hong Gao; (2) title: Transfer learning-based use of protein contact maps for variant pathogenicity prediction, filing No.: US 17/876,481, authors: Chen Chen, Hong Gao, Laksshman Sundaram, Kai-How Farh; (3) title: Multi-channel protein voxelization to predict variant pathogenicity using deep convolutional neural networks, filing No.: US 17/703,935, authors: Tobias Hamp, Kai-How Farh, Hong Gao; (4) title: Transformer language model for variant pathogenicity, filing No.: US 17/975,536 and US 17/975,547, authors: Jeffrey Ede, Tobias Hamp, Anastasia Dietrich, Yibing Wu, Kai-How Farh.

Figures

Fig. 1. Common primate variants are largely benign in human.
(A) Counts of missense (solid green) and synonymous (shaded grey) variants from primates compared to the gnomAD database. Missense : synonymous counts and ratios are displayed above each bar. (B) Fractions of all possible human synonymous (grey) and missense variants (green) observed in primates. (C) Counts of benign (grey) and pathogenic (red) missense variants with two-star review status or above in the overall ClinVar database (left pie chart), compared to ClinVar variants observed in gnomAD (middle), and compared to ClinVar variants observed in primates (right). Conflicting benign and pathogenic annotations and variants interpreted only with uncertain significance were excluded. (D) Observed gnomAD (green) or primate (blue) missense variants in each amino acid position in the CACNA1A gene. Red circles represent the positions of annotated ClinVar pathogenic missense variants. Bottom scatterplot shows PrimateAI-3D predicted pathogenicity scores for all possible missense substitutions along the gene. (E) Multiple sequence alignment showing the ClinVar pathogenic variant chr11:77181548 G>A (red arrow) creating a cryptic splice site in the human sequence (extended splice motif, blue). This variant is tolerated in Cebus albifrons and other species with a G>C synonymous change in the adjacent nucleotide that stops the splice motif from forming. (F) Pie charts showing the fraction of benign (grey) and pathogenic (red) missense variants with ClinVar two-star review status or above in great apes, old world monkeys, new world monkeys, lemurs/tarsiers, mammals, chicken, and zebrafish. (G) Missense : synonymous ratios across the human allele frequency spectrum, with the missense : synonymous ratio (MSR) of human variants seen in primates shown for comparison. The blue dashed line represents the expected missense : synonymous ratio of de novo variants. Colors and legend are the same as (A).
Fig. 2. Selective constraint of primate genes compared to human.
(A) Scatter plot of missense : synonymous ratios between primate and human genes. Each gene is colored by its pLI score, with darker points showing haploinsufficient genes. (B) Observed and expected counts of synonymous (top) and missense (bottom) variants per gene in gnomAD (left) and primates (right). Genes are colored by their pLI scores. (C) Distributions of observed/expected ratios of synonymous (dashed lines) and missense (solid lines) variants for all genes. Results for primate genes (orange) and gnomAD genes (blue) are shown. (D) Scatter plot of missense : synonymous ratios between primate and human genes. Highlighted points are genes that are under significantly stronger (blue) or weaker (red) constraint in humans compared to non-human primates under both methods (Benjamini-Hochberg FDR < 0.05), while grey points show non-significant genes. The top 10 genes with the largest effect sizes in either direction are labeled.
Fig. 3. PrimateAI-3D architecture and variant classification performance.
(A) PrimateAI-3D workflow. Human protein structures and multiple sequence alignments are voxelized (left) as input to a 3D convolutional neural network that predicts pathogenicity of all possible point mutations of a target residue (middle). The network is trained using a loss function with three components (right): common human and primate variants; fill-in-the-blank of a protein structure; score ranks from language models. (B) Protein structure of the STK11 gene, colored by PrimateAI-3D pathogenicity prediction scores (blue: benign; red: pathogenic). Spheres indicate residues with common human and primate variants (left) or residues with pathogenic mutations from ClinVar (right). For spheres, the color corresponds to the pathogenicity score of only the variant. For other residues, pathogenicity scores are averaged over all variants at that site. (C) Scatterplot shows performance of methods that predict missense variant pathogenicity in two clinical benchmarks (DDD and UKBB). Datasets are a subset of variants for which all methods have predictions. (D) Six barplots show method performance for six testing datasets (DMS assays, UKBB, ClinVar, DDD, ASD, and CHD).
Fig. 4. Impact of training dataset size on classification accuracy.
(A) Improved performance of PrimateAI-3D with increasing number of common human and primate variants in the training dataset (x-axis). Performance of each dataset (y-axis) was divided by the maximum performance observed across all training dataset sizes. (B) Cumulative fractions of all possible human synonymous (grey) and missense (green) variants observed as common variants in 234 primate species, including human (allele frequency > 0.1%). Each point shows the average of ten permutations, calculated with a different random ordering of the list of primate species each time.
Fig. 5. Enrichment of de novo mutations in the neurodevelopmental disorder cohort over expectation.
(A) Enrichment of DNMs from Kaplanis et al. (87) across all genes. Enrichment ratios are given for synonymous, all missense, and protein-truncating variants (PTV), along with missense split by PrimateAI-3D score into benign (<0.821) and pathogenic (>0.821). (B) Enrichment of benign and pathogenic missense above expectation at varying PrimateAI-3D thresholds for classifying pathogenic missense.
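
The enrichment ratio described in the Fig. 5 caption can be read as observed de novo mutation (DNM) counts divided by the counts expected by chance, computed separately for missense variants scoring below and above the 0.821 threshold. The following is a minimal sketch of that arithmetic only, not the authors' pipeline: the `MissenseSite` record, the function names, and every count in `sites` are hypothetical, and the expected counts are assumed to come from some external mutation-rate model that the caption does not specify.

```python
from dataclasses import dataclass

# Threshold quoted in the Fig. 5 caption for splitting missense variants.
PATHOGENIC_THRESHOLD = 0.821


@dataclass
class MissenseSite:
    """Hypothetical record for one missense variant (illustration only)."""
    score: float     # pathogenicity score in [0, 1]
    observed: int    # DNMs observed in the cohort (toy value)
    expected: float  # DNMs expected under an assumed mutation-rate model


def enrichment(sites):
    """Return observed / expected DNM counts summed over the given sites."""
    expected = sum(s.expected for s in sites)
    if expected == 0:
        raise ValueError("expected count is zero; enrichment is undefined")
    return sum(s.observed for s in sites) / expected


def split_enrichment(sites, threshold=PATHOGENIC_THRESHOLD):
    """Enrichment for sites scoring below and above the threshold.

    Scores exactly equal to the threshold fall in neither group,
    mirroring the strict inequalities in the caption.
    """
    benign = [s for s in sites if s.score < threshold]
    pathogenic = [s for s in sites if s.score > threshold]
    return enrichment(benign), enrichment(pathogenic)


# Toy data, invented for illustration; not taken from the paper.
sites = [
    MissenseSite(score=0.20, observed=1, expected=1.0),
    MissenseSite(score=0.50, observed=1, expected=1.0),
    MissenseSite(score=0.90, observed=3, expected=1.0),
    MissenseSite(score=0.95, observed=5, expected=1.0),
]

benign_ratio, pathogenic_ratio = split_enrichment(sites)
print(f"benign: {benign_ratio:.2f}, pathogenic: {pathogenic_ratio:.2f}")
```

Calling `split_enrichment` over a range of `threshold` values would give the kind of curve described for panel (B), again under the assumption that expected counts are supplied by a separate model.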

References

    1. MacArthur D. G. et al., Guidelines for investigating causality of sequence variants in human disease. Nature 508, 469–476 (2014).
    2. Nussbaum R. L., Rehm H. L., ClinGen, ClinGen and Genetic Testing. N Engl J Med 373, 1379 (2015).
    3. Rehm H. L. et al., ClinGen--the Clinical Genome Resource. N Engl J Med 372, 2235–2242 (2015).
    4. Landrum M. J. et al., ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res 44, D862–868 (2016).
    5. Liu X., Wu C., Li C., Boerwinkle E., dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs. Hum Mutat 37, 235–241 (2016).
