Science. 2023 Jun 2;380(6648):eabn8153.
doi: 10.1126/science.abn8197. Epub 2023 Jun 2.

The landscape of tolerated genetic variation in humans and primates

Hong Gao #  1 Tobias Hamp #  1 Jeffrey Ede  1 Joshua G Schraiber  1 Jeremy McRae  1 Moriel Singer-Berk  2 Yanshen Yang  1 Anastasia S D Dietrich  1 Petko P Fiziev  1 Lukas F K Kuderna  1   3 Laksshman Sundaram  1 Yibing Wu  1 Aashish Adhikari  1 Yair Field  1 Chen Chen  1 Serafim Batzoglou  1 Francois Aguet  1 Gabrielle Lemire  2   4 Rebecca Reimers  4   5 Daniel Balick  5   6 Mareike C Janiak  7 Martin Kuhlwilm  3   8   9 Joseph D Orkin  3   10 Shivakumara Manu  11   12 Alejandro Valenzuela  3 Juraj Bergman  13   14 Marjolaine Rousselle  13 Felipe Ennes Silva  15   16 Lidia Agueda  17 Julie Blanc  17 Marta Gut  17 Dorien de Vries  7 Ian Goodhead  7 R Alan Harris  18 Muthuswamy Raveendran  18 Axel Jensen  19 Idriss S Chuma  20 Julie E Horvath  21   22   23   24   25 Christina Hvilsom  26 David Juan  3 Peter Frandsen  26 Fabiano R de Melo  27 Fabrício Bertuol  28 Hazel Byrne  29 Iracilda Sampaio  30 Izeni Farias  28 João Valsecchi do Amaral  31   32   33 Mariluce Messias  34   35 Maria N F da Silva  36 Mihir Trivedi  12 Rogerio Rossi  37 Tomas Hrbek  28   38 Nicole Andriaholinirina  39 Clément J Rabarivola  39 Alphonse Zaramody  39 Clifford J Jolly  40 Jane Phillips-Conroy  41 Gregory Wilkerson  42 Christian Abee  42 Joe H Simmons  42 Eduardo Fernandez-Duque  43   44 Sree Kanthaswamy  45 Fekadu Shiferaw  46 Dongdong Wu  47 Long Zhou  48 Yong Shao  47 Guojie Zhang  48   49   47   50   51 Julius D Keyyu  52 Sascha Knauf  53 Minh D Le  54 Esther Lizano  3   55 Stefan Merker  56 Arcadi Navarro  3   57   58   59 Thomas Bataillon  13 Tilo Nadler  60 Chiea Chuen Khor  61 Jessica Lee  62 Patrick Tan  61   63   64 Weng Khong Lim  63   64   65 Andrew C Kitchener  66   67 Dietmar Zinner  68   69   70 Ivo Gut  17   71 Amanda Melin  72   73   74 Katerina Guschanski  19   75 Mikkel Heide Schierup  13 Robin M D Beck  7 Govindhaswamy Umapathy  11   12 Christian Roos  76 Jean P Boubli  7 Monkol Lek  77 Shamil Sunyaev  5   6 Anne O'Donnell-Luria  2   4   78 Heidi L Rehm  2   77   79 
Jinbo Xu  1   80 Jeffrey Rogers  18 Tomas Marques-Bonet  3   17   55   57 Kyle Kai-How Farh  1

Abstract

Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole-genome sequencing data for 809 individuals from 233 primate species and identified 4.3 million common protein-altering variants with orthologs in humans. We show that these variants can be inferred to have nondeleterious effects in humans based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases.

Conflict of interest statement

Employees of Illumina, Inc. are indicated in the list of author affiliations. Serafim Batzoglou is currently affiliated with Seer, Inc. Heidi L. Rehm receives funding to support rare disease research and tool development from Illumina, Inc. and Microsoft, Inc. Patents related to this work are (1) title: Deep convolutional neural networks to predict variant pathogenicity using three-dimensional (3D) protein structures, filing number US 17/232,056, authors: Tobias Hamp, Kai-How Farh, Hong Gao; (2) title: Transfer learning-based use of protein contact maps for variant pathogenicity prediction, filing number US 17/876,481, authors: Chen Chen, Hong Gao, Laksshman Sundaram, Kai-How Farh; (3) title: Multichannel protein voxelization to predict variant pathogenicity using deep convolutional neural networks, filing number US 17/703,935, authors: Tobias Hamp, Kai-How Farh, Hong Gao; (4) title: Transformer language model for variant pathogenicity, filing numbers US 17/975,536 and US 17/975,547, authors: Jeffrey Ede, Tobias Hamp, Anastasia Dietrich, Yibing Wu, Kai-How Farh; (5) title: Identifying genes with differential selective constraint between humans and nonhuman primates, filing number US 63/294,820, authors: H. G., J. G. Schraiber, K.-H. Farh.

Figures

Fig. 1. Common primate variants are largely benign in humans.
(A) Counts of missense (solid green) and synonymous (shaded gray) variants from primates compared with the gnomAD database. Missense:synonymous counts and ratios are displayed above each bar. (B) Fractions of all possible human synonymous (gray) and missense variants (green) observed in primates. (C) Counts of benign (gray) and pathogenic (red) missense variants with two-star review status or above in the overall ClinVar database (left pie chart), compared with ClinVar variants observed in gnomAD (middle), and compared with ClinVar variants observed in primates (right). Conflicting benign and pathogenic annotations and variants interpreted only with uncertain significance were excluded. (D) Observed gnomAD (green) or primate (blue) missense variants in each amino acid position in the CACNA1A gene. Red circles represent the positions of annotated ClinVar pathogenic missense variants. Bottom scatterplot shows PrimateAI-3D predicted pathogenicity scores for all possible missense substitutions along the gene. (E) Multiple sequence alignment showing the ClinVar pathogenic variant chr11:77181548 G>A (red arrow) creating a cryptic splice site in human sequence (extended splice motif, blue). This variant is tolerated in Cebus albifrons and other species with a G>C synonymous change in the adjacent nucleotide that stops the splice motif from forming. (F) Pie charts showing the fraction of benign (gray) and pathogenic (red) missense variants with ClinVar two-star review status or above in great apes, Old World monkeys, New World monkeys, lemurs/tarsiers, mammals, chicken, and zebrafish. (G) Missense:synonymous ratios (MSR) across the human allele frequency spectrum, with MSR of human variants seen in primates shown for comparison. The blue dashed line represents the expected missense:synonymous ratio of de novo variants. Colors and legend are the same as (A).
Fig. 2. Selective constraint of primate genes compared with humans.
(A) Scatter plot of missense:synonymous ratios between primate and human genes. Each gene is colored by its pLI score, with darker points showing haploinsufficient genes. (B) Observed and expected counts of synonymous (top) and missense (bottom) variants per gene in gnomAD (left) and primates (right). Genes are colored by their pLI scores. (C) Distributions of observed and expected ratios of synonymous (dashed lines) and missense (solid lines) variants for all genes. Results for primate genes (orange) and gnomAD genes (blue) are shown. (D) Scatter plot of missense:synonymous ratios between primate and human genes. Highlighted points are genes that are under significantly stronger (blue) or weaker (red) constraint in humans compared with nonhuman primates under both methods (Benjamini-Hochberg FDR < 0.05) and gray points show nonsignificant genes. The top 10 genes with the largest effect sizes in either direction are labeled.
Fig. 3. PrimateAI-3D architecture and variant classification performance.
(A) PrimateAI-3D workflow. Human protein structures and multiple sequence alignments are voxelized (left) as input to a 3D convolutional neural network that predicts pathogenicity of all possible point mutations of a target residue (middle). The network is trained using a loss function with three components (right): common human and primate variants; fill-in-the-blank of a protein structure; score ranks from language models. (B) Protein structure of the STK11 gene, colored by PrimateAI-3D pathogenicity prediction scores (blue, benign; red, pathogenic). Spheres indicate residues with common human and primate variants (left) or residues with pathogenic mutations from ClinVar (right). For spheres, the color corresponds to the pathogenicity score of only the variant. For other residues, pathogenicity scores are averaged over all variants at that site. (C) Scatterplot shows performance of methods that predict missense variant pathogenicity in two clinical benchmarks (DDD and UKBB). Datasets are a subset of variants for which all methods have predictions. (D) Six barplots show method performance for six testing datasets (DMS assays, UKBB, ClinVar, DDD, ASD, and CHD).
Fig. 4. Impact of training dataset size on classification accuracy.
(A) Improved performance of PrimateAI-3D with increasing number of common human and primate variants in the training dataset (x-axis). Performance of each dataset (y-axis) was divided by the maximum performance observed across all training dataset sizes. (B) Cumulative fractions of all possible human synonymous (gray) and missense (green) variants observed as common variants in 234 primate species, including humans (allele frequency > 0.1%). Each point shows the average of 10 permutations, calculated with a different random ordering of the list of primate species each time.
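
The averaging procedure described for panel (B) of Fig. 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `cumulative_fraction`, the dictionary layout, and the toy data are hypothetical, and the sketch assumes each species' set already holds only its common variants.

```python
import random


def cumulative_fraction(species_variants, n_possible, n_permutations=10, seed=0):
    """Average cumulative fraction of possible variants seen as species are added.

    species_variants: dict mapping a species name to a set of variant identifiers
                      (hypothetical layout; assumed pre-filtered to common variants).
    n_possible:       total number of possible variants (the denominator).
    n_permutations:   number of random species orderings to average over.
    """
    rng = random.Random(seed)
    species = list(species_variants)
    totals = [0.0] * len(species)

    for _ in range(n_permutations):
        order = species[:]
        rng.shuffle(order)          # a different random ordering each time
        seen = set()
        for i, name in enumerate(order):
            seen |= species_variants[name]      # union of variants seen so far
            totals[i] += len(seen) / n_possible

    # Average each position of the curve over all permutations.
    return [t / n_permutations for t in totals]


# Toy data, purely illustrative.
toy = {"species_a": {1, 2}, "species_b": {2, 3}, "species_c": {4}}
curve = cumulative_fraction(toy, n_possible=10)
```

The last point of the curve does not depend on the ordering, because it always covers the union of all species. Earlier points do depend on it, which is why they are averaged over several orderings.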
Fig. 5. Enrichment of de novo mutations in the neurodevelopmental disorder cohort over expectation.
(A) Enrichment of DNMs from Kaplanis et al. (87) across all genes. Enrichment ratios are given for synonymous, all missense, and protein-truncating variants (PTV), along with missense split by PrimateAI-3D score into benign (<0.821) and pathogenic (>0.821). (B) Enrichment of benign and pathogenic missense above expectation at varying PrimateAI-3D thresholds for classifying pathogenic missense.
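
The enrichment split in Fig. 5 can be illustrated with a short sketch, assuming that enrichment is the ratio of observed to expected de novo mutation counts. Only the 0.821 threshold comes from the caption. The function names, the input scores, and the expected counts are hypothetical, and how the expected counts are derived is not stated here, so they are taken as given inputs.

```python
THRESHOLD = 0.821  # PrimateAI-3D cutoff quoted in the Fig. 5 caption


def enrichment(observed, expected):
    """Ratio of observed to expected counts (assumed definition of enrichment)."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


def split_missense_enrichment(scores, expected_benign, expected_pathogenic,
                              threshold=THRESHOLD):
    """Split missense de novo mutations by score and compute each enrichment.

    scores: pathogenicity scores of the observed missense mutations.
    expected_benign / expected_pathogenic: expected counts in each class,
        supplied by the caller (hypothetical inputs).
    Scores exactly equal to the threshold fall in neither class, mirroring
    the strict inequalities in the caption.
    """
    n_benign = sum(1 for s in scores if s < threshold)
    n_pathogenic = sum(1 for s in scores if s > threshold)
    return {
        "benign": enrichment(n_benign, expected_benign),
        "pathogenic": enrichment(n_pathogenic, expected_pathogenic),
    }


# Toy scores and expectations, purely illustrative.
result = split_missense_enrichment(
    [0.10, 0.50, 0.90, 0.95, 0.99],
    expected_benign=4.0,
    expected_pathogenic=1.0,
)
```

Repeating the call over a range of `threshold` values gives the kind of threshold sweep described for panel (B).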

Update of

  • The landscape of tolerated genetic variation in humans and primates.
    Gao H, Hamp T, Ede J, Schraiber JG, McRae J, Singer-Berk M, Yang Y, Dietrich A, Fiziev P, Kuderna L, Sundaram L, Wu Y, Adhikari A, Field Y, Chen C, Batzoglou S, Aguet F, Lemire G, Reimers R, Balick D, Janiak MC, Kuhlwilm M, Orkin JD, Manu S, Valenzuela A, Bergman J, Rouselle M, Silva FE, Agueda L, Blanc J, Gut M, de Vries D, Goodhead I, Harris RA, Raveendran M, Jensen A, Chuma IS, Horvath J, Hvilsom C, Juan D, Frandsen P, de Melo FR, Bertuol F, Byrne H, Sampaio I, Farias I, do Amaral JV, Messias M, da Silva MNF, Trivedi M, Rossi R, Hrbek T, Andriaholinirina N, Rabarivola CJ, Zaramody A, Jolly CJ, Phillips-Conroy J, Wilkerson G, Abee C, Simmons JH, Fernandez-Duque E, Kanthaswamy S, Shiferaw F, Wu D, Zhou L, Shao Y, Zhang G, Keyyu JD, Knauf S, Le MD, Lizano E, Merker S, Navarro A, Batallion T, Nadler T, Khor CC, Lee J, Tan P, Lim WK, Kitchener AC, Zinner D, Gut I, Melin A, Guschanski K, Schierup MH, Beck RMD, Umapathy G, Roos C, Boubli JP, Lek M, Sunyaev S, O'Donnell A, Rehm H, Xu J, Rogers J, Marques-Bonet T, Kai-How Farh K. bioRxiv [Preprint]. 2023 May 2:2023.05.01.538953. doi: 10.1101/2023.05.01.538953. Update in: Science. 2023 Jun 2;380(6648):eabn8153. doi: 10.1126/science.abn8197. PMID: 37205491. Free PMC article.

References

    1. MacArthur DG et al., Guidelines for investigating causality of sequence variants in human disease. Nature 508, 469–476 (2014). doi: 10.1038/nature13127; pmid: 24759409
    2. Nussbaum RL, Rehm HL; ClinGen, ClinGen and Genetic Testing. N. Engl. J. Med. 373, 1379 (2015). pmid: 26430707
    3. Rehm HL et al., ClinGen—The Clinical Genome Resource. N. Engl. J. Med. 372, 2235–2242 (2015). doi: 10.1056/NEJMsr1406261; pmid: 26014595
    4. Landrum MJ et al., ClinVar: Public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016). doi: 10.1093/nar/gkv1222; pmid: 26582918
    5. Liu X, Wu C, Li C, Boerwinkle E, dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs. Hum. Mutat. 37, 235–241 (2016). doi: 10.1002/humu.22932; pmid: 26555599