Proc Natl Acad Sci U S A. 2019 Apr 30;116(18):8960-8965.
doi: 10.1073/pnas.1820813116. Epub 2019 Apr 15.

Functional characterization of 3D protein structures informed by human genetic diversity

Michael Hicks et al. Proc Natl Acad Sci U S A. 2019.

Abstract

Sequence variation data of the human proteome can be used to analyze 3D protein structures to derive functional insights. We used genetic variant data from nearly 140,000 individuals to analyze 3D positional conservation in 4,715 proteins and 3,951 homology models using 860,292 missense and 465,886 synonymous variants. Sixty percent of protein structures harbor at least one intolerant 3D site as defined by significant depletion of observed over expected missense variation. Structural intolerance data correlated with deep mutational scanning functional readouts for PPARG, MAPK1/ERK2, UBE2I, SUMO1, PTEN, CALM1, CALM2, and TPK1 and with shallow mutagenesis data for 1,026 proteins. The 3D structural intolerance analysis revealed different features for ligand binding pockets and orthosteric and allosteric sites. Large-scale data on human genetic variation support a definition of functional 3D sites proteome-wide.

Keywords: deep mutational scanning; exome; genome constraint; protein structure.
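
The abstract defines an intolerant 3D site by significant depletion of observed relative to expected missense variation. The following is a minimal sketch of that idea for a single site, not the authors' 3DTS model: it assumes the expected count is already known and tests depletion with a one-sided Poisson probability. The function names, the Poisson assumption, the 0.05 significance level, and the example counts are all illustrative.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def depletion(observed: int, expected: float, alpha: float = 0.05):
    """Return (obs/exp ratio, one-sided p-value, is_depleted) for one site.

    Assumption: missense counts at a site are Poisson-distributed around
    the expected count. The paper's actual model is not given in the abstract.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    ratio = observed / expected
    p_value = poisson_cdf(observed, expected)  # chance of seeing this few or fewer
    return ratio, p_value, p_value < alpha


# Hypothetical site: 2 missense variants observed where 12.5 were expected.
ratio, p_value, depleted = depletion(observed=2, expected=12.5)
print(f"obs/exp = {ratio:.2f}, p = {p_value:.2g}, depleted = {depleted}")
```

A proteome-wide analysis would also need a correction for multiple testing across sites, which this sketch omits.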

Conflict of interest statement

M.H. is an employee of Human Longevity, Inc. J.C.V. owns stock in Human Longevity, Inc.

Figures

Fig. 1.
Three-dimensional tolerance to variation in the proteome. (A) Missense variation data from genome and exome sequencing projects are mapped to 3D protein structures. Features extracted from UniProt are also mapped to the 3D structures. Using these features as reference points, a 3D context is constructed, and the corresponding genetic data are extracted. A 3D tolerance score (3DTS) is generated from this information. The 3DTS values are projected back onto the 3D structure. (B) The distribution of tolerance values across the structural proteome for 139,535 3D sites in structures representing 4,715 proteins. The 3DTS value at the 20th percentile (3DTS < 0.14) is used to define intolerant sites. (C) Median 3DTS for a subset of feature types, with interquartile ranges (IQR). The number of features of each type with a 3DTS value is shown above each column. The overall median across the structural proteome is represented by a horizontal dashed line. Feature types are colored by the subsections defined by UniProt (https://www.uniprot.org/help/sequence_annotation).
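
Panel B defines intolerant sites by a percentile cutoff on the score distribution. As a rough illustration of that step only, the sketch below computes a linearly interpolated 20th percentile over a handful of made-up scores and lists the sites that fall below it. The site labels, score values, and the interpolation rule are assumptions; the caption does not say how the percentile was computed.

```python
import math


def percentile(values, pct):
    """Linearly interpolated percentile of a non-empty list (pct in 0..100)."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = pct / 100 * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def intolerant_sites(scores, pct=20):
    """Return (cutoff, sorted ids of sites scoring strictly below the cutoff)."""
    cutoff = percentile(list(scores.values()), pct)
    return cutoff, sorted(site for site, value in scores.items() if value < cutoff)


# Hypothetical tolerance scores for ten 3D sites (not data from the paper).
scores = {
    "A": 0.02, "B": 0.08, "C": 0.15, "D": 0.20, "E": 0.24,
    "F": 0.27, "G": 0.31, "H": 0.40, "I": 0.55, "J": 0.70,
}
cutoff, flagged = intolerant_sites(scores)
print(f"cutoff = {cutoff:.3f}, intolerant sites = {flagged}")
```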
Fig. 2.
Validation of 3DTS. (A) Comparison of deep mutational screen data and in silico 3DTS data for the DNA-binding and ligand-binding domains of PPARG. (Top) Projection of the functional scores described in Majithia et al. (23) for each amino acid and the scores averaged across the 3DTS-defined sites for the crystal structure 3dzy (32). The color scheme is chosen to match the one described in Majithia et al. (Bottom) A projection of 3DTS onto PPARG is shown on the Left, and the 3D site-level correlation between 3DTS and the 3D site-averaged in vitro functional scores is shown in the plot on the Right. (B) Comparison of deep mutational screen data and 3DTS under different modeling assumptions for all available PDB structures covering 70% of the canonical protein length for nine genes. “Structure” refers to 3D sites defined by secondary structure elements, and “Allfeatures” uses 3D sites defined by all UniProt features as detailed in the Materials and Methods. “Constant” and “heptamer” refer to the mutation rates as discussed in the Materials and Methods. (C) Comparison of the optimal 3DTS model to 23 other scoring methods at the 3D site level for nine genes. Pearson r² values for comparisons of deep mutational screen data and in silico data at the 3D site level for the nine genes are provided. “NaN” refers to methods with unavailable scores. (D) Shallow mutagenesis data proteome-wide. Here, 3DTS identifies functional sites (loss of function) as more constrained (lower 3DTS values) than the rest of the protein at all levels of global gene essentiality. pLI > 0.9 (essential gene) functional to background Kolmogorov–Smirnov two-sided test P value = 9.3E-31; 0.1 < pLI < 0.9 functional to background Kolmogorov–Smirnov two-sided test P value = 2.3E-20; pLI < 0.1 functional to background Kolmogorov–Smirnov two-sided test P value = 1.1E-18.
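
Panels A and C compare scores at the 3D site level: per-residue functional scores are averaged within each site, and the site averages are correlated with the site's tolerance score. The sketch below shows that two-step calculation under simple assumptions. The residue-to-site mapping, the score values, and the helper names are hypothetical, and the squared Pearson coefficient is computed from its textbook definition rather than taken from the authors' pipeline.

```python
from collections import defaultdict
from statistics import mean


def site_means(residue_scores, residue_to_site):
    """Average per-residue scores within each 3D site; unmapped residues are skipped."""
    grouped = defaultdict(list)
    for residue, score in residue_scores.items():
        site = residue_to_site.get(residue)
        if site is not None:
            grouped[site].append(score)
    return {site: mean(values) for site, values in grouped.items()}


def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy * sxy / (sxx * syy)


# Hypothetical inputs: functional scores per residue and a residue-to-site map.
functional = {1: 0.2, 2: 0.4, 3: 1.0, 4: 1.2, 5: 2.0, 6: 2.2}
site_of = {1: "helix1", 2: "helix1", 3: "strand1", 4: "strand1", 5: "loop1", 6: "loop1"}
tolerance = {"helix1": 0.05, "strand1": 0.15, "loop1": 0.30}  # made-up site scores

averaged = site_means(functional, site_of)
sites = sorted(averaged)
r2 = pearson_r2([tolerance[s] for s in sites], [averaged[s] for s in sites])
print(f"site-level r^2 = {r2:.3f}")
```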
Fig. 3.
Characteristics of druggable sites. (A) Binned 3DTS scores describing active sites, allosteric sites, protein–protein interaction sites, drug ligand-binding sites, and background. The binned fractions for each site type sum to 1. Active site vs. background Kolmogorov–Smirnov two-sided test P value = 4.9E-110. Allosteric site vs. background Kolmogorov–Smirnov two-sided test P value = 1.1E-84. Protein–protein interaction site vs. background Kolmogorov–Smirnov two-sided test P value = 1.8E-89. Drug ligand-binding site vs. background Kolmogorov–Smirnov two-sided test P value = 3.0E-75. (B) Counts of tolerant and intolerant drug ligand-binding sites grouped by therapeutic area. Here, tolerant is defined as 3DTS > 0.24 (50th percentile of 3DTS), while intolerant is defined as described in the text (3DTS < 0.14; 20th percentile of 3DTS); drug-binding sites between these 3DTS values are not included. See Dataset S3 for full details about this dataset.
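
The two ingredients of this figure can be sketched briefly: the tolerant/intolerant labels from panel B, using the 0.14 and 0.24 thresholds stated in the caption, and the two-sample Kolmogorov–Smirnov statistic that underlies the site-versus-background comparisons in panel A. The score lists below are invented for illustration. Only the statistic D is computed here; a P value would need the KS null distribution (for instance via `scipy.stats.ks_2samp`, which this sketch does not use).

```python
from bisect import bisect_right

INTOLERANT_MAX = 0.14  # 20th percentile of 3DTS, as stated in the caption
TOLERANT_MIN = 0.24    # 50th percentile of 3DTS, as stated in the caption


def classify(score):
    """Label a site; scores between the two thresholds are excluded (None)."""
    if score < INTOLERANT_MAX:
        return "intolerant"
    if score > TOLERANT_MIN:
        return "tolerant"
    return None


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    for x in a + b:  # the maximum gap is always reached at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical scores for ligand-binding sites and for background sites.
binding = [0.03, 0.07, 0.10, 0.12, 0.18, 0.26]
background = [0.11, 0.19, 0.25, 0.30, 0.34, 0.41, 0.52]

labels = [classify(s) for s in binding]
print("intolerant:", labels.count("intolerant"), "tolerant:", labels.count("tolerant"))
print(f"KS statistic D = {ks_statistic(binding, background):.3f}")
```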

References

    1. Auton A, et al.; 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74.
    2. Telenti A, et al. Deep sequencing of 10,000 human genomes. Proc Natl Acad Sci USA. 2016;113:11901–11906.
    3. Lek M, et al.; Exome Aggregation Consortium. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291.
    4. Biesecker LG, Green RC. Diagnostic clinical genome and exome sequencing. N Engl J Med. 2014;371:1170.
    5. Kircher M, et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46:310–315.