PLoS Comput Biol. 2022 Jan 24;18(1):e1009818.
doi: 10.1371/journal.pcbi.1009818. eCollection 2022 Jan.

The structural coverage of the human proteome before and after AlphaFold

Eduard Porta-Pardo et al. PLoS Comput Biol.

Abstract

The protein structure field is experiencing a revolution, driven by the increased throughput of experimental structure determination, by developments such as cryo-EM that allow us to solve the structures of large protein complexes and, more recently, by artificial intelligence tools such as AlphaFold that can predict with high accuracy the folding of proteins for which few homology templates are available. Here we quantify the effect of the recently released AlphaFold database of protein structural models on our knowledge of human proteins. Our results indicate that the current baseline structural coverage of 48%, considering experimentally derived structures or template-based homology models, rises to 76% when AlphaFold predictions are included. At the same time, the fraction of dark proteome is reduced from 26% to just 10% when AlphaFold models are considered. Furthermore, although the coverage of disease-associated genes and mutations was near complete before the AlphaFold release (69% of ClinVar pathogenic mutations and 88% of oncogenic mutations), AlphaFold models still provide an additional coverage of 3% to 13% of these critically important sets of biomedical genes and mutations. Finally, we show that the contribution of AlphaFold models to the structural coverage of non-human organisms, including important pathogenic bacteria, is significantly larger than their contribution to the human proteome. Overall, our results show that the sequence-structure gap of human proteins has almost disappeared, an outstanding success with direct consequences for our knowledge of the human genome and the medical applications derived from it.
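
To make the notion of residue-level structural coverage concrete, the following is a minimal sketch of how a "before" and "after" coverage fraction could be computed for a set of proteins. It is not the authors' pipeline: the input layout (`template_intervals`, `plddt`), the function names, and the default pLDDT cutoff of 70 are all assumptions made for illustration, and the toy proteins are invented.

```python
def covered_residues(length, intervals):
    """Return the set of 1-based positions covered by any (start, end) interval."""
    covered = set()
    for start, end in intervals:
        # Clip each interval to the protein so stray coordinates are ignored.
        covered.update(range(max(start, 1), min(end, length) + 1))
    return covered


def confident_residues(plddt, threshold):
    """Return 1-based positions whose per-residue pLDDT meets the threshold."""
    return {i + 1 for i, score in enumerate(plddt) if score >= threshold}


def proteome_coverage(proteins, threshold=70.0):
    """Residue-level coverage before and after adding predicted models.

    `proteins` is a list of dicts with two hypothetical keys:
      - "template_intervals": (start, end) regions covered by experimental
        structures or homology models
      - "plddt": one confidence score per residue of the predicted model
    The threshold is an illustrative assumption, not a value from the paper.
    """
    total = before = after = 0
    for protein in proteins:
        length = len(protein["plddt"])
        template = covered_residues(length, protein["template_intervals"])
        predicted = confident_residues(protein["plddt"], threshold)
        total += length
        before += len(template)
        after += len(template | predicted)  # union avoids double counting
    return before / total, after / total


# Invented toy data: one partly solved protein and one with no template at all.
toy = [
    {"template_intervals": [(1, 4)], "plddt": [95.0] * 6 + [40.0] * 4},
    {"template_intervals": [], "plddt": [80.0] * 10},
]

if __name__ == "__main__":
    before, after = proteome_coverage(toy)
    print(f"coverage before: {before:.0%}, after: {after:.0%}")
```

Counting residues rather than proteins means that a long protein with a single solved domain contributes only that domain to the total. Taking the union of the two residue sets ensures that positions covered by both a template and a confident prediction are counted once.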

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Current coverage of the human proteome.
a-b) Barplot showing the absolute (a) or relative (b) number of PDB coordinate files mapping to human proteomes at >95%, 95–50% and 50–20% thresholds of sequence identity. Legends in barplots a and b are the same. c) Evolution of the coverage of the human proteome by three-dimensional coordinate files in the Protein Data Bank (y-axis) according to the minimum percent identity of the BLAST hits (x-axis). Each line represents the coverage using only the coordinate files available in PDB in a given year. d) Barplot showing the coverage of the human proteome by different types of structural features, both linear (PFAM domains and IDRs) and three-dimensional (PDB) (y-axis is the same as in c). e) Coverage of the proteome by different AlphaFold pLDDT score thresholds (y-axis is the same as in c). f) Coverage (y-axis) of different types of regions (x-axis) depending on AlphaFold confidence levels. g) Current coverage (y-axis) of the human proteome.
Fig 2. Changes in the structural coverage at the protein level after AlphaFold.
a) Histogram showing the number of proteins (y-axis) according to their structural coverage (x-axis) before (left) and after (right) the release of AlphaFold models. b) Histogram showing the number of proteins for which we previously had less than 1% structural coverage (y-axis) according to their current structural coverage after AlphaFold (x-axis). c) Same as b but now including only high-confidence (pLDDT > 90) AlphaFold predictions. d) Histogram showing how much AlphaFold high-confidence predictions contribute (x-axis) to our coverage of proteins with >95% structural coverage. e-g) AlphaFold models for the previously structureless AGMO, DEGS1 and PEMT proteins. Models are colored on a blue-red scale according to the pLDDT score of each residue, with red representing low pLDDT and blue high pLDDT.
Fig 3. Changes in structural coverage of biomedical proteins due to AlphaFold models.
a) Current structural coverage (y-axis) of different subsets of proteins (x-axis). Bars are colored according to the source of the structural coverage. b) Same as a but focusing on ClinVar mutations classified by their pathogenicity (x-axis). c) Same as a but focusing on somatic mutations from TCGA, classified by their likely oncogenicity (x-axis). d) Same as a but focusing on oncogenic mutations from BoostDM. e) AlphaFold model for B3GALT6. Residues are colored according to their pLDDT from red (lower values) to blue (higher values). Pathogenic mutations from ClinVar are highlighted in yellow. f) AlphaFold model for MED12. Coloring is the same as for e, but yellow residues indicate oncogenic mutations.
Fig 4. Changes in protein structural coverage in other organisms.
a) Comparison of the structural coverage (y-axis) of the five different organisms (x-axis) based on PDB sequence identity. b) Additional structural coverage provided by AlphaFold models in the different species, split by pLDDT score. c) Current high quality structural coverage of the five organisms combining PDB and AlphaFold data.
