Integration of over 9,000 mass spectrometry experiments builds a global map of human protein complexes
- PMID: 28596423
- PMCID: PMC5488662
- DOI: 10.15252/msb.20167490
Abstract
Macromolecular protein complexes carry out many of the essential functions of cells, and many genetic diseases arise from disrupting the functions of such complexes. Currently, there is great interest in defining the complete set of human protein complexes, but recent published maps lack comprehensive coverage. Here, through the synthesis of over 9,000 published mass spectrometry experiments, we present hu.MAP, the most comprehensive and accurate human protein complex map to date, containing > 4,600 total complexes, > 7,700 proteins, and > 56,000 unique interactions, including thousands of confident protein interactions not identified by the original publications. hu.MAP accurately recapitulates known complexes withheld from the learning procedure, which was optimized with the aid of a new quantitative metric (k-cliques) for comparing sets of sets. The vast majority of complexes in our map are significantly enriched with literature annotations, and the map overall shows improved coverage of many disease-associated proteins, as we describe in detail for ciliopathies. Using hu.MAP, we predicted and experimentally validated candidate ciliopathy disease genes in vivo in a model vertebrate, discovering CCDC138, WDR90, and KIAA1328 to be new cilia basal body/centriolar satellite proteins, and identifying ANKRD55 as a novel member of the intraflagellar transport machinery. By offering significant improvements to the accuracy and coverage of human protein complexes, hu.MAP (http://proteincomplexes.org) serves as a valuable resource for better understanding the core cellular functions of human proteins and helping to determine mechanistic foundations of human disease.
Keywords: cilia; ciliopathy; human interactome; mass spectrometry; protein complexes; proteomics.
© 2017 The Authors. Published under the terms of the CC BY 4.0 license.
Figures
Graphical schematic of spoke model applied to AP‐MS datasets. In the spoke model, all interactions must include a bait protein.
Venn diagram of overlap between published large‐scale protein interaction networks BioPlex (AP‐MS), Hein et al (AP‐MS), and Wan et al (CF‐MS). Protein interactions in BioPlex and Hein et al were generated from a spoke model.
Graphical schematic of matrix model applied to AP‐MS datasets. In the matrix model, interactions are allowed between prey proteins.
Venn diagram of overlap between protein interaction networks where a weighted matrix model was applied to BioPlex and Hein et al. For this analysis only, the sizes of the weighted matrix model networks were held equal to those of the published networks; the full networks were used for integration. Note the increase in the overall number of overlapping interactions compared with (B).
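To make the distinction between the spoke and matrix models concrete, the following minimal Python sketch enumerates the protein pairs that each model would consider for a single AP‐MS experiment. The function names and the bait and prey labels are illustrative assumptions, not identifiers from the hu.MAP pipeline.

```python
from itertools import combinations


def spoke_pairs(bait, preys):
    """Spoke model: every candidate interaction must include the bait."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_pairs(bait, preys):
    """Matrix model: every pair among the bait and its preys is a candidate."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}


# Hypothetical experiment: bait A co-purifies preys B, C, and D.
spoke = spoke_pairs("A", ["B", "C", "D"])
matrix = matrix_pairs("A", ["B", "C", "D"])

# Prey-prey pairs such as B-C appear only under the matrix model.
prey_prey_only = matrix - spoke
```

The unweighted matrix model shown here admits every prey–prey pair, including spurious ones, which is why the captions describe a weighted variant that scores each pair.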
Diagram of protein complex discovery workflow. Three protein interaction networks, BioPlex, Hein et al, and Wan et al, were combined into an integrated protein complex network and clustered to identify protein complexes. Parameters for the SVM and clustering algorithms were optimized on a training set of literature‐curated complexes and validated on a test set of complexes.
Graphical depiction of six AP‐MS experiments highlighting the purification of two mutually exclusive complexes A‐B‐C‐D and A‐E. Experiments 1, 2, and 3 co‐purify interactions with bait proteins A, D, and E, respectively. Experiments 4, 5, and 6 show non‐specific interactions for the complexes or sub‐complexes. For the purposes of this example, proteins A, B, C, D, and E are only observed in these six experiments out of a set of 50 experiments (arbitrarily defined).
Presence–absence matrix for the six AP‐MS experiments and five proteins described in (A). This matrix represents what the experimenter observes after mass spectrometry analysis.
Calculation of weighted matrix model score for protein pairs in highlighted complexes. The True Interaction column represents whether the pair of proteins is co‐complex or not. The Spoke Model column represents the predictions made by the traditional spoke model. Note the spoke model's false‐negative prediction of interaction B&C. The Number of co‐purifications column represents the number of experiments in which the pair of proteins is co‐purified. The Weighted Matrix Model column represents the −1*log(P‐val) of the hypergeometric test given the experimental overlap value, each protein's total number of observed experiments, and the total number of experiments (not depicted), arbitrarily defined as 50. The panel also shows likely clusters of co‐complex interactions using three levels of confidence: high, medium, and low. Note that the high‐ and medium‐confidence networks do not show the false‐positive interactions D&E, C&E, or B&E but do capture the true‐positive prey–prey interaction B&C.
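The weighted matrix model score described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the presence–absence matrix below is hypothetical (the figure's actual matrix is not reproduced in this text), the test is taken to be a one‐sided upper‐tail hypergeometric test, and the logarithm is taken to be natural. Only the total of 50 experiments comes from the caption.

```python
from math import comb, log

TOTAL_EXPERIMENTS = 50  # from the caption; the other 44 experiments contain none of A-E

# Hypothetical presence-absence data: experiment number -> proteins observed.
EXPERIMENTS = {
    1: {"A", "B", "C", "D", "E"},  # bait A
    2: {"A", "B", "C", "D"},       # bait D
    3: {"A", "E"},                 # bait E
    4: {"B", "C", "D"},
    5: {"B", "C"},
    6: {"A", "E"},
}


def hypergeom_upper_tail(k, total, n_a, n_b):
    """P(X >= k) when n_b experiments are drawn from `total`, n_a of which contain protein a."""
    upper = min(n_a, n_b)
    hits = sum(comb(n_a, i) * comb(total - n_a, n_b - i) for i in range(k, upper + 1))
    return hits / comb(total, n_b)


def weighted_matrix_score(a, b, experiments=EXPERIMENTS, total=TOTAL_EXPERIMENTS):
    """-log(P-value) of observing at least the seen number of co-purifications."""
    with_a = {e for e, proteins in experiments.items() if a in proteins}
    with_b = {e for e, proteins in experiments.items() if b in proteins}
    overlap = len(with_a & with_b)
    p_value = hypergeom_upper_tail(overlap, total, len(with_a), len(with_b))
    return -log(p_value)


score_bc = weighted_matrix_score("B", "C")  # prey-prey pair, co-purified 4 times
score_ae = weighted_matrix_score("A", "E")  # co-purified 3 times
score_de = weighted_matrix_score("D", "E")  # co-purified once
```

With this hypothetical matrix, the prey–prey pair B&C scores higher than D&E because four co‐purifications out of four observations each are far less likely by chance than a single shared experiment.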
Precision–recall curves calculated on a leave‐out set of protein interactions from literature‐curated complexes for different combinations of predictive protein interaction features. The integration of all three datasets outperforms all other networks. Also, note a substantial improvement in performance when the weighted matrix model features are used (no MatrixModel, blue vs. integrated, orange).
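As a rough illustration of how a precision–recall curve is traced from scored protein pairs, the sketch below ranks pairs by score and records precision and recall at each rank. It assumes that every scored pair absent from the positive set counts as a negative, which may differ from the labeling used in the paper. The pair names and scores are hypothetical.

```python
def precision_recall_points(scored_pairs, positives):
    """Return (precision, recall) after each pair, ranked by descending score."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    true_positives = 0
    points = []
    for rank, (pair, _score) in enumerate(ranked, start=1):
        if pair in positives:
            true_positives += 1
        points.append((true_positives / rank, true_positives / len(positives)))
    return points


# Hypothetical scored pairs and leave-out positives.
scored = [("A-B", 0.9), ("C-D", 0.8), ("E-F", 0.7)]
gold = {"A-B", "E-F"}
curve = precision_recall_points(scored, gold)
```

A feature set that pushes true co‐complex pairs toward the top of the ranking keeps precision high at a given recall, which is what the comparison of curves in this panel measures.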
Performance of parameter optimization for MCL and Newman two‐stage clustering procedures. Each data point represents a set of parameters and is evaluated based on the resulting clusters' similarity to both training and test sets of complexes using the F‐Grand measure (see Materials and Methods). Final parameter sets were selected based only on the F‐Grand measure for the training set.
Precision–recall curves evaluating protein interactions on leave‐out set before (integrated) and after (hu.MAP) clustering procedure. Note that the improvement in performance after clustering suggests the clustering procedure successfully removed false‐positive interactions.
Distribution of protein interactions in the final protein interaction network based on input evidence. Note that the weighted matrix model interactions produce many high‐confidence interactions. Also, the “Multiple” category shows predominantly high‐confidence interactions, which supports the view that integrating multiple datasets mitigates false positives.
Protein interactions from our complex map substantially overlap with other protein interaction datasets across a variety of experimental types.
Precision–recall curves calculated using leave‐out set of co‐complex interactions to evaluate networks trained on BioPlex only and BioPlex + weighted matrix model features. Note improvement of performance when weighted matrix model features are included.
Precision–recall curves calculated using leave‐out set of co‐complex interactions to evaluate networks trained on Hein only and Hein + weighted matrix model features. Note improvement of performance when weighted matrix model features are included.
Precision–recall curves calculated using leave‐out set of co‐complex interactions to evaluate networks trained on all features (integrated, orange), all features except HumanNet features SC‐LC, SC‐CC, CE‐LC, and CE‐CC (dashed blue), and all features except HumanNet (green). Note negligible performance loss when HumanNet features are excluded.
Comparison of hu.MAP and published complex maps to leave‐out set of complexes using precision–recall product measure (Song & Singh, 2009).
Comparison of hu.MAP and published complex maps to leave‐out set of complexes using F‐weighted k‐clique score.
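The idea behind a k‐clique comparison of two sets of complexes can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: it assumes that for each clique size k, precision and recall are computed over all k‐member subsets of the predicted and gold‐standard complexes, that predictions are first restricted to gold‐standard proteins, and that the per‐size F‐measures are averaged without weighting. The weighting used for the published F‐Grand and F‐weighted scores is described in Materials and Methods and is not reproduced here.

```python
from itertools import combinations


def k_cliques(clusters, k):
    """All k-member subsets of any cluster, as a set of frozensets."""
    return {frozenset(c) for cluster in clusters for c in combinations(sorted(cluster), k)}


def clique_f_measures(predicted, gold):
    """F-measure per clique size k (assumed unweighted; see lead-in)."""
    gold_proteins = set().union(*gold)
    # Assumption: proteins absent from the gold standard cannot be evaluated.
    predicted = [set(cluster) & gold_proteins for cluster in predicted]
    max_k = max(len(cluster) for cluster in list(predicted) + list(gold))
    scores = {}
    for k in range(2, max_k + 1):
        predicted_k = k_cliques(predicted, k)
        gold_k = k_cliques(gold, k)
        if not predicted_k or not gold_k:
            continue
        shared = len(predicted_k & gold_k)
        precision = shared / len(predicted_k)
        recall = shared / len(gold_k)
        scores[k] = 0.0 if shared == 0 else 2 * precision * recall / (precision + recall)
    return scores


# Hypothetical example: a predicted cluster that misses one gold-standard subunit.
f_by_k = clique_f_measures([{"A", "B", "C"}], [{"A", "B", "C", "D"}])
mean_f = sum(f_by_k.values()) / len(f_by_k)
```

Because the number of cliques grows combinatorially with complex size, this exhaustive enumeration suits only small examples.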
Distribution of number of subunits for complexes in hu.MAP.
Presence/absence matrix of BioPlex AP‐MS experiments as rows and pulled down proteins as columns for four complexes identified in our complex map. The Exosome, eIF3 Complex, and 19S Proteasome all have multiple bait–bait interactions whereas the novel synaptic bouton complex does not have bait–bait interactions but does have substantial density in the non‐bait region of the matrix. This density is identified by the weighted matrix model and highlights the model's ability to discover protein complexes.
RNA expression profiles of proteins in the synaptic bouton complex across different tissues sampled by the Human Protein Atlas. This shows the complex is highly specific for cerebral cortex tissue. No fewer than six replicates were used for each tissue type. Boxes indicate median (inner band), first quartile (bottom) and third (top) quartile. Whiskers indicate 1.5× the interquartile range. Dots indicate outliers.
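The box plot conventions named in this caption can be expressed as a short calculation. The sketch below assumes the common Tukey convention in which whiskers reach the most extreme data points within 1.5 interquartile ranges of the box and anything beyond is drawn as an outlier. The quartile interpolation method and the sample values are assumptions for illustration.

```python
from statistics import quantiles


def tukey_box_stats(values):
    """Median, quartiles, whisker ends, and outliers for a Tukey box plot."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical expression values with one extreme observation.
stats = tukey_box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```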
Correlation coefficient distributions of Allen Brain Map tissue expression profiles between synaptic bouton complex proteins and random set of proteins. This shows coherence of expression among proteins in the complex suggesting a functional module.
Significantly enriched Gene Ontology annotations for proteins in the synaptic bouton complex show enrichment for neuron development and synaptic transmission.
Complex map coverage of Human Protein Atlas RNA tissue specificity classifications, showing that the majority of complexes are ubiquitously expressed and likely core cellular machinery.
Fraction of complexes with significantly enriched annotation terms (g:Profiler hypergeometric test with FDR (Benjamini–Hochberg) correction on each complex and further corrected at an FDR of 5% given a set of shuffled complexes; see Materials and Methods) from various ontologies.
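The Benjamini–Hochberg correction mentioned in this caption can be sketched as below. This shows only the standard step‐up adjustment of a list of P‐values; the second correction against shuffled complexes is described in Materials and Methods and is not reproduced here. The function name and input values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical enrichment P-values for four annotation terms.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adjusted]
```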
Protein coverage of high‐level Disease Ontology terms and cilia‐related annotations for complex map as well as three published maps (Wan et al, BioPlex, Boldt et al) and three published interaction networks (Hein et al, Gupta et al and Boldt et al).
Cystic kidney phenotype represented by polycystic kidneys from patient with OFD1 variant, adapted from Chetty‐John et al (2010).
Digit malformations represented by polydactyly of Bardet–Biedl syndrome patient with LZTFL1 (BBS17) variant, adapted from Schaefer et al (2014).
Short‐rib phenotype represented by chest narrowing of Jeune asphyxiating thoracic dystrophy individual with IFT80 variant, adapted from Beales et al (2007).
Maculopathy represented by retinitis pigmentosa of Senior–Loken syndrome patient with mutation in WDR19 (Coussa et al, 2013).
Network of ciliopathy complex and closely interacting centrosomal complex. Edge weights represent SVM confidence scores where gray are intracomplex edges and purple are inter‐complex edges. Color of nodes follows Fig 5 conventions.
Matrix of AP‐MS evidence supporting both complexes. The matrix shows strong support for interactions within each complex. Bait proteins that are members of either complex are labeled on the left.
Experimental validation of ciliary proteins using multi‐ciliated epithelial cells in Xenopus laevis. Localization assays for the three uncharacterized proteins in the OFD1 complex confirm that all three proteins localize to basal bodies at the base of the cilia in a manner similar to known components of the complex. Scale bars: 1 μm. Each image is representative of nine cells from three different embryos.
Network view of IFT‐A and IFT‐B complexes. Node colors follow Fig 5 conventions.
Matrix of AP‐MS experiments shows IFT‐A and IFT‐B are well separated and supported by multiple experiments.
Network view of two IFT sub‐complexes associated with ANKRD55.
Matrix of AP‐MS experiments shows strong support for ANKRD55 association with known IFT proteins.
ANKRD55 localizes to cilia as predicted from co‐complex interactions, as assayed in vivo in multi‐ciliated Xenopus laevis epithelial cells. Scale bar: 10 μm. Each image is representative of 18 cells from six different embryos. Kymograph of ANKRD55 localized to cilia in vivo reveals rapid trafficking along the length of the cilia (representative out of 36 multi‐ciliated cells).
Morpholino knockdown of ANKRD55 results in reduced count and length of cilia, in a manner similar to the control IFT52 knockdown, supporting a role in ciliogenesis for ANKRD55. Scale bar: 10 μm. Each image is representative of 18 cells from six different embryos.
Dorsal view of stage 19 X. laevis embryos shows that ANKRD55 knockdown causes neural tube closure defects that are rescued by wild‐type ANKRD55 mRNA. The Tukey box plot displays average distance between neural folds in control (n = 32), morphant (n = 22), and rescue (n = 24) embryos. ***P < 0.0001, two‐sample Kolmogorov–Smirnov test. Boxes indicate median (inner band), first quartile (bottom) and third (top) quartile. Whiskers indicate 1.5× the interquartile range. Dots indicate outliers.
Two‐color kymograph generated by co‐expression of ANKRD55‐GFP (green) and mCherry‐CLUAP1 (magenta) reveals that ANKRD55 travels along axonemes in association with other IFT proteins. Scale bar: 10 μm. Kymograph is representative out of 22 multi‐ciliated cells.
RT–PCR demonstrates the efficiency of ANKRD55 MO to disrupt splicing of ANKRD55 mRNA in Xenopus embryos. GAPDH is used as a control.
Morpholino knockdown of JBTS17, known to specifically affect IFT‐B localization, results in the accumulation of ANKRD55‐GFP in axonemes (green: ANKRD55‐GFP, magenta: membrane RFP). Scale bar: 10 μm. Each image is representative of 18 cells from six different embryos.
