Mol Syst Biol. 2021 May;17(5):e10016.
doi: 10.15252/msb.202010016.

hu.MAP 2.0: integration of over 15,000 proteomic experiments builds a global compendium of human multiprotein assemblies

Kevin Drew et al. Mol Syst Biol. 2021 May.

Abstract

A general principle of biology is the self-assembly of proteins into functional complexes. Characterizing their composition is, therefore, required for our understanding of cellular functions. Unfortunately, we lack knowledge of the comprehensive set of identities of protein complexes in human cells. To address this gap, we developed a machine learning framework to identify protein complexes in over 15,000 mass spectrometry experiments which resulted in the identification of nearly 7,000 physical assemblies. We show our resource, hu.MAP 2.0, is more accurate and comprehensive than previous state of the art high-throughput protein complex resources and gives rise to many new hypotheses, including for 274 completely uncharacterized proteins. Further, we identify 253 promiscuous proteins that participate in multiple complexes pointing to possible moonlighting roles. We have made hu.MAP 2.0 easily searchable in a web interface (http://humap2.proteincomplexes.org/), which will be a valuable resource for researchers across a broad range of interests including systems biology, structural biology, and molecular explanations of disease.

Keywords: data integration; human protein complexes; mass spectrometry; moonlighting proteins.

Conflict of interest statement

The authors declare that they have no conflict of interest.

Figures

Figure 1. Machine learning framework to identify protein complexes
  1. Graphical description of computational pipeline to integrate > 15,000 mass spectrometry experiments. Number of experiments used is listed next to each technique (see also Table 1). A Support Vector Machine (SVM) classifier was trained using numerical measures (i.e., features) on pairs of proteins calculated from original mass spectrometry data and training labels from literature‐curated complexes (CORUM). The classifier was then used to construct a protein interaction network by calculating a confidence score for all pairs of proteins for their propensity to interact. Clustering parameters were then learned from training complexes, and five final sets of clusters were chosen ranked in order of confidence from “Extremely High” to “Medium”. The union of these selected clusterings represents the final set of hu.MAP 2.0 complexes. Networks of previously known protein complexes identified by this pipeline which were not in the training set of complexes are shown as positive control examples.

  2. “UpSet” plot (Lex et al, 2014) displaying the intersections of protein pairs for all integrated datasets. Each set of connected black dots represents the intersection of the respective datasets. Vertical bar plot displays protein pair count of intersection. Light gray dots are datasets not included in the intersection. Single unconnected black dots represent protein pairs that are only present in a single dataset. Horizontal bar plot represents total protein pair count in each dataset. The plot shows the Weighted Matrix Model (single black WMM dot) provides additional information for many pairs of proteins (> 2.2 × 10^6) that would be limited otherwise.
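The intersection counts that an UpSet plot displays can be sketched as follows. This is a minimal illustration, not the authors' code: the dataset names and protein pairs are toy values, and the function name `exclusive_intersections` is hypothetical. Each protein pair is assigned to the exact combination of datasets that contain it, so a pair found only in one dataset corresponds to a single unconnected dot.

```python
from collections import Counter


def exclusive_intersections(datasets):
    """Count protein pairs by the exact combination of datasets containing them."""
    membership = {}
    for name, pairs in datasets.items():
        for pair in pairs:
            key = frozenset(pair)  # treat (A, B) and (B, A) as the same pair
            membership.setdefault(key, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())


# Toy input (hypothetical): dataset name -> set of protein pairs
datasets = {
    "AP-MS": {("A", "B"), ("B", "C")},
    "CF-MS": {("B", "A"), ("C", "D")},
    "WMM": {("A", "B"), ("E", "F"), ("F", "G")},
}

counts = exclusive_intersections(datasets)
for combo, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(sorted(combo), n)
```

In this toy input, two pairs appear only under the WMM key, which mirrors the kind of dataset-specific contribution the caption attributes to the Weighted Matrix Model.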

Figure 2. hu.MAP 2.0 outperforms previous complex maps
  1. Precision‐Recall (PR) plot evaluated on a test (leave‐out) set of literature‐curated co‐complex pairwise protein interactions. The plot shows hu.MAP 2.0 is more accurate and comprehensive than previous published datasets. The plot also evaluates the performance of predictions without the Weighted Matrix Model (WMM) and shows the WMM substantially improves performance.

  2. Clustering workflow used to identify protein complexes in hu.MAP 2.0 protein interaction network. The network is first filtered based on the confidence score produced by the Support Vector Machine (SVM). The filtered network is then clustered using a two‐stage approach, clustering first using ClusterOne, and then further clustering with MCL. The resulting clusters are then evaluated using the k‐clique method (see Materials and Methods) on training complexes. Approximately 1,700 parameter combinations were evaluated, each producing a unique set of clusters, sweeping SVM score filter thresholds, and clustering parameters (i.e., ClusterOne Max Overlap, ClusterOne Density, and MCL Inflation).

  3. k‐clique Precision‐Recall (kPR) scatter plot of 1,700 clustering parameter sets. Five clusterings (colored hollow circles) were selected representing varying degrees of confidence balancing the trade‐off between precision and recall. The five selected clusterings were combined as a final set of clusters (orange filled circle).

  4. kPR scatter plot of hu.MAP 2.0 complexes (orange filled circle) and other published complex maps (colored filled circles) evaluated on a test set of literature‐curated complexes. hu.MAP 2.0 complexes increase in both precision and recall relative to other maps. Also plotted are the five sets of complexes at different levels of confidence (colored hollow circles) demonstrating consistency between the level of confidence determined from training set (B) and test set.
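A pairwise precision-recall evaluation of the kind described for the first panel can be sketched as below. This is a simplified illustration under stated assumptions: the function name `precision_recall`, the toy scores, and the convention of skipping pairs absent from the labeled test set are all hypothetical choices, and tied scores are not handled specially. It does not reproduce the k-clique measure used for the clustering panels.

```python
def precision_recall(scored_pairs, positives, negatives):
    """Return (score, precision, recall) tuples, walking down the ranked list."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    tp = fp = 0
    curve = []
    for pair, score in ranked:
        key = frozenset(pair)
        if key in positives:
            tp += 1
        elif key in negatives:
            fp += 1
        else:
            continue  # pair is not in the labeled test set; ignore it
        curve.append((score, tp / (tp + fp), tp / len(positives)))
    return curve


# Toy test set (hypothetical): co-complex pairs and non-interacting pairs
positives = {frozenset(("A", "B")), frozenset(("C", "D"))}
negatives = {frozenset(("A", "C"))}
scored = [(("A", "B"), 0.9), (("A", "C"), 0.7), (("C", "D"), 0.4), (("X", "Y"), 0.8)]

curve = precision_recall(scored, positives, negatives)
for score, precision, recall in curve:
    print(f"score>={score:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Comparing two maps on such a curve amounts to asking which one keeps precision higher at the same recall.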

Figure EV1. Density plot of Human Protein Atlas (Uhlén et al, 2015) transcript expression across tissues of promiscuous proteins versus non‐promiscuous proteins
We observe negligible differences between promiscuous and non‐promiscuous distributions suggesting expression levels are not a factor contributing to the identification of promiscuous proteins.
Figure 3. hu.MAP 2.0 complexes identify promiscuous proteins
  1. A, B: Multifunctional protein HSPA9 participates in two distinct complexes, HuMAP2_00358 (A) and HuMAP2_01130 (B). HuMAP2_00358 (turquoise, “high” confidence) is enriched for Reactome annotation “Mitochondrial protein import”, a known function of HSPA9. HuMAP2_01130 (blue, “very high” confidence) is enriched for Reactome annotation “Regulation of HSF1‐mediated heat shock response”, another known function of HSPA9. Weight of network edges represents confidence of interactions.

  2. C: Sparkline elution profiles from two orthogonal biochemical fractionation experiments. HEK 293 cell lysate was separated using a mixed bed ion‐exchange column and Drosophila melanogaster embryo lysate was separated using a heparin column (Wan et al, 2015). HSPA9 elutes in two distinct peaks (shaded) which co‐elute with members of the two complexes. X‐axis represents fraction collected along biochemical separation. Y‐axis for each row represents observed protein abundance.

  3. D: Promiscuous proteins are older on average than single complex proteins. Z‐scores for each protein age group were determined by comparing the number of promiscuous proteins to a randomly sampled background set consisting of non‐promiscuous proteins (i.e., participating in only one complex).
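The Z-score comparison against a randomly sampled background can be sketched as follows. The exact sampling procedure is not given here, so this is one plausible reading: draw background samples of the same size as the promiscuous set, count members of the age group in each draw, and standardize the observed count. The function name, the number of samples, and the toy proteins and age labels are all hypothetical.

```python
import random
import statistics


def age_group_zscore(promiscuous, background, age_of, group, n_samples=1000, seed=0):
    """Z-score of the observed count in `group` versus size-matched random draws."""
    rng = random.Random(seed)
    observed = sum(1 for p in promiscuous if age_of[p] == group)
    sampled = []
    for _ in range(n_samples):
        draw = rng.sample(background, len(promiscuous))
        sampled.append(sum(1 for p in draw if age_of[p] == group))
    mean = statistics.mean(sampled)
    sd = statistics.pstdev(sampled)
    return (observed - mean) / sd if sd > 0 else float("nan")


# Toy data (hypothetical): four promiscuous proteins, all in the "old" group,
# and twenty single-complex proteins split evenly between "old" and "young".
promiscuous = ["P1", "P2", "P3", "P4"]
background = [f"B{i}" for i in range(20)]
age_of = {p: "old" for p in promiscuous}
age_of.update({b: ("old" if i < 10 else "young") for i, b in enumerate(background)})

z_old = age_group_zscore(promiscuous, background, age_of, "old")
z_young = age_group_zscore(promiscuous, background, age_of, "young")
print(f"old: {z_old:.2f}, young: {z_young:.2f}")
```

A positive Z-score for an age group means promiscuous proteins are over-represented in that group relative to the single-complex background.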

Figure EV2. Annotation enrichment of older promiscuous proteins
gProfiler output shows older promiscuous proteins are enriched for metabolic processing annotations.
Figure EV3. hu.MAP 2.0 complexes are functionally enriched
The bar chart shows the number of identified complexes that are enriched with at least one annotation from GO, Reactome, CORUM, KEGG, or Human Phenotype Ontology (HP) at an FDR threshold of 0.05.
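Counting complexes with at least one annotation below an FDR threshold can be sketched as below. The caption does not say which multiple-testing correction was used; the Benjamini-Hochberg procedure here is an assumption chosen because it is a common way to control FDR, and applying it per complex is a further simplification. The function names and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_enriched(complex_pvalues, fdr=0.05):
    """Count complexes having at least one annotation at or below the FDR threshold."""
    return sum(
        1
        for pvals in complex_pvalues.values()
        if pvals and min(benjamini_hochberg(pvals)) <= fdr
    )


# Toy enrichment p-values (hypothetical), one list of annotation tests per complex
toy = {
    "complex_1": [0.01, 0.04, 0.03, 0.20],
    "complex_2": [0.30, 0.50],
    "complex_3": [0.001],
}
adjusted = benjamini_hochberg(toy["complex_1"])
n_enriched = count_enriched(toy)
print(adjusted, n_enriched)
```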
Figure 4. Transfer of function annotations to uncharacterized proteins
  1. SETD3 and CMTR1 are identified as co‐complex interactors with the Ribonuclease H2 complex which provides a possible mechanistic explanation for their role in viral infection.

  2. Sparkline elution profiles from multiple orthogonal co‐fractionation experiments demonstrate a strong degree of co‐elution among subunits in the SETD3‐CMTR1‐RNAse H2 complex. Weight of network edges represents confidence of interactions. X‐axis represents fraction collected along biochemical separation. Y‐axis for each row represents observed protein abundance.

  3. The uncharacterized protein, C7orf26, is identified as part of the Integrator complex.

  4. Sparkline elution profiles show a high degree of correlation between C7orf26 and subunits of the Integrator complex from multiple orthogonal co‐fractionation experiments.

  5. The association of C7orf26 and the Integrator complex is additionally supported by an affinity purification mass spectrometry (AP‐MS) experiment in which C7orf26 is pulled down with Integrator subunit baits.

  6. The uncharacterized protein, CCDC9, is identified as co‐complex with the exon–exon junction complex (EJC), a ribonucleoprotein complex involved in splicing.

  7. Sparkline elution profiles from the independently collected RNA DIF‐FRAC size exclusion chromatography (SEC) experiment show CCDC9 co‐elutes with known subunits of the EJC when RNA is present (black). The elution profiles also show CCDC9 is sensitive to RNAse A treatment (shift of elution peak between black and red profiles) as are the subunits of the EJC further supporting CCDC9's participation in this known ribonucleoprotein complex.
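Co-elution between two proteins is often summarized as a correlation between their abundance profiles across fractions. The sketch below uses the Pearson coefficient as one such measure; whether this particular measure underlies the panels above is not stated, and the profile values and protein labels are toy, hypothetical numbers.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length elution profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-elution signal
    return cov / math.sqrt(var_x * var_y)


# Toy abundance per fraction (hypothetical values)
subunit_1 = [0, 1, 5, 9, 4, 1, 0, 0]
subunit_2 = [0, 2, 10, 18, 8, 2, 0, 0]  # same peak, higher abundance
unrelated = [6, 3, 1, 0, 0, 1, 4, 8]    # peaks in different fractions

r_same = pearson(subunit_1, subunit_2)
r_diff = pearson(subunit_1, unrelated)
print(f"co-eluting: {r_same:.2f}, unrelated: {r_diff:.2f}")
```

Because the coefficient is insensitive to scale, two subunits with different absolute abundance but the same peak position still score close to 1.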

Figure 5. Protein complex map coverage across Human Protein Atlas tissues and cell specificity
  1. Coverage of all human proteins shows a broad distribution of proteins classified into a range of specificity classes, from detected in all tissues and cells to detected in only a single tissue or cell type.

  2. Coverage of hu.MAP 1.0 proteins shows a narrower distribution of proteins classified into specificity classes, with the majority of proteins detected in many or all tissues and cell types. This suggests hu.MAP 1.0 represented the core cellular machinery.

  3. Coverage of hu.MAP 2.0 proteins shows a distribution representative of the core cellular machinery shared among all or many tissue and cell types, but also shows an increase in cell type specificity, with gains in proteins that are only detected in some tissues/cell types.

Figure EV4. SVM confidence score versus test set precision
The line plot shows the relationship between the SVM confidence score and the empirical precision value calculated from the test set of protein interactions. The relationship shows the precision value is consistently higher than the confidence score.
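A comparison of confidence score against empirical precision can be sketched by binning scored test pairs and computing the fraction of true interactions in each bin. This is an illustrative sketch only: it assumes scores lie in [0, 1], uses equal-width bins, and the function name and toy data are hypothetical.

```python
def precision_by_score_bin(scored_labels, n_bins=5):
    """Return (mean score, empirical precision) for each non-empty score bin."""
    bins = [[0, 0, 0.0] for _ in range(n_bins)]  # count, true interactions, score sum
    for score, is_true in scored_labels:
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        bins[idx][0] += 1
        bins[idx][1] += 1 if is_true else 0
        bins[idx][2] += score
    return [
        (score_sum / count, true_count / count)
        for count, true_count, score_sum in bins
        if count
    ]


# Toy test-set pairs (hypothetical): (confidence score, is a true interaction)
toy = [
    (0.10, False), (0.15, True),
    (0.50, True), (0.55, False), (0.58, True),
    (0.90, True), (0.95, True),
]

rows = precision_by_score_bin(toy)
for mean_score, precision in rows:
    print(f"mean score={mean_score:.3f} precision={precision:.3f}")
```

In this toy input every bin has precision above its mean score, the same direction of difference that the caption reports.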
