Cell Syst. 2018 Oct 24;7(4):412-421.e5.
doi: 10.1016/j.cels.2018.08.004. Epub 2018 Aug 29.

Assembling the Community-Scale Discoverable Human Proteome

Mingxun Wang et al. Cell Syst. 2018.

Abstract

The increasing throughput and sharing of proteomics mass spectrometry data have now yielded over one-third of a million public mass spectrometry runs. However, these discoveries are not continuously aggregated in an open and error-controlled manner, which limits their utility. To facilitate the reusability of these data, we built the MassIVE Knowledge Base (MassIVE-KB), a community-wide, continuously updating knowledge base that aggregates proteomics mass spectrometry discoveries into an open reusable format with full provenance information for community scrutiny. Reusing >31 TB of public human data stored in a mass spectrometry interactive virtual environment (MassIVE), the MassIVE-KB contains >2.1 million precursors from 19,610 proteins (48% larger than before; 97% of the total) and doubles proteome coverage to 6 million amino acids (54% of the proteome) with strict library-scale false discovery controls, thereby providing evidence for 430 proteins for which sufficient protein-level evidence was previously missing. Furthermore, MassIVE-KB can inform experimental design, helps identify and quantify new data, and provides tools for community construction of specialized spectral libraries.

Keywords: algorithms; big data; knowledge base; proteomics; repositories; spectral libraries; tandem mass spectrometry.

Conflict of interest statement

DECLARATION OF INTERESTS

N.B. was a co-founder, had an equity interest, and received income from Digital Proteomics, LLC through 2017. The terms of this arrangement have been reviewed and approved by the University of California, San Diego, in accordance with its conflict of interest policies. Digital Proteomics was not involved in the research presented here.

Figures

Figure 1. Overview of the MassIVE-KB
Representation of the interactions between the proteomics community, public datasets, MassIVE Search, and MassIVE-KB derived from over 31.4 TB of MS data with 658 million MS/MS spectra yielding 191 million PSMs aggregated into a spectral library of 2.1 million precursors; new datasets will be incrementally added to MassIVE-KB upon deposition in ProteomeXchange repositories. This platform promotes reutilization of proteomics MS big data in three major ways: (1) increasing identifications in new liquid chromatography-tandem mass spectrometry (LC-MS/MS) data by spectral library search against a community-scale reference collection, (2) validating extraordinary results against reference library spectra and all identified spectra found in all public data, and (3) supporting experimental design with reference spectra and detailed records of community-wide occurrences (e.g., designing targeted proteomics experiments). Furthermore, all entries in MassIVE-KB are supported by open provenance records that can be scrutinized by the community to inspect the series of events that led to specific identifications and reference spectra.
Figure 2. Step-by-Step Overview of the Spectral Library Construction Procedure
(A) Datasets of varying complexity are first searched with workflows designed to appropriately capture the effective peptide search space defined by experimental metadata; false discovery rates (FDRs) are controlled per search, and PSMs are aggregated together and filtered by spectrum quality. (B) Precursor-level FDR is applied to filter candidate spectra for inclusion in the library. (C) The most similar spectrum to all other replicates is chosen as each precursor’s representative spectrum, and (D) ambiguously identified spectra mapping to more than one peptide sequence are removed. (E) The MassIVE-KB spectral library can be readily applied to new searches at the Center for Computational Mass Spectrometry’s online workflow engine for data-dependent acquisition (DDA) library searching (MSPLIT), DIA peptide identification (MSPLIT-DIA), and spectral networks analysis (Maestro). In addition, MassIVE-KB can also be consulted online and is freely available for download in formats compatible with popular third-party tools such as the Trans-Proteomics Pipeline (Deutsch et al., 2010), SpectraST (Lam et al., 2007), and Skyline (MacLean et al., 2010).
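Step (C) of the caption can be illustrated with a minimal sketch. It assumes spectra are lists of (m/z, intensity) peaks, that similarity is a cosine over fixed-width m/z bins, and that "most similar to all other replicates" means the highest mean cosine. The bin width, the function names, and the scoring rule are assumptions made for illustration; the caption does not specify how MassIVE-KB computes similarity.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin_width is an assumed value)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return dict(binned)


def cosine(a, b):
    """Cosine similarity between two binned spectra stored as {bin: intensity}."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_representative(replicates, bin_width=1.0):
    """Return the index of the replicate with the highest mean cosine to all others."""
    if not replicates:
        raise ValueError("at least one replicate spectrum is required")
    if len(replicates) == 1:
        return 0
    binned = [bin_spectrum(peaks, bin_width) for peaks in replicates]
    best_index, best_score = 0, -1.0
    for i, spectrum in enumerate(binned):
        scores = [cosine(spectrum, other) for j, other in enumerate(binned) if j != i]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:  # ties keep the earliest replicate
            best_index, best_score = i, mean_score
    return best_index
```

With three toy replicates of one precursor, `pick_representative([[(100, 10), (200, 5)], [(100, 10), (200, 6)], [(100, 10), (200, 10)]])` selects the middle spectrum, which sits closest to both of the others.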
Figure 3. MassIVE-KB Coverage of the Human Proteome Reveals Hundreds of Novel Proteins
(A) The MassIVE-KB HCD spectral library substantially expands over NIST’s HCD library (the largest comparable library) by including spectra from 4.3× more precursors from 3.4× more peptide sequences covering over twice as many amino acids in the human proteome. Furthermore, unlike other available spectral library resources, MassIVE-KB explicitly controls library-level FDR at the precursor and protein levels and provides comprehensive provenance information for every single library spectrum. (B) MassIVE-KB enabled the detection of 430 novel proteins (neXtProt PE2-5 broken down in pie chart insert, PE legend in panel C) with at least two nonoverlapping sequences (stricter requirements than the standard HUPO identification guidelines [Deutsch et al., 2016]); the majority of these proteins were already observed with transcriptomics evidence (PE2, 380 proteins). All MassIVE-KB peptides uniquely mapped to novel proteins matched ProteomeTools synthetic peptides with highly correlated fragmentation patterns (0.93 median cosine) for all sequences included in their set. (C) ProteomeTools synthetic peptides fully confirmed 291 MassIVE-KB novel proteins (162 proteins with 2+ sequences). (D) Even with the marked MassIVE-KB gain in coverage of the human proteome, there remain 4.1 million “coverable” amino acids in fully tryptic sequences with ≤40 amino acids; 3.4 million of these uncovered amino acids are in proteins that have been observed in non-synthetic data (i.e., missing coverage on known proteins), while 695 thousand amino acids remain uncovered in proteins that have not been observed with unique peptides from non-synthetic data.
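The "at least two nonoverlapping sequences" requirement in (B) can be illustrated with a short sketch. It assumes exact substring matching of peptides against a single protein sequence and treats two peptides as nonoverlapping when some pair of their occurrences shares no residue positions. The function names are hypothetical, and the sketch does not cover the uniqueness or length criteria of the HUPO guidelines or isoleucine/leucine ambiguity.

```python
def find_spans(protein, peptide):
    """Return every (start, end) half-open span where peptide occurs in protein."""
    spans = []
    if not peptide:
        return spans
    start = protein.find(peptide)
    while start != -1:
        spans.append((start, start + len(peptide)))
        start = protein.find(peptide, start + 1)
    return spans


def has_two_nonoverlapping(protein, peptides):
    """True if two distinct peptides map to disjoint regions of the protein."""
    located = []
    for peptide in sorted(set(peptides)):
        spans = find_spans(protein, peptide)
        if spans:
            located.append(spans)
    for i in range(len(located)):
        for j in range(i + 1, len(located)):
            for a_start, a_end in located[i]:
                for b_start, b_end in located[j]:
                    # Disjoint when one span ends before the other begins.
                    if a_end <= b_start or b_end <= a_start:
                        return True
    return False
```

For the toy sequence `"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"`, the peptides `"TAYIAK"` and `"QISFVK"` satisfy the check, while `"TAYIAK"` and `"AYIAKQR"` do not because they share residues.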
Figure 4. Extensive Coverage of the Human Proteome Capitalizes on Diverse Contributions from Multiple Sources as Various Datasets in the Community Contribute their Own Specific Insights into the Human Proteome
The left axis shows the number of new proteins contributed by each dataset on the bottom axis (yellow bar if identified with two non-overlapping peptides, and blue if identified by only one peptide); the right axis shows the cumulative number of protein observations (starting at 15,000 for legibility; see Table S2 for details); the x axis represents all datasets included in MassIVE-KB sorted by the total proteins covered in decreasing order. Three foundational datasets (i.e., Bioplex [Huttlin et al., 2015] and the two draft proteomes [Kim et al., 2014; Wilhelm et al., 2014]) provide deep coverage of commonly observed proteins, but several smaller datasets were key in contributing unique proteins, such as the ProteomeXchange:PXD004927 small ubiquitin-like modifier (SUMO) proteome (Hendriks et al., 2017) and ProteomeXchange:PXD003947 spermatozoa (Vandenbrouck et al., 2016) proteins. Altogether, the union of community datasets (excluding synthetic peptides) supported the observation of 16,801 human proteins (83% of the human proteome).
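The two series plotted in this figure (new proteins per dataset and the running total) can be computed along the following lines. This is a sketch under the assumption that each dataset is reduced to a set of protein accessions; the dataset names and accessions in the usage example are invented for illustration, and the split between one-peptide and two-peptide identifications is omitted.

```python
def cumulative_contributions(dataset_proteins):
    """Order datasets by total proteins (descending) and report what each one adds.

    dataset_proteins maps a dataset name to an iterable of protein accessions.
    Returns a list of (dataset, new_protein_count, cumulative_protein_count).
    """
    as_sets = {name: set(proteins) for name, proteins in dataset_proteins.items()}
    # Sort by size, largest first; break ties by name so the order is stable.
    ordered = sorted(as_sets.items(), key=lambda item: (-len(item[1]), item[0]))
    seen = set()
    rows = []
    for name, proteins in ordered:
        new_proteins = proteins - seen
        seen |= new_proteins
        rows.append((name, len(new_proteins), len(seen)))
    return rows
```

For example, `cumulative_contributions({"big": ["P1", "P2", "P3"], "mid": ["P2", "P3"], "small": ["P4"]})` shows the pattern the caption describes: the largest dataset supplies most proteins, a mid-sized one adds nothing new, and a small specialized one still contributes a unique protein.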
Figure 5. MassIVE-KB Supports Experimental Design with Frequency of Occurrences across Datasets
(A) The number of times a protein was observed in all MassIVE-KB searches; this varied from a single search to over 6,000, with a median of 48 occurrences. While many proteins are commonly identified in multiple searches from various datasets, there are also thousands of proteins that are only identified in a very small number of datasets corresponding to specific experimental procedures (e.g., affinity purification) or biological conditions (e.g., tissue). (B) The number of proteotypic precursors selected such that at least one precursor is present ≥90% of the time when a protein is observed. For most proteins, four or fewer precursors are sufficient to observe a protein in 90% of its occurrences (thus reinforcing the expectation that peptide observations should be generally consistent across datasets). Upon closer investigation, the tail of the distribution further revealed that for a subset of datasets where proteotypic peptides differed drastically from the norm, the discrepancy could be partially attributed to experimental procedures for peptide enrichment (e.g., phosphorylation and SUMOylation). (C) Out of the 10,186 protein-protein interactions between primary SwissProt isoforms reported in Bioplex 1.0 that were also co-identified in the MassIVE-KB reanalysis of Bioplex data, we found that 9,861 interactions (96.8%) would continue to be detectable using only the MassIVE-KB set of proteotypic precursors.
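The selection criterion in (B) resembles a partial set-cover problem, and one plausible way to approximate it is a greedy heuristic. The sketch below assumes each observation of a protein is recorded as the set of precursors seen in that search, and it repeatedly picks the precursor that covers the most still-uncovered observations until the target fraction is reached. The greedy strategy, the function name, and the precursor labels are assumptions made for illustration; the caption does not state which selection algorithm MassIVE-KB uses.

```python
from collections import Counter


def select_proteotypic(observations, target=0.9):
    """Greedily choose precursors until at least `target` of observations are covered.

    observations is a list of sets; each set holds the precursors detected in one
    search where the protein was identified. An observation counts as covered
    when it contains at least one selected precursor.
    """
    total = len(observations)
    if total == 0:
        return []
    uncovered = set(range(total))
    selected = []
    while (total - len(uncovered)) / total < target:
        counts = Counter(p for i in uncovered for p in observations[i])
        if not counts:
            break  # remaining observations contain no precursors at all
        # Highest count wins; ties go to the alphabetically first precursor.
        best, _ = max(sorted(counts.items()), key=lambda pair: pair[1])
        selected.append(best)
        uncovered = {i for i in uncovered if best not in observations[i]}
    return selected
```

In a toy case with ten observations, where `"A/2"` appears in seven, `"B/2"` alone in two, and `"C/3"` alone in one, the function returns `["A/2", "B/2"]`: two precursors are enough to cover nine of the ten occurrences.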

References

    1. Bairoch A, and Apweiler R (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 28, 45–48.
    2. Boutet E, Lieberherr D, Tognolli M, Schneider M, Bansal P, Bridge AJ, Poux S, Bougueleret L, and Xenarios I (2016). UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. Methods Mol. Biol 1374, 23–54.
    3. Chick JM, Kolippakkam D, Nusinow DP, Zhai B, Rad R, Huttlin EL, and Gygi SP (2015). A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat. Biotechnol 33, 743–749.
    4. Craig R, Cortens JC, Fenyo D, and Beavis RC (2006). Using annotated peptide mass spectrum libraries for protein identification. J. Proteome Res 5, 1843–1849.
    5. Deeb SJ, D’Souza RC, Cox J, Schmidt-Supprian M, and Mann M (2012). Super-SILAC allows classification of diffuse large B-cell lymphoma subtypes by their protein expression profiles. Mol. Cell. Proteomics 11, 77–89.