A high-confidence human plasma proteome reference set with estimated concentrations in PeptideAtlas

Terry Farrah¹, Eric W Deutsch, Gilbert S Omenn, David S Campbell, Zhi Sun, Julie A Bletz, Parag Mallick, Jonathan E Katz, Johan Malmström, Reto Ossola, Julian D Watts, Biaoyang Lin, Hui Zhang, Robert L Moritz, Ruedi Aebersold

PMID: 21632744
PMCID: PMC3186192
DOI: 10.1074/mcp.M110.006353

Terry Farrah et al. Mol Cell Proteomics. 2011 Sep;10(9):M110.006353. doi: 10.1074/mcp.M110.006353. Epub 2011 Jun 1.

Affiliation

¹ Institute for Systems Biology, Seattle, WA 98109, USA. tfarrah@systemsbiology.org

Abstract

Human blood plasma can be obtained relatively noninvasively and contains proteins from most, if not all, tissues of the body. Therefore, an extensive, quantitative catalog of plasma proteins is an important starting point for the discovery of disease biomarkers. In 2005, we showed that different proteomics measurements using different sample preparation and analysis techniques identify significantly different sets of proteins, and that a comprehensive plasma proteome can be compiled only by combining data from many different experiments. Applying advanced computational methods developed for the analysis and integration of very large and diverse data sets generated by tandem MS measurements of tryptic peptides, we have now compiled a high-confidence human plasma proteome reference set with well over twice the identified proteins of previous high-confidence sets. It includes a hierarchy of protein identifications at different levels of redundancy following a clearly defined scheme, which we propose as a standard that can be applied to any proteomics data set to facilitate cross-proteome analyses. Further, to aid in development of blood-based diagnostics using techniques such as selected reaction monitoring, we provide a rough estimate of protein concentrations using spectral counting. We identified 20,433 distinct peptides, from which we inferred a highly nonredundant set of 1929 protein sequences at a false discovery rate of 1%. We have made this resource available via PeptideAtlas, a large, multiorganism, publicly accessible compendium of peptides identified in tandem MS experiments conducted by laboratories around the world.
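One plausible way to turn spectral counts into rough concentration estimates is to calibrate against proteins whose concentrations are known and fit a straight line in log-log space. The sketch below is an illustration under that assumption, not the authors' implementation: the function names, the calibration values, and the units are invented, and the `min_count=4` cutoff simply mirrors the "fewer than four spectrum counts" exclusion mentioned in the Fig. 3 caption.

```python
import math


def fit_loglog(counts, concentrations, min_count=4):
    """Least-squares fit of log10(conc) = slope * log10(count) + intercept.

    Proteins with fewer than `min_count` spectra are left out of the fit.
    """
    pairs = [
        (math.log10(c), math.log10(q))
        for c, q in zip(counts, concentrations)
        if c >= min_count and q > 0
    ]
    if len(pairs) < 2:
        raise ValueError("need at least two calibration proteins")
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("calibration counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_concentration(count, slope, intercept):
    """Map a (normalized) spectral count to a rough concentration estimate."""
    return 10 ** (slope * math.log10(count) + intercept)


# Invented calibration data: spectral counts and known concentrations
# (arbitrary units). The first protein has only 2 spectra and is ignored.
calib_counts = [2, 10, 100, 1000]
calib_conc = [7.0, 100.0, 1000.0, 10000.0]

slope, intercept = fit_loglog(calib_counts, calib_conc)
estimate = estimate_concentration(50, slope, intercept)
```

Because the fit is done on logarithms, the resulting estimate is best read as an order-of-magnitude guide, in keeping with the "rough estimate" described in the abstract.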

Figures

**Fig. 1.**
*Left*: Search, analysis, and validation steps for each LC-MS/MS experiment. Spectra were searched against a spectral library or sequence database. The resulting PSMs were then processed using the TPP, including a new component, iProphet, to improve discrimination (see text for details). *Right*: The PeptideAtlas build process. ProteinProphet combines PSMs passing the FDR threshold for all experiments to create lists of distinct peptides, protein identifications, and protein groups. These data, along with supporting information such as consensus spectra, genome mappings, and proteotypic peptides, comprise a PeptideAtlas build.

**Fig. 2.**
**A, Six shaded bars (two of which overlap) represent sets of protein identifications at various levels of redundancy under the Cedar scheme.** Tallies are for the Human Plasma PeptideAtlas. Beginning at bottom:

- *Exhaustive* set: contains any protein sequence in the atlas' combined protein sequence database (Swiss-Prot 2010–04 + IPI v3.71 + Ensembl v57.37) that includes at least one identified peptide.
- *Sequence-unique* set: exhaustive set with exact duplicates removed.
- *Peptide-set-unique* set: a subset of the sequence-unique set within which no two protein sequences include the exact same set of identified peptides.
- *Not subsumed* set: peptide-set-unique set with subsumed protein sequences removed (those for which the identified peptides form a proper subset of the identified peptides for another protein sequence).
- *Canonical* set: a subset of the not subsumed set within which no protein sequence includes more than 80% of the peptides of any other member of the set. Protein sequences that are not subsumed but also not canonical are called *possibly distinguished*, because each has a peptide set that is close, but not identical, to that of a canonical protein sequence.
- *Covering* set: a minimal set of protein sequences that can explain all of the identified peptides.

B, Peptide-centric illustration of six protein sequences in a hypothetical ProteinProphet protein group, in order of descending ProteinProphet probability. Heavy lines represent protein chains (with invented identifiers); lighter lines represent observed peptides. Vertically aligned peptides are identical in sequence, and one instance of each is labeled with the letter of the highest probability protein to which it maps. A' is indistinguishable from A because it contains exactly the same set of observed peptides; both are equally likely to exist in the sample(s), but A is labeled *canonical* because its Swiss-Prot protein identifier is preferred. E is *subsumed* by A because its observed peptides form a subset of A's peptides; it is also subsumed by A', C, and D. Protein sequences B, C, and D are labeled *possibly distinguished* because the peptide set for each is slightly different from that of A. The three protein sequences with superscript C comprise the smallest subset of sequences sufficient to explain all the observed peptides in the group, and thus belong to the *covering* set.
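The set definitions in this caption can be sketched in a few lines of code. The example below is a minimal illustration, not the PeptideAtlas implementation: the function name `cedar_levels`, the placeholder sequences, and the peptide labels are all invented, chosen only so that the six proteins behave as described for panel B. Two simplifications are assumed: the canonical set is built greedily in preference order, and the covering set uses a greedy set-cover heuristic, which is not guaranteed to be minimal in general.

```python
def cedar_levels(proteins, max_shared=0.8):
    """proteins: list of (identifier, sequence, peptide_set) in descending
    preference order. Returns protein identifiers at each redundancy level."""
    # Exhaustive: every sequence with at least one identified peptide.
    exhaustive = [p for p in proteins if p[2]]

    # Sequence-unique: drop exact duplicate sequences (keep the preferred one).
    seen_seqs, sequence_unique = set(), []
    for p in exhaustive:
        if p[1] not in seen_seqs:
            seen_seqs.add(p[1])
            sequence_unique.append(p)

    # Peptide-set-unique: drop sequences with an identical peptide set.
    seen_sets, peptide_set_unique = set(), []
    for p in sequence_unique:
        key = frozenset(p[2])
        if key not in seen_sets:
            seen_sets.add(key)
            peptide_set_unique.append(p)

    # Not subsumed: peptide set is not a proper subset of another's.
    not_subsumed = [
        p for p in peptide_set_unique
        if not any(set(p[2]) < set(q[2]) for q in peptide_set_unique)
    ]

    # Canonical (greedy assumption): accept a sequence only if it shares at
    # most 80% of the peptides of the smaller set with every accepted member.
    canonical = []
    for p in not_subsumed:
        pep = set(p[2])
        if all(len(pep & set(c[2])) <= max_shared * min(len(pep), len(c[2]))
               for c in canonical):
            canonical.append(p)

    # Covering (greedy heuristic): repeatedly take the sequence that explains
    # the most still-unexplained peptides; ties go to the preferred sequence.
    uncovered = set().union(*(p[2] for p in not_subsumed))
    covering = []
    while uncovered:
        best = max(not_subsumed, key=lambda p: len(uncovered & set(p[2])))
        covering.append(best)
        uncovered -= set(best[2])

    ids = lambda ps: [p[0] for p in ps]
    return {
        "exhaustive": ids(exhaustive),
        "sequence_unique": ids(sequence_unique),
        "peptide_set_unique": ids(peptide_set_unique),
        "not_subsumed": ids(not_subsumed),
        "canonical": ids(canonical),
        "possibly_distinguished": ids(p for p in not_subsumed if p not in canonical),
        "covering": ids(covering),
    }


# Invented protein group loosely modeled on panel B (placeholder sequences).
shared = {"p1", "p2", "p3", "p4", "p5", "p6"}
group = [
    ("A", "SEQ_A", set(shared)),
    ("A'", "SEQ_A_PRIME", set(shared)),
    ("B", "SEQ_B", {"p1", "p2", "p3", "p4", "p5", "b1"}),
    ("C", "SEQ_C", {"p2", "p3", "p4", "p5", "p6", "c1"}),
    ("D", "SEQ_D", {"p1", "p3", "p4", "p5", "p6", "c1"}),
    ("E", "SEQ_E", {"p5", "p6"}),
]
levels = cedar_levels(group)
```

With this invented input, A' drops out at the peptide-set-unique level, E drops out as subsumed, A is the only canonical sequence, and three sequences suffice to cover all peptides.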

**Fig. 3.**
Plasma protein concentrations determined using immunoassay and antibody microarray analysis (40) *versus* normalized spectral counts from the Human Plasma *Non-glyco* PeptideAtlas, plotted on a log scale. Each small square represents a protein found in both sources. Hollow squares represent proteins that were excluded when drawing the trend line, either because they were depleted (albumin) or because they had fewer than four spectrum counts. The line segments above and below the trend line are fit to the standard deviation of the y axis values computed at intervals of 0.1 (log scale). The arrows on the *left* represent proteins with reported concentrations in (40) but no spectrum counts. The histogram at the *right* depicts an estimate of the completeness of the Human Plasma Non-glyco PeptideAtlas as a function of concentration, calculated as the number of points divided by the total number of points and arrows within each decade. See supplemental Fig. S2 for the *N-Glyco* atlas.
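The completeness estimate described in this caption (points divided by points plus arrows, per decade of concentration) can be sketched as follows. This is an illustrative sketch only: the function name and the concentration values are invented, and units are arbitrary.

```python
import math
from collections import Counter


def completeness_by_decade(detected, undetected):
    """Fraction of proteins with spectral counts within each decade.

    detected:   reported concentrations of proteins seen in the atlas (points)
    undetected: reported concentrations of proteins with no spectra (arrows)
    """
    found = Counter(math.floor(math.log10(c)) for c in detected)
    missed = Counter(math.floor(math.log10(c)) for c in undetected)
    decades = sorted(set(found) | set(missed))
    return {d: found[d] / (found[d] + missed[d]) for d in decades}


# Invented concentrations in arbitrary units.
points = [2e6, 5e6, 3e3, 40]
arrows = [7e3, 15, 60, 2]
completeness = completeness_by_decade(points, arrows)
```

The dictionary keys are decade exponents, so key `3` covers concentrations from 10^3 up to (but not including) 10^4.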

**Fig. 4.**
**Proteins identified by each experiment.** Each bar represents one of the 91 experiments, ordered as in supplemental Table S4. Height of dark bar = canonical protein sequences identified per experiment; total height (dark + light) = cumulative tally; width of bar = PSM count. See supplemental Fig. S5 for a similar graph of distinct peptides.
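The per-experiment and cumulative tallies plotted in this figure amount to a running set union over the ordered experiments. A minimal sketch, with an invented function name and invented experiment and protein identifiers:

```python
def cumulative_tally(experiments):
    """experiments: ordered list of (name, iterable of protein identifiers).

    Returns (name, proteins in this experiment, cumulative distinct proteins).
    """
    seen = set()
    rows = []
    for name, proteins in experiments:
        current = set(proteins)
        seen |= current  # running union across experiments so far
        rows.append((name, len(current), len(seen)))
    return rows


# Invented data: three experiments with overlapping identifications.
example = [
    ("exp1", ["P1", "P2", "P3"]),
    ("exp2", ["P2", "P3", "P4"]),
    ("exp3", ["P1", "P5"]),
]
tally = cumulative_tally(example)
```

The cumulative value depends on the order of the experiments, which is why the caption ties the ordering to supplemental Table S4.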

References

1. Putnam F. W., ed. (1975–1989) The Plasma Proteins, 2nd Ed., Academic Press, New York
2. Anderson N. L., Anderson N. G. (2002) The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell Proteomics 1, 845–867
3. Kersey P. J., Duarte J., Williams A., Karavidopoulou Y., Birney E., Apweiler R. (2004) The International Protein Index: an integrated database for proteomics experiments. Proteomics 4, 1985–1988
4. Omenn G. S., States D. J., Adamski M., Blackwell T. W., Menon R., Hermjakob H., Apweiler R., Haab B. B., Simpson R. J., Eddes J. S., Kapp E. A., Moritz R. L., Chan D. W., Rai A. J., Admon A., Aebersold R., Eng J., Hancock W. S., Hefta S. A., Meyer H., Paik Y. K., Yoo J. S., Ping P., Pounds J., Adkins J., Qian X., Wang R., Wasinger V., Wu C. Y., Zhao X., Zeng R., Archakov A., Tsugita A., Beer I., Pandey A., Pisano M., Andrews P., Tammen H., Speicher D. W., Hanash S. M. (2005) Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core data set of 3020 proteins and a publicly-available database. Proteomics 5, 3226–3245
5. Omenn G., ed. (2006) Exploring the Human Plasma Proteome, Wiley-VCH, New York, NY
