J Proteome Res. 2016 Nov 4;15(11):4126-4134.
doi: 10.1021/acs.jproteome.6b00095. Epub 2016 Jul 19.

Data-Driven Approach To Determine Popular Proteins for Targeted Proteomics Translation of Six Organ Systems

Maggie P Y Lam et al. J Proteome Res.

Abstract

Amidst the proteomes of human tissues lie subsets of proteins that are closely involved in conserved pathophysiological processes. Much of biomedical research concerns interrogating disease signature proteins and defining their roles in disease mechanisms. With advances in proteomics technologies, it is now feasible to develop targeted proteomics assays that can accurately quantify protein abundance as well as post-translational modifications; however, with a rapidly accumulating number of studies implicating proteins in diseases, current resources are insufficient to target every protein, so proteins with high significance and impact must be judiciously prioritized for assay development. We describe here a data science method to prioritize and expedite assay development on high-impact proteins across research fields by leveraging the biomedical literature record to rank and normalize proteins that are popularly and preferentially published by biomedical researchers. We demonstrate this method by finding priority proteins across six major physiological systems (cardiovascular, cerebral, hepatic, renal, pulmonary, and intestinal). The described method is data-driven and builds upon the collective knowledge of previous publications referenced on PubMed to lend objectivity to target selection. The method and resulting popular protein lists may also be useful for exploring biological processes associated with various physiological systems and research topics, in addition to benefiting ongoing efforts to facilitate the broad translation of proteomics technologies.

Keywords: bibliometrics; common proteins; data science; human tissue convergence; proteomics translation; semantics; targeted proteomics.

Figures

Figure 1
Topic-specific publication counts in six major organ systems. (A) Computational workflow to automatically derive the number of publications referenced to proteins in our custom PubMed queries. System-relevant publications are retrieved from PubMed with specific search terms (List 1). A cross-reference between PMIDs and GeneIDs is retrieved from the NCBI FTP Web site (List 2). A custom software tool suite matches List 2 to List 1. For a user-specified species, the software counts the unique occurrences of each protein in each year and converts GeneIDs to UniProt/SwissProt accessions. Lastly, the software computes the normalized copublication distance (NCD) between a protein and the queried topic. (B) Summary statistics of mouse (orange) and human (blue) proteins referenced to publications in each system. Left: the total number of referenced publications. Right: the total number of distinct proteins with at least five publications for each system. (C) The number of topic-specific publications per protein resembles a logarithm–logarithm relationship with regard to protein rank. The number of referenced articles decreases sharply after the top 50 proteins in the queried tissues, with the next 50 proteins accounting for approximately one-third as many publications as the first 50.
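The matching and counting steps in panel A can be illustrated with a minimal Python sketch. It is not the authors' tool suite: the function name, the in-memory data layout, and the identifiers in the example are assumptions made for illustration, and the per-year breakdown, species filter, GeneID-to-UniProt conversion, and NCD calculation are left out.

```python
from collections import defaultdict


def count_topic_publications(topic_pmids, pmid_gene_pairs):
    """Count distinct topic publications per gene.

    topic_pmids: PMIDs returned by the topic query (List 1).
    pmid_gene_pairs: (pmid, gene_id) cross-references (List 2).
    Both layouts are assumptions for this sketch.
    """
    topic = set(topic_pmids)
    pubs_by_gene = defaultdict(set)
    for pmid, gene_id in pmid_gene_pairs:
        if pmid in topic:
            # A set keeps each publication once, even if the pair repeats.
            pubs_by_gene[gene_id].add(pmid)
    return {gene: len(pmids) for gene, pmids in pubs_by_gene.items()}


# Made-up identifiers, for illustration only.
topic_pmids = [101, 102, 103]
pairs = [(101, "G1"), (102, "G1"), (102, "G1"), (103, "G2"), (999, "G2")]

counts = count_topic_publications(topic_pmids, pairs)
# Rank genes by topic-specific publication count, highest first.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Storing PMIDs in a set per gene is one simple way to count "unique occurrences"; publications outside the topic query (PMID 999 here) never contribute to a count.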
Figure 2
Identifying topic-relevant significant proteins using normalized copublication distance (NCD). (A) The multiplicity of occurrence of the top 50 proteins in each of the six examined systems is shown. Proteins with a multiplicity of six (e.g., TP53) are found in the top 50 most published proteins in all six examined systems and are colored in dark brown. Proteins with a multiplicity of one (e.g., BDNF in the cerebral system) are in the top 50 in only one of the six organ systems queried. (B) NCD normalizes the number of publications that reference a particular protein within a particular topic by the total number of publications that reference that protein in any topic. In contrast with ranking proteins by total publication count, NCD down-ranks proteins that are of general interest with large numbers of publications in multiple fields (e.g., certain proteins in tumorigenesis pathways) and promotes query-specific proteins, such that top-ranked proteins by NCD (right) are mostly organ-specific. (C) The distribution of NCD values for proteins in a query (black line) follows a normal (Gaussian) distribution (red line), with a mean of 1.0 and standard deviation (sd) of 0.1. (D) The graphs show the number of publications for each protein referenced in a queried tissue (ordinates), plotted against the NCD between the protein and the tissue (abscissae). Line and shade: locally weighted scatterplot smoothing regression and its 95% confidence interval, respectively. Proteins with significant NCD (Z ≤ −1.96) are colored in blue. The top protein in each query is labeled in red text.
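The significance cutoff in panel D (Z ≤ −1.96) can be sketched as a standard Z-score filter over the NCD values of one query. This is a hedged illustration only: it assumes Z is computed from the empirical mean and population standard deviation of the NCD values in the query, which the caption does not state, and it takes precomputed NCD values as input rather than deriving them. The function name and the example values are hypothetical.

```python
from statistics import mean, pstdev


def significant_by_ncd(ncd_by_protein, z_cutoff=-1.96):
    """Return {protein: z} for proteins whose NCD Z-score is <= z_cutoff.

    Assumption: Z = (NCD - mean) / sd over all proteins in the query.
    A lower NCD means a protein is closer to the queried topic.
    """
    values = list(ncd_by_protein.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {}  # no spread, so nothing can stand out
    z_scores = {p: (v - mu) / sigma for p, v in ncd_by_protein.items()}
    return {p: z for p, z in z_scores.items() if z <= z_cutoff}


# Made-up NCD values: nine proteins at 1.0 and one unusually close protein.
ncd = {f"P{i}": 1.0 for i in range(1, 10)}
ncd["P10"] = 0.7

hits = significant_by_ncd(ncd)
print(sorted(hits))
```

Only the lower tail is tested because a small distance, not a large one, marks a protein as preferentially published in the queried topic.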
Figure 3
High-impact proteins in six organ systems in (A) human and (B) mouse. The gene names and protein names of the top five proteins, as determined by their normalized copublication distance within the queried organ system in the literature, are shown. The identities of the top proteins in each system indicate both organ-system-specific and species-specific differences in the focus of biomedical research.
Figure 4
Popular protein networks. (A) Pairwise normalized copublication distance matrices of top proteins in the cardiovascular and the cerebral system are shown. Cells in the heat map represent the normalized copublication distance between each protein–protein pair via their copublication history (red: greater number of copublications). Proteins may be clustered into identifiable pathways that are known to play significant roles in the physiology of each system, as shown on the left, suggesting that the described method of using literature records to identify essential proteins readily recapitulates known biology (PB–H: Benjamini–Hochberg adjusted P value of enrichment). (B) Proximal proteins of ten of the top proteins in the cardiovascular system are visualized in protein–protein interaction network graphs. The color of each node denotes the normalized copublication distance of a protein to cardiovascular research, where darker colors denote that a protein is more preferentially found in cardiovascular publications compared with other fields. The size of each node denotes its publication count in cardiovascular-relevant publications; size increases with increasing publication count. Selected hub genes and highly published cardiovascular proteins are labeled in black; in addition, proteins in the network with fewer than 10 publications are labeled in blue and represent proteins that are associated with popular proteins via protein–protein interactions but are themselves yet to be heavily investigated.

