Database (Oxford). 2016 Apr 5:2016:baw030.

doi: 10.1093/database/baw030. Print 2016.

Genic insights from integrated human proteomics in GeneCards

Simon Fishilevich¹, Shahar Zimmerman², Asher Kohn³, Tsippi Iny Stein², Tsviya Olender², Eugene Kolker⁴, Marilyn Safran², Doron Lancet²

Affiliations

¹ Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, 7610001, Israel; simon.fishilevich@weizmann.ac.il.
² Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, 7610001, Israel.
³ LifeMap Sciences Ltd., Tel Aviv 69710, Israel.
⁴ CDO Analytics, Seattle Children's Hospital, Seattle, WA 98101, USA; Bioinformatics and High-Throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, WA 98101, USA; Data-Enabled Life Sciences Alliance (DELSA), Seattle, WA 98101, USA; Departments of Biomedical Informatics and Medical Education and Pediatrics, University of Washington School of Medicine, Seattle, WA 98109, USA; Department of Chemistry and Chemical Biology, Northeastern University College of Science, Boston, MA 02115, USA.

PMID: 27048349
PMCID: PMC4820835
DOI: 10.1093/database/baw030

Simon Fishilevich et al. Database (Oxford). 2016.

Abstract

GeneCards is a one-stop shop for searchable human gene annotations (http://www.genecards.org/). Data are automatically mined from ∼120 sources and presented in an integrated web card for every human gene. We report the application of recent advances in proteomics to enhance gene annotation and classification in GeneCards. First, we constructed the Human Integrated Protein Expression Database (HIPED), a unified database of protein abundance in human tissues, based on the publicly available mass spectrometry (MS)-based proteomics sources ProteomicsDB, Multi-Omics Profiling Expression Database, Protein Abundance Across Organisms and The MaxQuant DataBase. The integrated database, residing within GeneCards, compares favourably with its individual sources, covering nearly 90% of human protein-coding genes. For gene annotation and comparisons, we first defined a protein expression vector for each gene, based on normalized abundances in 69 normal human tissues. This vector is portrayed in the GeneCards expression section as a bar graph, allowing visual inspection and comparison. These data are juxtaposed with transcriptome bar graphs. Using the protein expression vectors, we further defined a pairwise metric that helps assess expression-based pairwise proximity. This new metric for finding functional partners complements eight others, including sharing of pathways, gene ontology (GO) terms and domains, implemented in the GeneCards Suite. In parallel, we calculated proteome-based differential expression, highlighting a subset of tissues that overexpress a gene and subserving gene classification. This textual annotation allows users of VarElect, the suite's next-generation phenotyper, to more effectively discover causative disease variants. Finally, we define the protein-RNA expression ratio and correlation as yet another attribute of every gene in each tissue, adding further annotative information. The results constitute a significant enhancement of several GeneCards sections and help promote and organize the genome-wide structural and functional knowledge of the human proteome.
Database URL: http://www.genecards.org/.
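
The per-gene expression vectors, the pairwise proximity metric and the protein-RNA attributes described above can be illustrated with a minimal sketch. It assumes that proximity is measured with Pearson's correlation (the coefficient named in the figure captions) and that the protein-RNA ratio is a plain per-tissue quotient; the paper's normalization and unit conversion are not reproduced here. The function names, tissue names and abundance values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (>= 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat vector is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def protein_rna_ratio(protein, rna):
    """Per-tissue protein/RNA ratio; None where the RNA value is zero."""
    return [p / r if r else None for p, r in zip(protein, rna)]


# Hypothetical abundances over four tissues (illustrative values only).
tissues = ["brain", "liver", "heart", "plasma"]
protein_a = [120.0, 5.0, 8.0, 0.0]
protein_b = [240.0, 10.0, 16.0, 0.0]  # same tissue pattern as gene A
protein_c = [0.0, 300.0, 20.0, 50.0]  # liver-dominated pattern
rna_a = [60.0, 10.0, 4.0, 0.0]

print(pearson(protein_a, protein_b))        # proximity of genes A and B
print(pearson(protein_a, protein_c))        # proximity of genes A and C
print(pearson(protein_a, rna_a))            # protein-RNA correlation of gene A
print(protein_rna_ratio(protein_a, rna_a))  # per-tissue ratio of gene A
```

In this toy data, genes A and B share a tissue pattern and therefore score close to 1, while the liver-dominated gene C correlates negatively with A.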

Figures

**Figure 1.**
HIPED—Human Integrated Protein Expression Database. (A) HIPED architecture scheme. (B) Classification of the 771 proteomes in HIPED. (C) Gene and sample type counts of HIPED and its mined components. Sample types are unique normal anatomical entities or cell lines represented in each source. (D) Gene content overlap of HIPED mined sources.

**Figure 2.**
Protein abundance distribution (PPM values) of mined datasets in HIPED. (A) All 771 samples comprising HIPED. (B) Selected sample groups of similar anatomical entities.

**Figure 3.**
Averaged protein expression vectors. Averaged protein abundance vectors of 53 selected genes across the 69 anatomical entities in HIPED.

**Figure 4.**
Double hierarchical clustering of the 16 900 genes in 69 normal anatomical entities. Examples of gene groups sharing functional annotations are highlighted. (A) CNS: 397 genes enriched with diseases such as schizophrenia, pathways such as neuroscience and GO terms such as transporter activity. (B) Blood: 301 genes enriched with diseases such as obesity and C2 deficiency and GO terms such as complement activation. (C) Immune system: 483 genes enriched with diseases such as rheumatoid arthritis and pathways such as lymphocyte signaling. (D) Genes with housekeeping properties: 1771 genes enriched with pathways and GO terms related to metabolism and gene expression. See Supplementary Tables S5–S8 for the full enrichment analysis data.

**Figure 5.**
K-means clustering (53 clusters) of the 16 900 genes in normal human proteomes.

**Figure 6.**
PCA of 16 900 genes comprising HIPED normal proteomes. (A) Gene expression breadth. Expression breadth is one of the gene expression vector signatures that determine a gene's position in the PCA space. This feature is closely related to the first component of the PCA. (B) Subcellular localization. Subcellular localization data from COMPARTMENTS (31) were projected onto the gene expression space. Only genes having the maximal confidence score of 5 for a single subcellular compartment are shown. (C) Single tissue and housekeeping genes. All 2320 genes expressed in a single anatomical entity are shown, representing the tissue-specific dimensions in the expression space (left panel; different colors distinguish tissues). Genes with housekeeping properties populate a specific area in the PCA space (right panel). The top 50 genes were selected by the highest pairwise similarity between a gene's across-tissue protein abundance pattern and an in silico ‘ideal’ housekeeping profile (uniform expression of 10 000 PPM across all tissues and cells and 0 PPM across fluids).

**Figure 7.**
Protein abundance data in GeneCards. A screenshot of protein expression-based data for the gene DPYSL2, including (i) a protein expression chart; (ii) a list of tissues in which the gene is differentially expressed; (iii) a list of the gene's expression partners. DPYSL2 plays a role in neuronal development and polarity, as well as in axon growth and guidance, neuronal growth cone collapse and cell migration (25). Protein expression charts are created via the GeneCards automated expression charts pipeline. To optimize user perception of the expression, values are displayed using a special root scale (35). This scale enables viewing many orders of magnitude, as on a logarithmic scale, but preserves certain characteristics of a linear scale in which the differences increase with the orders of magnitude.

**Figure 8.**
Pairwise correlation distribution. The fraction distribution of the pairwise Pearson’s correlation coefficients for the 16 900 proteome-annotated genes is plotted alongside that of randomly generated genes. The ratio between the compared fraction distributions was plotted, disregarding bins with extremely low (<8 × 10⁻⁵) fraction values. Real data vectors exhibit significantly different (more positive) correlation values than the random controls (Wilcoxon rank sum tailed test, P < 10⁻⁵).

**Figure 9.**
Pairwise correlation distribution. The fraction distribution of the pairwise Pearson’s correlation coefficients for the 16 900 proteome-annotated genes is plotted alongside that of gene pairs sharing functional attributes, namely: (A) sequence paralogs, (B) diseases, (C) PPIs and (D) biological pathways. The ratio between the compared fraction distributions was plotted, disregarding bins with extremely low (<8 × 10⁻⁵) fraction values.

**Figure 10.**
Gene partner count and expression breadth. A heat map showing counts of genes according to bins of partner counts and expression breadth.

**Figure 11.**
Double hierarchical clustering of the differential expression binary matrix. The analysis included 16 366 genes with differential expression annotation, belonging to the 5839 non-zero patterns. The Jaccard coefficient was used as the distance metric.
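
As a rough illustration of the distance used for this clustering, the sketch below computes a Jaccard distance between two binary differential expression patterns. It assumes one row per gene and one column per tissue, with 1 marking a tissue in which the gene is differentially expressed. The function name and the example patterns are hypothetical, and the handling of all-zero patterns is an assumption (the caption indicates that only non-zero patterns were analysed).

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length binary patterns."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)     # tissues flagged in both
    either = sum(1 for x, y in zip(a, b) if x or y)    # tissues flagged in at least one
    if either == 0:
        return 0.0  # assumption: two all-zero patterns count as identical
    return 1.0 - both / either


# Hypothetical patterns over four tissues (illustrative values only).
gene_x = [1, 1, 0, 0]
gene_y = [1, 0, 0, 0]
gene_z = [0, 0, 1, 1]

print(jaccard_distance(gene_x, gene_y))  # one shared tissue out of two flagged
print(jaccard_distance(gene_x, gene_z))  # no shared tissues
```

Unlike a correlation, this distance ignores tissues in which neither gene is flagged, which suits sparse binary matrices.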

**Figure 12.**
Comparisons of protein and RNA vectors. (A) Distribution of Pearson’s correlation coefficient between protein and RNA tissue vectors of every gene in the protein–RNA comparison (red). This distribution is significantly different from the randomized controls (blue, P-value of t-test <10⁻³). (B) Sub-division of each correlation bin using gene fractions according to the number of tissues with protein abundance data. (C) Distribution of across-tissue averaged P/R cell copy number ratio of every gene in the protein–RNA comparison. Functional enrichment analysis reveals that genes in the upper 10th percentile show a significant enrichment for metabolic and structural functions, while genes in the lower 10th percentile are enriched with signaling and regulation of transcription (Supplementary Tables S15 and S16). (D) Box plot of P/R ratios, showing 30 selected genes from the distribution peak and both edges.

**Figure 13.**
Comparison of gene protein and RNA vectors. A 3D scatter plot of 13 411 genes, using protein–RNA correlation and P/R mean ratio as the X and Y axes, respectively. The Z axis along with the color scale represents the number of tissues with protein data.

References

1. Mann M., Kulak N.A., Nagaraj N. et al. (2013) The coming age of complete, accurate, and ubiquitous proteomes. Mol. Cell, 49, 583–590.
2. Legrain P., Aebersold R., Archakov A. et al. (2011) The human proteome project: current state and future direction. Mol. Cell Proteomics, 10, M111.009993.
3. Paik Y.K., Jeong S.K., Omenn G.S. et al. (2012) The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat. Biotechnol., 30, 221–223.
4. Kim M.S., Pinto S.M., Getnet D. et al. (2014) A draft map of the human proteome. Nature, 509, 575–581.
5. Wilhelm M., Schlegl J., Hahne H. et al. (2014) Mass-spectrometry-based draft of the human proteome. Nature, 509, 582–587.
