Hierarchical clustering of shotgun proteomics data
- PMID: 21447708
- PMCID: PMC3108832
- DOI: 10.1074/mcp.M110.003822
Abstract
A new result report for Mascot search results is described. A greedy set cover algorithm is used to create a minimal set of proteins, which is then grouped into families on the basis of shared peptide matches. Protein families with multiple members are represented by dendrograms, generated by hierarchical clustering using the score of the nonshared peptide matches as a distance metric. The peptide matches to the proteins in a family can be compared side by side to assess the experimental evidence for each protein. If the evidence for a particular family member is considered inadequate, the dendrogram can be cut to reduce the number of distinct family members.
Figures



1. Let P be the set of all proteins in the family, and let S1 and S2 be empty sets of proteins.
2. While there are proteins in P:
2.1. Select a protein p from P such that p covers the most free peptides, meaning p has the maximum number of peptides not yet in any protein in S1.
2.2. If at least one of p's peptides is contained by a protein in S1:
2.2.1. Let Q be the subset of S1 in which every protein shares at least one peptide with p.
2.2.2. For each protein q in Q: if all of q's peptides are contained by p plus the other proteins in Q, q would be an intersection after the addition of p. Move q from S1 to S2.
2.3. Move p from P to S1.
2.4. For each protein q in P: move q from P to S2 if q is an intersection in S1, meaning all of q's peptides are contained by some set of proteins in S1.
The set of proteins S1 contains a heuristic minimum set of proteins covering all peptides in this family, whereas S2 contains proteins that are subsets or intersections of proteins in S1. (The reason step 2.2 is before step 2.3 is that this makes it easier to prove S1 never contains proteins that are subsets or intersections.)
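
The steps above can be sketched in Python as follows. This is a minimal illustration, not the Mascot implementation: the function name `minimal_protein_set`, the dictionary input format, and the peptide labels are hypothetical. Two details the listing leaves open are filled in as assumptions: ties in step 2.1 are broken alphabetically by protein name, and in step 2.2.2 each q is tested against the proteins of Q that are still in S1 at that moment, so that two proteins cannot be removed on the strength of each other.

```python
def minimal_protein_set(family):
    """Greedy cover of one protein family.

    family: dict mapping protein name -> set of matched peptides.
    Returns (s1, s2): s1 is the heuristic minimal cover, s2 holds the
    proteins that are subsets or intersections of proteins in s1.
    """
    remaining = set(family)  # P
    s1, s2 = [], []

    def peptides_of(names):
        # Union of the peptides of the named proteins.
        covered = set()
        for name in names:
            covered |= family[name]
        return covered

    while remaining:
        covered = peptides_of(s1)
        # Step 2.1: most free peptides; ties broken by name (assumption).
        p = min(remaining, key=lambda n: (-len(family[n] - covered), n))
        # Step 2.2: p overlaps something already in S1.
        if family[p] & covered:
            q_set = [q for q in s1 if family[q] & family[p]]  # step 2.2.1
            for q in list(q_set):
                others = [o for o in q_set if o != q]
                # Step 2.2.2: q becomes an intersection once p is added.
                if family[q] <= family[p] | peptides_of(others):
                    q_set.remove(q)  # sequential evaluation (assumption)
                    s1.remove(q)
                    s2.append(q)
        # Step 2.3
        remaining.remove(p)
        s1.append(p)
        # Step 2.4: drop proteins now fully explained by S1.
        covered = peptides_of(s1)
        for q in sorted(remaining):
            if family[q] <= covered:
                remaining.remove(q)
                s2.append(q)
    return s1, s2


example = {
    "A": {"pep1", "pep2", "pep3", "pep4"},
    "B": {"pep1", "pep2", "pep5", "pep7"},
    "C": {"pep3", "pep4", "pep6", "pep7"},
}
print(minimal_protein_set(example))  # (['B', 'C'], ['A'])
```

In this example A is selected first, but once B and C are both in S1 every peptide of A is explained by them, so step 2.2.2 moves A to S2.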


