Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011 Jun;10(6):M110.003822.
doi: 10.1074/mcp.M110.003822. Epub 2011 Mar 29.

Hierarchical clustering of shotgun proteomics data

Affiliations

Hierarchical clustering of shotgun proteomics data

Ville R Koskinen et al. Mol Cell Proteomics. 2011 Jun.

Abstract

A new result report for Mascot search results is described. A greedy set cover algorithm is used to create a minimal set of proteins, which is then grouped into families on the basis of shared peptide matches. Protein families with multiple members are represented by dendrograms, generated by hierarchical clustering using the score of the nonshared peptide matches as a distance metric. The peptide matches to the proteins in a family can be compared side by side to assess the experimental evidence for each protein. If the evidence for a particular family member is considered inadequate, the dendrogram can be cut to reduce the number of distinct family members.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
(i) A and B are distinct proteins, with no shared matches. There is evidence for both and both should be listed in the report; (ii) A and B are same-set proteins (also termed indistinguishable (1) or equivalent (3, 9)). The report should make it clear that both are equally valid assignments of the peptide matches and that either could be present in the sample. The possibility that both are present is rejected by parsimony; (iii) B is a subset of protein A. B may be present in the sample, but there is no evidence for this, so by parsimony it is relegated to an inferior status or dropped entirely from the report; (iv) A and B are related through shared matches but there is evidence for both being present in the sample. They should both be listed in the report, ideally with some indication of their relationship; (v) Protein C is an intersection protein, a subset of the combined matches to A and B (also termed subsumable (1, 3, 9)). By parsimony, it is relegated to an inferior status or dropped entirely from the report.
Fig. 2.
Fig. 2.
A truly minimal list of proteins would contain only A, B, and C. Protein D would be an intersection protein, even though it might have the greatest coverage and the highest score.
Fig. 3.
Fig. 3.
The grouping and filtering algorithm used in the new report. Note 1. The protein score is the average score threshold plus the sum over all peptide matches of the excess of the peptide match score over the score threshold. Nonsignificant peptide match scores make no contribution to the protein score, which is essential to avoid scores from random matches accumulating into substantial protein scores when the search contains a large number of spectra relative to the number of database entries. If a protein contains a single significant peptide match, the protein score is the same as the peptide match score. If there are duplicate matches, each contributes to the score. The default peptide match score threshold is the Mascot homology threshold, which is an empirical estimate of whether the score is an outlier. The significance threshold is 5% by default, and can be changed in the user interface. Note 2. When selecting the proteins that contain a given match, if there are other matches to the same spectrum with identical scores, proteins containing these other matches are also selected. The rationale is that, if there is no score difference between two sequences, we cannot distinguish them and they should be treated symmetrically. An example would be two peptides that had identical sequences apart from interchanges of I and L. Where there are duplicate matches, it can happen that for one spectrum, two similar sequences get the same score whereas, for another spectrum, they get different scores. In such cases, the sequences are treated as distinguishable. Note 3. Finding all possible intersections so as to achieve maximum parsimony is an “NP-hard” problem. We use an iterative method to rapidly find a solution that is acceptably close to the optimum. The algorithm is based on the “greedy set cover algorithm” (35) used in IDPicker (9). We have added two pruning steps that further reduce the number of proteins to inspect. In the following pseudo code, a free peptide means a peptide that is not contained by any protein in the result set S1:
  1. Let P be the set of all proteins in the family and S1 and S2 be empty sets of proteins.

  2. While there are proteins in P:

    1. 2.1. Select a protein p from P such that p covers the most free peptides, meaning p has the maximum number of peptides not yet in any protein in S1.

    2. 2.2. If at least one of p 's peptides is contained by a protein in S1:

      1. 2.2.1. Let Q be a subset of S1 where all proteins in Q share at least one peptide with p.

      2. 2.2.2. For each protein q in Q: if all of q 's peptides are contained by p plus the other proteins in Q, q would be an intersection after the addition of p. Move q from S1 to S2.

    3. 2.3. Move p from P to S1.

    4. 2.4. For each protein q in P: move q from P to S2 if q is an intersection in S1, meaning all of q 's peptides are contained by some set of proteins in S1.

  3. The set of proteins S1 contains a heuristic minimum set of proteins covering all peptides in this family, whereas S2 contains proteins that are subsets or intersections of proteins in S1. (The reason step 2.2 is before step 2.3 is that this makes it easier to prove S1 never contains proteins that are subsets or intersections.)

Note 4. In our terminology, a family is a set of proteins related by shared peptide matches. A family member is a set of proteins corresponding to the same set of peptide matches. There is no evidence to distinguish the proteins in a family member and no reason to prefer one over another. The choice of one protein from a family member as the anchor, to be listed first and used to label the dendrogram, does not indicate a preference. Note 5. The distance metric is the score of the nonshared peptide sequences, but similar results would be obtained using peptide lengths or simply the count. The reason for choosing score is that our confidence in the match increases with score. We contend that a nonshared match close to the score threshold is less evidence for the presence of two proteins than a match with a very high score. Functions from the Cluster 3.0 library (Michiel de Hoon, University of Tokyo, Human Genome Center) are called to perform agglomerative, single linkage clustering.
Fig. 4.
Fig. 4.
A protein family found by searching the iPRG2008 data against the IPRG2008 study database. The display has been partly expanded to show details of the anchor proteins and the peptide matches. Only peptide matches with scores above a 5% homology threshold are displayed.
Fig. 5.
Fig. 5.
A family of cytochrome P450 proteins from the same report as Fig. 4.
Fig. 6.
Fig. 6.
(upper) Family 15 from the results of searching the iPRG2008 data against the Mus division of EST sequences from EMBL release 104. (lower) Alignment of the two sequences using ClustalW.

References

    1. Nesvizhskii A. I., Aebersold R. (2005) Interpretation of shotgun proteomic data - The protein inference problem. Mol. Cell. Proteomics 4, 1419–1440 - PubMed
    1. Li N., Wu S. F., Zhu Y. P., Yang X. M. (2009) The progress of protein quality control methods in shotgun proteomics. Prog. Biochem. Biophys. 36, 668–675
    1. Yang X., Dondeti V., Dezube R., Maynard D. M., Geer L. Y., Epstein J., Chen X., Markey S. P., Kowalak J. A. (2004) DBParser: web-based software for shotgun proteomic data analyses. J. Proteome Res. 3, 1002–1008 - PubMed
    1. Slotta D. J., McFarland M. A., Markey S. P. (2010) MassSieve: Panning MS/MS peptide data for proteins. Proteomics 10, 3035–3039 - PMC - PubMed
    1. Kristensen D. B., Brønd J. C., Nielsen P. A., Andersen J. R., Sørensen O. T., Jørgensen V., Budin K., Matthiesen J., Venø P., Jespersen H. M., Ahrens C. H., Schandorff S., Ruhoff P. T., Wisniewski J. R., Bennett K. L., Podtelejnikov A. V. (2004) Experimental Peptide Identification Repository (EPIR): An integrated peptide-centric platform for validation and mining of tandem mass spectrometry data. Mol. Cell. Proteomics 3, 1023–1038 - PubMed

Substances

LinkOut - more resources