J Proteome Res. 2008 Jan;7(1):113-22.
doi: 10.1021/pr070361e. Epub 2007 Dec 8.

Clustering millions of tandem mass spectra

Ari M Frank et al. J Proteome Res. 2008 Jan.

Abstract

Tandem mass spectrometry (MS/MS) experiments often generate redundant data sets containing multiple spectra of the same peptides. Clustering of MS/MS spectra takes advantage of this redundancy by identifying multiple spectra of the same peptide and replacing them with a single representative spectrum. Analyzing only the representative spectra significantly speeds up MS/MS database searches. We present an efficient clustering approach for analyzing large MS/MS data sets (over 10 million spectra) that can reduce the number of spectra submitted for further analysis by an order of magnitude. An MS/MS database search of clustered spectra produces fewer spurious hits to the database and increases the number of peptide identifications compared to regular nonclustered searches. Our open source software MS-Clustering is available for download at http://peptide.ucsd.edu or can be run online at http://proteomics.bioprojects.org/MassSpec.
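
The idea of grouping redundant spectra and keeping one representative per group can be illustrated with a minimal sketch. It is not the MS-Clustering implementation: the greedy single-pass strategy, the cosine similarity on binned peaks, the summed-intensity consensus, and the values of `threshold`, `precursor_tol`, and `bin_width` are all assumptions made for illustration only.

```python
import math
from collections import defaultdict


def bin_peaks(peaks, bin_width=1.0):
    """Map (m/z, intensity) peaks to a sparse vector keyed by m/z bin (bin width is an assumption)."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / bin_width)] += intensity
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse vectors (an assumed similarity measure)."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_spectra(spectra, threshold=0.7, precursor_tol=2.0):
    """Greedy single-pass clustering; thresholds are hypothetical, not taken from the paper.

    spectra: list of (precursor_mz, [(mz, intensity), ...]).
    Returns a list of clusters, each with member indices and a consensus vector.
    """
    clusters = []
    for idx, (precursor_mz, peaks) in enumerate(spectra):
        vec = bin_peaks(peaks)
        best, best_sim = None, threshold
        for cluster in clusters:
            # Only compare spectra whose precursor m/z values are close.
            if abs(cluster["precursor_mz"] - precursor_mz) > precursor_tol:
                continue
            sim = cosine(cluster["consensus"], vec)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append(
                {"precursor_mz": precursor_mz, "consensus": dict(vec), "members": [idx]}
            )
        else:
            best["members"].append(idx)
            # Consensus here is simply the sum of binned intensities.
            for k, x in vec.items():
                best["consensus"][k] = best["consensus"].get(k, 0.0) + x
    return clusters


# Toy data: the first two spectra are near-duplicates, the third is unrelated.
toy = [
    (500.0, [(100.0, 10.0), (200.0, 5.0)]),
    (500.2, [(100.1, 9.0), (200.2, 6.0)]),
    (800.0, [(150.0, 7.0)]),
]
result = cluster_spectra(toy)
print([c["members"] for c in result])
```

In this toy run, only two consensus vectors (instead of three spectra) would be passed on to a database search, which is the kind of reduction the abstract describes at a much larger scale.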

Figures

Figure 1
A pseudocode description of the approximate hierarchical clustering algorithm used by MS-Clustering.
Figure 2
Illustration of cluster appending. The set Clusters is a linked list where each element is a list of spectra. When the algorithm merges cluster c with a preceding cluster c′ it appends the list of spectra in cluster c to the list of spectra in cluster c′ and then removes the entry for c from the linked list of clusters.
Figure 3
Example of a cluster for the peptide TGSVDIIVTDLPFGK. A cluster of three spectra is shown along with the consensus spectrum that was created from them. For each spectrum the InsPecT score is shown, along with the number of identified b/y-ions and the percentage of the spectrum’s intensity that is explained by the peptide’s fragment ions. Only the consensus spectrum had a sufficiently high score to be positively identified in the database search using InsPecT. All spectra have a precursor charge of 2 with precursor m/z errors below 1 Da. The x-axes represent the fragments’ m/z values and the y-axes represent the intensities.
Figure 4
Fragmented clusters. Spectra of the peptide VDDPNAEDKR from two clusters that were not joined are shown. The figure contains 3 spectra from each cluster; originally, cluster I contained 6 spectra and cluster II contained 4 spectra. The x-axes represent the fragments’ m/z values and the y-axes represent the intensities.
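
The cluster appending described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the MS-Clustering source: the class names `ClusterNode` and `ClusterList` and the method names are hypothetical, and a doubly linked list is assumed so that an entry can be unlinked in constant time.

```python
class ClusterNode:
    """One entry in the linked list of clusters; holds the spectra of one cluster."""

    def __init__(self, spectra):
        self.spectra = list(spectra)
        self.prev = None
        self.next = None


class ClusterList:
    """Doubly linked list of clusters (hypothetical structure for illustration)."""

    def __init__(self):
        self.head = None
        self.tail = None

    def add(self, spectra):
        """Append a new cluster at the end of the list and return its node."""
        node = ClusterNode(spectra)
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node
        return node

    def merge_into(self, c, c_prime):
        """Append the spectra of cluster c to preceding cluster c_prime, then remove c."""
        c_prime.spectra.extend(c.spectra)
        # Unlink c from the list, fixing head/tail when c is at an end.
        if c.prev is not None:
            c.prev.next = c.next
        else:
            self.head = c.next
        if c.next is not None:
            c.next.prev = c.prev
        else:
            self.tail = c.prev
        c.prev = c.next = None

    def to_lists(self):
        """Return the clusters as plain lists of spectra, in list order."""
        out = []
        node = self.head
        while node is not None:
            out.append(node.spectra)
            node = node.next
        return out


clusters = ClusterList()
a = clusters.add(["s1", "s2"])
b = clusters.add(["s3"])
c = clusters.add(["s4", "s5"])
clusters.merge_into(c, a)   # merge c with the preceding cluster a
print(clusters.to_lists())  # [['s1', 's2', 's4', 's5'], ['s3']]
```

Unlinking the merged entry only rewires the neighbouring nodes, so its cost does not grow with the number of clusters; the spectra themselves are moved by a single list extension.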
