J Proteome Res. 2008 Aug;7(8):3382-95.
doi: 10.1021/pr800140v. Epub 2008 Jun 18.

Separating the wheat from the chaff: unbiased filtering of background tandem mass spectra improves protein identification

Magno Junqueira et al. J Proteome Res. 2008 Aug.

Abstract

Only a small fraction of the spectra acquired in LC-MS/MS runs match peptides from target proteins in database searches. The remaining spectra, operationally termed background, originate from a variety of poorly controlled sources and affect the throughput and confidence of database searches. Here, we report an algorithm and its software implementation that rapidly removes background spectra, regardless of their precise origin. The method estimates the dissimilarity distance between screened MS/MS spectra and unannotated spectra from a partially redundant background library compiled from several control and blank runs. Filtering MS/MS queries enhanced the protein identification capacity when searches lacked spectrum-to-sequence matching specificity. In sequence-similarity searches, it reduced the number of orphan hits, which were not explicitly related to background protein contaminants and required manual validation, by 30-fold on average. Removing high-quality background MS/MS spectra, while preserving the genuine spectra from target proteins in the data set, decreased the false positive rate of stringent database searches and improved the identification of low-abundance proteins.

Figures

Figure 1
Comparison of a queried MS/MS spectrum (peaks up) and a background spectrum (peaks down) acquired from precursors with matching m/z and charge state. In the left-hand panel, there are almost no unmatched high-intensity peaks; therefore, the queried spectrum is most likely background. In the right-hand panel, the queried spectrum contains meaningful nonoverlapping fragments and, despite a pronounced background signature (ions within m/z 450–550 and above m/z 800), should not be removed from the data set.
Figure 2
Cumulative distribution of the scores of best matches of MS/MS spectra against the background library. The test empirical distribution for a data set of high-quality MS/MS spectra is shown as black circles and its Weibull approximation as a smooth red line. The x-axis represents the shortest distance obtained for each of the candidate spectra. The y-axis represents the probability of obtaining an equal or smaller distance at random.
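The threshold distance for a chosen p-value can be obtained by inverting a fitted Weibull distribution. The sketch below assumes a standard two-parameter Weibull form. The function names and the shape and scale values are hypothetical, and because eq 2 of the paper is not reproduced on this page, the authors' parameterization may differ.

```python
import math


def weibull_cdf(distance, shape, scale):
    """Probability of observing an equal or smaller distance at random,
    under an assumed two-parameter Weibull model."""
    if distance <= 0:
        return 0.0
    return 1.0 - math.exp(-((distance / scale) ** shape))


def threshold_distance(p_value, shape, scale):
    """Invert the CDF: return the distance D_p with CDF(D_p) == p_value."""
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    return scale * (-math.log(1.0 - p_value)) ** (1.0 / shape)


# Hypothetical fit parameters, for illustration only (not from the paper).
shape, scale = 2.0, 0.5
d_p = threshold_distance(0.01, shape, scale)
print(d_p, weibull_cdf(d_p, shape, scale))
```

In practice the shape and scale would come from fitting the empirical distribution of best-match distances, and a smaller p-value gives a smaller threshold distance.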
Figure 3
Workflow for filtering MS/MS spectra against a background library, as implemented in the EagleEye software. For each spectrum i in the submitted pool, the software first identified the spectrum (or spectra) j in the background library whose precursor mass (within the specified mass tolerance ΔM) and charge matched. Then, the dissimilarity distance Dij was computed between spectra i and j by considering the intensities of unmatched fragment peaks with mass tolerance Δm. To minimize the contribution of chemical noise, the unmatched intensities were taken with weights of 2, 1 and 4 for m/z ranges A, B and C, respectively. Dij was then compared with the threshold distance Dp computed for the user-defined p-value according to eq 2. If Dij exceeded Dp, spectra i and j were judged to be significantly different and the probed spectrum i was declared nonbackground, even if the two spectra shared some overlapping fragment peaks. Otherwise, spectrum i was considered background. Note that the algorithm does not rely on pairwise correlation of abundances of fragment peaks with overlapping m/z.
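The workflow in this caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the EagleEye implementation: the `Spectrum` class, the boundaries of the m/z ranges A, B and C, and the normalization of the distance are all hypothetical, since the caption gives only the weights (2, 1 and 4). Spectra with no precursor match in the library are assumed here to be nonbackground.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Spectrum:
    precursor_mass: float
    charge: int
    peaks: List[Tuple[float, float]]  # (fragment m/z, intensity)


# Hypothetical boundaries for ranges A, B, C; only the weights are from the caption.
WEIGHTED_RANGES = [(0.0, 400.0, 2.0), (400.0, 900.0, 1.0), (900.0, float("inf"), 4.0)]


def weight_for(mz):
    for low, high, weight in WEIGHTED_RANGES:
        if low <= mz < high:
            return weight
    return 1.0


def dissimilarity(query, background, fragment_tol):
    """Weighted share of query intensity with no background peak within
    fragment_tol (assumed normalization; peak abundances are not correlated)."""
    background_mzs = [mz for mz, _ in background.peaks]
    total = unmatched = 0.0
    for mz, intensity in query.peaks:
        weighted = weight_for(mz) * intensity
        total += weighted
        if not any(abs(mz - b) <= fragment_tol for b in background_mzs):
            unmatched += weighted
    return unmatched / total if total > 0 else 0.0


def is_background(query, library, precursor_tol, fragment_tol, d_p):
    """True if any precursor-matched library spectrum lies within distance d_p."""
    for candidate in library:
        if candidate.charge != query.charge:
            continue
        if abs(candidate.precursor_mass - query.precursor_mass) > precursor_tol:
            continue
        if dissimilarity(query, candidate, fragment_tol) <= d_p:
            return True
    return False


library = [Spectrum(1000.50, 2, [(200.1, 50.0), (500.2, 80.0)])]
noise = Spectrum(1000.505, 2, [(200.1, 40.0), (500.3, 90.0)])
genuine = Spectrum(1000.505, 2, [(200.1, 10.0), (950.4, 100.0)])
print(is_background(noise, library, 0.01, 0.6, 0.2))
print(is_background(genuine, library, 0.01, 0.6, 0.2))
```

Declaring a spectrum background as soon as one library match falls within the threshold is equivalent to testing the shortest distance over all precursor-matched library spectra.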
Figure 4
Base peak traces of LC-MS/MS runs of a typical control in-gel digest (A) and a blank injection of 4 µL of 0.1% TFA (sample loading buffer) (B). Only multiply charged ions were selected for MS/MS in DDA experiments. The control in-gel digest produced 2087 MS/MS spectra: 29 (4+), 357 (3+) and 1701 (2+). The blank injection produced 66 MS/MS spectra: 4 (3+) and 62 (2+). The assumed charges of the precursors are given in parentheses. MASCOT searches identified only trypsin and a variety of keratins.
Figure 5
Evaluation of the filtering efficiency using a model data set. Bars represent the percentage of removed spectra with a given peptide ion score and p-value. The exact number of removed spectra is presented at each bar; bars without numbers indicate zero values. The total number of background spectra in the data set was 1659, and each peptide ion score bin contained 100 spectra. In panel A, mass tolerance was 0.01 Da for precursor ions and 0.6 Da for fragment ions. In panel B, p-value was fixed at 0.01 and precursor mass tolerance (in Da) varied, whereas fragment mass tolerance was 0.6 Da.
Figure 6
Filtering of a model spectra data set against a rich proteomics background library of 256 806 MS/MS spectra. Bars represent the percentage of removed spectra with a given peptide ion score and p-value. The exact number of removed spectra is presented at each bar; bars without numbers indicate zero values. The model data set contained 1659 background spectra from a separate control LC-MS/MS run and 80 nonbackground spectra per peptide ion score bin. The precursor mass tolerance was 0.01 Da; the fragment mass tolerance was 0.6 Da.
Figure 7
MS/MS spectra removed by EagleEye filtering from the model data set (Figure 5A) at a p-value of 0.01 were analyzed in several steps of stringent and sequence-similarity database searches. Data processing started with stringent (MASCOT) database searches with and without enzyme cleavage specificity, and matched spectra were removed. The remaining spectra were interpreted de novo and the sequence candidates were submitted to MS BLAST searches as described. MASCOT searches in steps I, II and III hit only trypsin and keratin peptides. Step IV accounted only for spectra whose de novo interpretation produced candidate peptides confidently aligned to trypsin and keratin sequences. In steps V and VI, candidate sequences were produced by de novo interpretation yet were not confidently matched by MS BLAST. A PepNovo score below 6 usually indicates poor-quality sequence predictions. De novo interpretation of spectra at step VII failed to produce any sequence candidates. The analyzed data set comprised, in total, 1489 background MS/MS spectra acquired from multiply charged precursors.
Figure 8
Representative diagram of the distribution of MS BLAST hits obtained in searches with the raw (unfilled bars) and filtered (filled bars) queries. In the sample, both MASCOT and MS BLAST searches produced cross-species hits to actins from various plant species. True hits bars stand for actin and related entries; K-T bars, hits annotated as trypsins and keratins from various species; “Orphan” hits, statistically confident hits, unrelated to actins and not explicitly annotated as trypsins and keratins. The unprocessed data set contained 1821 MS/MS spectra, from which EagleEye filtering under p = 0.01 removed 1117 spectra. MS BLAST searches with raw and filtered queries took 38 and 8 min, respectively.
Figure 9
Cumulative distributions of peptide ion scores obtained in database searches of 10 independent LC-MS/MS runs against MSDB (A) and decoy (B) databases. Data points indicate the number of matched peptides with the given or lower score before (filled squares) and after (filled triangles) EagleEye filtering. Panel A presents the distribution of peptides matched to plant protein entries only. Note that the distribution of genuine hits was only slightly affected at the low-scoring end, whereas at the same scores, decoy hits were observed in substantially lower numbers because of the massive removal of background MS/MS spectra (B).
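The cumulative counts plotted in such a figure (the number of matched peptides with a given or lower score) can be computed as in the sketch below. The function name and the score lists are toy values invented for illustration; they are not data from the paper.

```python
from bisect import bisect_right


def cumulative_counts(scores, score_cutoffs):
    """For each cutoff, count the hits whose score is at or below it."""
    ordered = sorted(scores)
    return [bisect_right(ordered, cutoff) for cutoff in score_cutoffs]


# Toy peptide ion scores of decoy hits before and after filtering.
decoy_before = [12, 15, 15, 22, 31, 40]
decoy_after = [15, 22, 31, 40]
cutoffs = [15, 30, 45]

print(cumulative_counts(decoy_before, cutoffs))  # [3, 4, 6]
print(cumulative_counts(decoy_after, cutoffs))   # [1, 2, 4]
```

Comparing the two lists at the same cutoffs mirrors how the before and after curves are read against each other.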
Figure 10
Cumulative distribution of peptide ion scores (x-axes) of decoy database hits (y-axes) before and after filtering of MS/MS data sets, acquired from immunoaffinity isolation experiments. Filtering was performed against a library of 256 806 tandem mass spectra obtained in 63 independent control experiments using unrelated baits.
