BMC Bioinformatics. 2007 Aug 31;8:323.
doi: 10.1186/1471-2105-8-323.

Optimization of filtering criterion for SEQUEST database searching to improve proteome coverage in shotgun proteomics

Xinning Jiang et al. BMC Bioinformatics.

Abstract

Background: In proteomic analysis, MS/MS spectra acquired by a mass spectrometer are assigned to peptides by database searching algorithms such as SEQUEST. SEQUEST characterizes each assignment of a peptide to an MS/MS spectrum with several scores, including Xcorr, Delta Cn, Sp, Rsp, and matched ion count. Filtering criteria based on several of these scores are used to separate correct identifications from random assignments. However, these filtering criteria have not been well optimized to date.

Results: In this study, we implemented a machine learning approach, a predictive genetic algorithm (GA), to optimize filtering criteria for SEQUEST database searching so as to maximize the number of identified peptides at a fixed false-discovery rate (FDR). Because the FDR was determined directly by a decoy database search scheme, the GA-based optimization approach did not require any prior knowledge of the characteristics of the data set, which represents a significant advantage over statistical approaches such as PeptideProphet. Compared with PeptideProphet, the GA-based approach achieved similar performance in distinguishing true from false assignments with only 1/10 of the processing time. Moreover, the GA-based approach can be easily extended to process other database search results, as it does not rely on any assumption about the data.

Conclusion: Our results indicate that filtering criteria should be optimized individually for different samples. The newly developed software using GA provides a convenient and fast way to create tailored optimal criteria for different proteome samples to improve proteome coverage.
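As a minimal sketch of the decoy-based filtering idea described in the abstract, the Python below applies one Xcorr/ΔCn criterion to a list of peptide identifications and estimates the FDR from the reversed-sequence hits that survive. The `PSM` record, the function names, and the FDR estimate (twice the decoy count divided by all passing hits, a common convention for composite forward/reversed databases) are assumptions made for illustration; the abstract does not give the authors' exact formula or data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    """One peptide identification (hypothetical record layout)."""
    xcorr: float
    delta_cn: float
    is_decoy: bool  # True if the match comes from a reversed (decoy) sequence


def apply_criterion(psms, xcorr_cut, delta_cn_cut):
    """Return (number of passing identifications, estimated FDR)."""
    kept = [p for p in psms
            if p.xcorr >= xcorr_cut and p.delta_cn >= delta_cn_cut]
    if not kept:
        return 0, 0.0
    n_decoy = sum(1 for p in kept if p.is_decoy)
    # Assumed convention for a composite database: each decoy hit stands
    # for one expected false hit among the forward matches as well.
    fdr = 2.0 * n_decoy / len(kept)
    return len(kept), fdr


if __name__ == "__main__":
    demo = [
        PSM(3.0, 0.20, False),
        PSM(2.5, 0.15, False),
        PSM(2.0, 0.05, True),
        PSM(1.0, 0.30, True),
    ]
    print(apply_criterion(demo, xcorr_cut=2.0, delta_cn_cut=0.1))
```

A stricter criterion removes decoy hits (lowering the estimated FDR) but also removes forward hits, which is the trade-off the optimization has to balance.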

Figures

Figure 1
Distribution of peptides identified from human liver tissue lysate by SEQUEST. A) Singly charged peptides; B) Doubly charged peptides; C) Triply charged peptides. Each data point represents a peptide identification from the composite database: a cross represents a peptide identification from a reversed sequence, while a square indicates a peptide identification from a forward sequence. The cumulative curves drawn in each graph are 1% false-discovery curves. Each point on a curve indicates a filtering criterion that yields peptide identifications with an FDR of 1%; the peptides identified by each criterion lie in the region where the Xcorr and ΔCn scores are higher than that criterion's Xcorr and ΔCn cutoffs. Graphs were drawn using the Speed Model by Origin 7.5 with 5000 max points per curve, and the three raw graphs with all data points are also provided [see Additional file 1].
Figure 2
Relationship between the number of peptide identifications and the Xcorr cutoffs of the different criteria that led to these identifications for human liver tissue lysate at the same FDR (<1%). To achieve an FDR of less than 1%, the ΔCn cutoff for each criterion changes with the Xcorr cutoff. Curves for the three charge states were drawn separately: A) for singly charged peptides, B) for triply charged peptides and C) for doubly charged peptides. D) is the zoomed curve for singly charged peptides.
Figure 3
Dependence of fitness on generations for doubly charged peptides. The fitness of each individual (criterion) is the number of peptide identifications that pass this criterion. The fitness of the fittest individual in each generation is shown as black dots.
Figure 4
Overlap of peptides identified by SFOER and PeptideProphet for human liver tissue lysate. The numbers of peptide identifications by one or both algorithms are indicated, e.g., 27,272 peptides are identified by both algorithms (intersection).
Figure 5
Evaluation of the classification performance of SFOER and PeptideProphet with a standard protein mixture. A) Number of correct and incorrect peptide identifications by SFOER and PeptideProphet under different FDRs, where an incorrect peptide identification is a peptide assignment from the forward yeast database, while a correct one is from the known standard proteins and trypsin. B) Predicted and observed FDRs. The observed FDR is calculated as the number of peptide identifications not from standard proteins over the total number of peptide identifications, while the predicted FDR is calculated using equation (1). Observed FDRs for SFOER are represented by open circles, while observed FDRs for PeptideProphet are represented by filled circles.
Figure 6
Flowchart of the optimization procedure using the genetic algorithm. It starts with the initialization phase, which randomly generates the initial population P0. The population of the next generation, Pi+1, is obtained by applying genetic operators to the current population Pi. The fitness of each individual (criterion) is evaluated as the number of filtered peptides. Evolution continues until a terminating condition is reached. The selection, mutation, and crossover operators are used in the genetic algorithm.
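The loop in the Figure 6 caption (random initial population, fitness evaluation, then selection, crossover, and mutation until a stopping condition) can be sketched roughly as follows. This is an illustrative toy, not the authors' SFOER software: the two-gene encoding (Xcorr cutoff, ΔCn cutoff), tournament selection, gene-swap crossover, Gaussian mutation, elitism, the fixed generation count, all parameter values, and the decoy-based FDR estimate are assumptions chosen to keep the example short.

```python
import random


def fitness(criterion, psms, max_fdr=0.01):
    """Number of identifications passing the criterion, or 0 if FDR is too high.

    Each PSM is a tuple (xcorr, delta_cn, is_decoy); this layout is hypothetical.
    """
    xcorr_cut, dcn_cut = criterion
    kept = [p for p in psms if p[0] >= xcorr_cut and p[1] >= dcn_cut]
    if not kept:
        return 0
    n_decoy = sum(1 for p in kept if p[2])
    fdr = 2.0 * n_decoy / len(kept)  # assumed composite-database estimate
    return len(kept) if fdr <= max_fdr else 0


def tournament(pop, scores, rng, k=3):
    """Selection: pick the fittest of k randomly chosen individuals."""
    contenders = rng.sample(range(len(pop)), k)
    return pop[max(contenders, key=lambda i: scores[i])]


def optimize(psms, pop_size=30, generations=40, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    # Initialization: random criteria form the population P0.
    pop = [(rng.uniform(0.0, 5.0), rng.uniform(0.0, 0.5))
           for _ in range(pop_size)]
    best, best_score = None, -1
    for _ in range(generations):  # terminating condition: fixed generation count
        scores = [fitness(c, psms) for c in pop]
        top = max(range(pop_size), key=lambda i: scores[i])
        if scores[top] > best_score:
            best, best_score = pop[top], scores[top]
        children = [best]  # elitism: carry the best criterion forward
        while len(children) < pop_size:
            a = tournament(pop, scores, rng)
            b = tournament(pop, scores, rng)
            child = (a[0], b[1])  # crossover: Xcorr gene from a, dCn gene from b
            if rng.random() < mutation_rate:  # mutation: small Gaussian nudge
                child = (max(0.0, child[0] + rng.gauss(0.0, 0.2)),
                         max(0.0, child[1] + rng.gauss(0.0, 0.02)))
            children.append(child)
        pop = children
    return best, best_score


if __name__ == "__main__":
    # Synthetic data: 50 forward hits with high scores, 50 decoy hits with low scores.
    data = [(3.0, 0.30, False)] * 50 + [(1.0, 0.05, True)] * 50
    print(optimize(data))
```

Assigning zero fitness to any criterion that exceeds the FDR limit is one simple way to enforce the fixed-FDR constraint; a graded penalty would be an equally plausible design.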
