Mol Cell Proteomics. 2020 Jan;19(1):198-208.
doi: 10.1074/mcp.TIR119.001752. Epub 2019 Nov 15.

Assessing Protein Sequence Database Suitability Using De Novo Sequencing

Richard S Johnson et al. Mol Cell Proteomics. 2020 Jan.

Abstract

The analysis of samples from unsequenced and/or understudied species as well as samples where the proteome is derived from multiple organisms poses two key questions. The first is whether the proteomic data obtained from an unusual sample type even contains peptide tandem mass spectra. The second question is whether an appropriate protein sequence database is available for proteomic searches. We describe the use of automated de novo sequencing for evaluating both the quality of a collection of tandem mass spectra and the suitability of a given protein sequence database for searching that data. Applications of this method include the proteome analysis of closely related species, metaproteomics, and proteomics of extinct organisms.

Keywords: Algorithms; Caenorhabditis elegans; data evaluation; de novo sequencing; mass spectrometry; metaproteomics; peptides*; protein identification; quality control and metrics; sequencing ms; tandem mass spectrometry.

Figures

Fig. 1.
Mass-based alignment. A, The alignment in this panel is for illustrative purposes and shows some common de novo sequencing errors. One error is the inability to delineate the two N-terminal amino acids, and in this example they are reversed. When the sequencing fragment ion between the third and fourth residues (A and G) is absent, the pair could be construed as Q at that position (A plus G has the same mass as Q). The reverse problem might occur in the presence of an extra ion (e.g. from a co-isolated peptide), and is illustrated by GG in the de novo sequence where the database sequence actually has N. Leucine and isoleucine usually cannot be differentiated based on mass. Finally, the combined mass of D plus L is the same as that of V plus E. In this mock example, the de novo sequence is quite different from the database sequence; however, a mass-based alignment suggests 100% identity. In addition to this mock alignment, real examples are shown (B–D) where the top de novo sequence was manually aligned with the top database sequence. These illustrate cases where 75%, 50%, and 9% of the amino acid masses are aligned, respectively. Labeled spectra for these alignments are shown in supplemental Fig. S2.
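The mass equivalences described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two peptide sequences, the function names, the 0.01 Da tolerance, and the rule that an aligned block may span at most two residues on either side are all assumptions made for illustration. The residue masses are standard monoisotopic values.

```python
from itertools import accumulate

# Standard monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def prefix_masses(seq):
    """Cumulative residue masses, starting with 0.0 for the empty prefix."""
    return [0.0] + list(accumulate(MONO[aa] for aa in seq))

def mass_aligned_fraction(de_novo, database, tol=0.01, max_block=2):
    """Fraction of database residues lying in blocks whose mass matches the
    de novo sequence. tol and max_block are illustrative assumptions."""
    p, q = prefix_masses(de_novo), prefix_masses(database)
    cuts, i, j = [], 0, 0
    # Prefix masses increase strictly, so a two-pointer walk finds every
    # position where both sequences have accumulated the same mass.
    while i < len(p) and j < len(q):
        diff = p[i] - q[j]
        if abs(diff) <= tol:
            cuts.append((i, j))
            i += 1
            j += 1
        elif diff < 0:
            i += 1
        else:
            j += 1
    aligned = 0
    for (i0, j0), (i1, j1) in zip(cuts, cuts[1:]):
        # Count a block only if it is short on both sides; otherwise two
        # unrelated peptides of equal total mass would align trivially.
        if i1 - i0 <= max_block and j1 - j0 <= max_block:
            aligned += j1 - j0
    return aligned / len(database)

def positional_identity(a, b):
    """Plain residue-by-residue identity for equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(b)

# Hypothetical peptides built from the error types named in the caption:
# swapped N-terminal pair, AG read as Q, N read as GG, L read as I, DL read as VE.
database_seq = "TSAGNLDLK"
de_novo_seq = "STQGGIVEK"
```

With these made-up sequences, `positional_identity` finds only two matching positions out of nine, whereas `mass_aligned_fraction` treats every block as aligned, which mirrors the contrast the mock alignment is meant to show.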
Fig. 2.
Effect of mass accuracy and dissociation type on de novo sequencing using Novor. Using a human cell line tryptic digest, data was acquired on a Lumos hybrid mass spectrometer using quadrupole isolation with beam CID and orbitrap MS2, resonance CID and orbitrap MS2, resonance CID and linear ion trap MS2, or beam CID and linear ion trap MS2. Data was searched against a human FASTA file using an FDR cutoff of 0.001 and a Comet E-value cutoff of 0.001. The resulting PSMs were assumed correct and were compared with the Novor results. Panel (A) shows the number of de novo sequences versus precision, and panel (B) shows the precision-recall curve for the same data.
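As a rough illustration of how a precision-recall curve like the one in panel (B) can be computed, the sketch below ranks de novo results by score and accumulates precision and recall down the ranking. The input format (a score paired with a flag saying whether the de novo result agrees with the database PSM assumed correct), the function name, and the choice to define recall relative to all agreeing results are assumptions; the caption does not state the exact definitions used.

```python
def precision_recall_curve(results):
    """results: iterable of (score, is_correct) pairs, one per de novo result.
    Returns a list of (score_threshold, precision, recall) tuples, one for
    each cutoff placed just below a result in the score-ranked list."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    total_correct = sum(1 for _, ok in ranked if ok)
    curve = []
    correct = 0
    for n, (score, ok) in enumerate(ranked, start=1):
        if ok:
            correct += 1
        precision = correct / n  # correct among results kept so far
        recall = correct / total_correct if total_correct else 0.0
        curve.append((score, precision, recall))
    return curve

# Hypothetical scores and correctness flags, for illustration only.
example = [(90, True), (80, True), (70, False), (60, True)]
curve = precision_recall_curve(example)
```

Tied scores are not treated specially here; a more careful version would emit one point per distinct score.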
Fig. 3.
De novo analysis of FASTA file quality. Using Novor, de novo sequences are derived for all tandem mass spectra. De novo sequences of suitably high quality are appended to create a single large protein, which is then itself appended to the FASTA file under study. A database search using Comet creates a pep.xml file output. Instances where a de novo sequence ranks slightly higher than a FASTA-derived sequence are re-ranked to put the FASTA peptide on top. PeptideProphet then establishes the FDR. The fraction of unique peptides matching the original FASTA file out of all unique sequences, including de novo sequences, represents the database quality metric.
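Two steps of this workflow, appending the de novo sequences as a single large protein and computing the final metric, can be sketched in Python as follows. This is a simplified illustration rather than the published pipeline: the header name `DENOVO_CONCAT`, the function names, and the `(peptide, source)` input format are hypothetical, and the Comet search, the re-ranking step, and PeptideProphet filtering are assumed to have been done already.

```python
import textwrap

def build_de_novo_entry(peptides, header="DENOVO_CONCAT"):
    """Concatenate de novo peptides into one FASTA entry (hypothetical header)
    that can be appended to the FASTA file under study."""
    sequence = "".join(peptides)
    lines = textwrap.wrap(sequence, 60)  # conventional 60-residue line width
    return ">" + header + "\n" + "\n".join(lines) + "\n"

def database_suitability(accepted_psms):
    """accepted_psms: iterable of (peptide, source) pairs that passed the FDR
    cutoff, where source is "fasta" or "denovo" (assumed labels).
    Returns unique FASTA-matching peptides over all unique peptides."""
    fasta = {pep for pep, src in accepted_psms if src == "fasta"}
    # A peptide that also matches the original FASTA file counts as FASTA.
    de_novo_only = {pep for pep, src in accepted_psms if src == "denovo"} - fasta
    total = len(fasta) + len(de_novo_only)
    return len(fasta) / total if total else 0.0

# Hypothetical accepted PSMs, for illustration only.
psms = [
    ("PEPTIDEK", "fasta"),
    ("PEPTIDEK", "fasta"),   # repeated PSM, counted once
    ("SAMPLER", "fasta"),
    ("NQVELK", "denovo"),
]
score = database_suitability(psms)   # 2 of 3 unique peptides match the FASTA file
entry = build_de_novo_entry(["NQVELK", "GGSTR"])
```

A value near 1 would suggest that the database explains most confidently identified peptides, while a low value would suggest that many good spectra are better explained by de novo sequences absent from the database.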
Fig. 4.
Searching LC-MS/MS data from human tryptic peptides against various FASTA files. De novo sequences were appended to FASTA files from various chordates, a tapeworm, and a human FASTA file that had been shuffled (keeping the original tryptic cleavage sites but shuffling intervening sequences). Using a PeptideProphet FDR of 0.01 and a maximum Comet E-value of 0.01, the fraction of unique peptides that best matched FASTA file sequences is shown.
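The shuffled control described here, in which tryptic cleavage sites stay in place and the residues between them are shuffled, can be sketched in Python as below. The caption does not say how the shuffling was implemented, so this is only one plausible reading: the function name is hypothetical, cleavage is assumed to occur after every K or R (no proline rule), and a seeded random generator is used for reproducibility.

```python
import random
import re

def shuffle_between_cleavage_sites(protein, rng):
    """Shuffle residues within each tryptic peptide while keeping every
    K and R at its original position. rng is a random.Random instance."""
    # Each piece is a run of non-K/R residues ending in K or R, or a final
    # C-terminal run that has no K/R.
    pieces = re.findall(r"[^KR]*[KR]|[^KR]+$", protein)
    shuffled = []
    for piece in pieces:
        if piece[-1] in "KR":
            body, site = list(piece[:-1]), piece[-1]
        else:
            body, site = list(piece), ""
        rng.shuffle(body)  # in-place shuffle of the intervening residues
        shuffled.append("".join(body) + site)
    return "".join(shuffled)

# Hypothetical protein sequence, for illustration only.
original = "MAGTLKSDEVRPPLLQWKAC"
decoy = shuffle_between_cleavage_sites(original, random.Random(0))
```

Because peptide lengths and compositions are preserved, the shuffled peptides keep the same precursor masses as the originals, so any matches to them reflect chance agreement of fragment ions rather than true sequence identity.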
Fig. 5.
Comparison of FASTA files used to search sea water metaproteomics samples. Using the FASTA files and raw data from May et al. (17), the de novo approach was used to assess FASTA file quality. Shown are the percentages of peptide-spectrum matches (PSMs) for proteomic data obtained from the Bering Strait (blue) and the Chukchi Sea (orange). For each location, the error bars represent 3-fold standard deviation from three technical replicates when searching against location-specific metapeptide and metagenome FASTA files, plus a non-redundant environmental sequence database from NCBI (env_nr).

References

    1. Eng J. K., Searle B. C., Clauser K. R., and Tabb D. L. (2011) A face in the crowd: recognizing peptides through database search. Mol. Cell. Proteomics 10, 1–9
    2. Timmins-Schiffman E., May D. H., Mikan M., Riffle M., Frazar C., Harvey H. R., Noble W. S., and Nunn B. L. (2017) Critical decisions in metaproteomics: Achieving high confidence protein annotations in a sea of unknowns. ISME J. 11, 309–314
    3. Cilia M., Tamborindeguy C., Rolland M., Howe K., Thannhauser T. W., and Gray S. (2011) Tangible benefits of the aphid Acyrthosiphon pisum genome sequencing for aphid proteomics: Enhancements in protein identification and data validation for homology-based proteomics. J. Insect Physiol. 57, 179–190
    4. Ruggles K. V., Krug K., Wang X., Clauser K. R., Wang J., Payne S. H., Fenyo D., Zhang B., and Mani D. R. (2017) Methods, tools and current perspectives in proteogenomics. Mol. Cell. Proteomics 16, 959–981
    5. Ma B., and Johnson R. (2012) De novo sequencing and homology searching. Mol. Cell. Proteomics 11, 1–16