Mol Cell Proteomics. 2020 Jan;19(1):198-208.

doi: 10.1074/mcp.TIR119.001752. Epub 2019 Nov 15.

Assessing Protein Sequence Database Suitability Using De Novo Sequencing

Richard S Johnson¹, Brian C Searle², Brook L Nunn³, Jason M Gilmore³, Molly Phillips⁴, Chris T Amemiya⁵, Michelle Heck⁶, Michael J MacCoss³

Affiliations

¹ Department of Genome Sciences, University of Washington, Seattle, Washington. Electronic address: rj8@uw.edu.
² Institute for Systems Biology, Seattle, Washington; Proteome Software, Portland, Oregon.
³ Department of Genome Sciences, University of Washington, Seattle, Washington.
⁴ Department of Biology, University of Washington, Seattle, Washington; School of Natural Sciences, University of California, Merced, California.
⁵ School of Natural Sciences, University of California, Merced, California.
⁶ United States Department of Agriculture, Agricultural Research Service, Ithaca, New York.

PMID: 31732549
PMCID: PMC6944239
DOI: 10.1074/mcp.TIR119.001752

Richard S Johnson et al. Mol Cell Proteomics. 2020 Jan.

Abstract

The analysis of samples from unsequenced and/or understudied species as well as samples where the proteome is derived from multiple organisms poses two key questions. The first is whether the proteomic data obtained from an unusual sample type even contains peptide tandem mass spectra. The second question is whether an appropriate protein sequence database is available for proteomic searches. We describe the use of automated de novo sequencing for evaluating both the quality of a collection of tandem mass spectra and the suitability of a given protein sequence database for searching that data. Applications of this method include the proteome analysis of closely related species, metaproteomics, and proteomics of extinct organisms.

Keywords: Algorithms; Caenorhabditis elegans; data evaluation; de novo sequencing; mass spectrometry; metaproteomics; peptides; protein identification; quality control and metrics; sequencing ms; tandem mass spectrometry.

Figures

**Fig. 1.**
**Mass-based alignment.** A, The alignment in this panel is for illustrative purposes and shows some common *de novo* sequencing errors. One error is the inability to delineate the two N-terminal amino acids, and in this example they are reversed. Absence of a sequencing fragment ion between the third and fourth residues (A and G) could be construed as Q in that sequence position (A and G have the same mass as Q). The reverse problem might occur in the presence of an extra ion (*e.g.* from a co-isolated peptide), and is illustrated by GG in the *de novo* sequence, when the database sequence at that position is actually N. Leucine and isoleucine usually cannot be differentiated based on mass. Finally, the combined mass of D plus L is the same as V plus E. In this mock example, the *de novo* sequence is quite different from the database sequence; however, a mass-based alignment suggests 100% identity. In addition to this mock alignment, real examples are shown (*B–D*) where the top *de novo* sequence was manually aligned with the top database sequence. These illustrate cases where 75%, 50%, and 9% of the amino acid masses are aligned, respectively. Labeled spectra for these alignments are shown in supplemental Fig. S2.
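
The mass coincidences named in this caption can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' alignment procedure: the function names and the 0.001 Da tolerance are assumptions, and the table holds standard monoisotopic residue masses for just the amino acids the caption mentions.

```python
# Monoisotopic residue masses in Da, limited to the residues named in the caption.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "E": 129.04259,
}


def segment_mass(segment):
    """Sum the residue masses of a short stretch of amino acids."""
    return sum(RESIDUE_MASS[aa] for aa in segment)


def same_mass(seg_a, seg_b, tol=0.001):
    """True if two stretches are indistinguishable by mass within tol (assumed value)."""
    return abs(segment_mass(seg_a) - segment_mass(seg_b)) <= tol


if __name__ == "__main__":
    # Pairs taken from the caption: AG vs Q, GG vs N, DL vs VE, L vs I.
    for a, b in [("AG", "Q"), ("GG", "N"), ("DL", "VE"), ("L", "I")]:
        print(a, b, same_mass(a, b))
```

A mass-based alignment treats each of these pairs as a match, which is why the mock *de novo* sequence can differ in letters from the database sequence and still align completely.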

**Fig. 2.**
**Effect of mass accuracy and dissociation type on *de novo* sequencing using Novor.** Using a human cell line tryptic digest, data were acquired on a Lumos hybrid mass spectrometer using quadrupole isolation with beam CID and orbitrap MS2, resonance CID and orbitrap MS2, resonance CID and linear ion trap MS2, or beam CID and linear ion trap MS2. Data were searched against a human FASTA file. PSMs passing an FDR cutoff of 0.001 and a Comet E-value cutoff of 0.001 were assumed correct and compared with the Novor results. Panel (A) shows the number of *de novo* sequences *versus* precision, and panel (B) shows the precision-recall curve for the same data.
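
One way to build a curve like the one in panel (B) is to rank the *de novo* results by score and accumulate precision and recall down the list. This is a minimal sketch under stated assumptions, not the paper's evaluation code: the input format, the function name, and the definition of recall (correct sequences above the threshold divided by all spectra with a trusted database PSM) are hypothetical, and how a *de novo* sequence is judged "correct" is left to the caller.

```python
def precision_recall_curve(results):
    """Return (score, precision, recall) tuples, one per score threshold.

    results: list of (score, is_correct) pairs, one per spectrum that has a
    trusted database PSM. is_correct says whether the de novo sequence agreed
    with that PSM (the agreement criterion is an assumption left to the caller).
    """
    total = len(results)
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    curve = []
    correct = 0
    for n, (score, is_correct) in enumerate(ranked, start=1):
        correct += bool(is_correct)
        # Precision: correct among those accepted so far.
        # Recall: correct among all spectra with a trusted PSM.
        curve.append((score, correct / n, correct / total))
    return curve


if __name__ == "__main__":
    toy = [(90, True), (80, True), (70, False), (60, True)]
    for point in precision_recall_curve(toy):
        print(point)
```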

**Fig. 3.**
**De novo analysis of FASTA file quality.** Using Novor, *de novo* sequences are derived for all tandem mass spectra. *De novo* sequences of suitably high quality are appended to create a single large protein, which is then itself appended to the FASTA file under study. A database search using Comet creates a pep.xml file output. Instances where a *de novo* sequence ranks slightly higher than a FASTA-derived sequence are re-ranked to put the FASTA peptide on top. PeptideProphet then establishes the FDR. The fraction of unique peptides matching the original FASTA file out of all unique sequences, including *de novo* sequences, represents the database quality metric.
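
The final step of this workflow, the database quality metric, reduces to a ratio over unique peptides. The sketch below assumes the FDR-filtered results have already been reduced to (peptide, source) pairs, where source is either `"fasta"` or `"denovo"`. That input format and the function name are hypothetical and do not reflect the pep.xml parsing or re-ranking performed by the actual pipeline.

```python
def database_suitability(psms):
    """Fraction of unique peptides that matched the original FASTA file.

    psms: iterable of (peptide, source) pairs that passed the FDR threshold;
    source is "fasta" or "denovo" (labels assumed for this sketch).
    """
    best_source = {}
    for peptide, source in psms:
        # A peptide seen from both sources is counted as a FASTA match,
        # mirroring the caption's preference for the FASTA peptide.
        if source == "fasta" or peptide not in best_source:
            best_source[peptide] = source
    if not best_source:
        raise ValueError("no peptides passed the FDR threshold")
    n_fasta = sum(1 for s in best_source.values() if s == "fasta")
    return n_fasta / len(best_source)


if __name__ == "__main__":
    toy = [
        ("PEPTIDEK", "fasta"), ("PEPTIDEK", "fasta"),
        ("SAMPLER", "fasta"), ("ELVISK", "denovo"), ("NOVELK", "denovo"),
    ]
    print(database_suitability(toy))
```

In this toy input, two of the four unique peptides come from the FASTA file, so the metric is 0.5. A value near 1 would suggest the database explains most of the identifiable spectra.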

**Fig. 4.**
**Searching LC-MS/MS data from human tryptic peptides against various FASTA files.** *De novo* sequences were appended to FASTA files from various chordates, a tapeworm, and a human FASTA file that had been shuffled (keeping the original tryptic cleavage sites but shuffling intervening sequences). Using a PeptideProphet FDR of 0.01 and a maximum Comet E-value of 0.01, the fraction of unique peptides that best matched FASTA file sequences is shown.
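
The shuffled control described in this caption can be sketched as follows. This is an illustration, not the authors' tool: it assumes a simplified trypsin rule (cleavage after every K or R, with no proline exception), and the function name and fixed seed are hypothetical.

```python
import random
import re


def shuffle_between_cleavage_sites(protein, seed=0):
    """Shuffle residues between tryptic sites, leaving every K and R in place."""
    rng = random.Random(seed)
    # Split into pieces that end in K or R, plus any C-terminal remainder.
    pieces = re.findall(r"[^KR]*[KR]|[^KR]+$", protein)
    shuffled = []
    for piece in pieces:
        if piece[-1] in "KR":
            body, site = piece[:-1], piece[-1]
        else:
            body, site = piece, ""
        residues = list(body)
        rng.shuffle(residues)  # only the intervening residues move
        shuffled.append("".join(residues) + site)
    return "".join(shuffled)


if __name__ == "__main__":
    print(shuffle_between_cleavage_sites("MAGICKPEPTIDERSEQ"))
```

Because the cleavage sites stay fixed, each shuffled tryptic peptide keeps the length and mass of the original, so the decoy database has the same precursor-mass distribution but different sequences.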

**Fig. 5.**
**Comparison of FASTA files used to search sea water metaproteomics samples.** Using the FASTA files and raw data from May *et al.* (17), the *de novo* approach was used to assess FASTA file quality. Shown are the percentages of peptide-spectrum matches (PSMs) for proteomic data obtained from the Bering Strait (blue) and the Chukchi Sea (orange). For each location, the error bars represent 3-fold standard deviation from three technical replicates when searching against location-specific metapeptide and metagenome FASTA files, plus a non-redundant environmental sequence database from NCBI (env_nr).

References

1. Eng J. K., Searle B. C., Clauser K. R., and Tabb D. L. (2011) A face in the crowd: recognizing peptides through database search. Mol. Cell. Proteomics 10, 1–9
2. Timmins-Schiffman E., May D. H., Mikan M., Riffle M., Frazar C., Harvey H. R., Noble W. S., and Nunn B. L. (2017) Critical decisions in metaproteomics: Achieving high confidence protein annotations in a sea of unknowns. ISME J. 11, 309–314
3. Cilia M., Tamborindeguy C., Rolland M., Howe K., Thannhauser T. W., and Gray S. (2011) Tangible benefits of the aphid Acyrthosiphon pisum genome sequencing for aphid proteomics: Enhancements in protein identification and data validation for homology-based proteomics. J. Insect Physiol. 57, 179–190
4. Ruggles K. V., Krug K., Wang X., Clauser K. R., Wang J., Payne S. H., Fenyo D., Zhang B., and Mani D. R. (2017) Methods, tools and current perspectives in proteogenomics. Mol. Cell. Proteomics 16, 959–981
5. Ma B., and Johnson R. (2012) De novo sequencing and homology searching. Mol. Cell. Proteomics 11, 1–16
