Lost and Found: Re-searching and Re-scoring Proteomics Data Aids Genome Annotation and Improves Proteome Coverage

Patrick Willems¹, Igor Fijalkowski¹, Petra Van Damme²

Affiliations

¹ Department of Biochemistry and Microbiology, Ghent University, Ghent, Belgium.
² Department of Biochemistry and Microbiology, Ghent University, Ghent, Belgium. petra.vandamme@ugent.be

PMID: 33109751
PMCID: PMC7593589
DOI: 10.1128/mSystems.00833-20

mSystems. 2020 Oct 27;5(5):e00833-20.

Abstract

Prokaryotic genome annotation is heavily dependent on automated gene annotation pipelines that are prone to propagate errors and underestimate genome complexity. We describe an optimized proteogenomic workflow that uses ribosome profiling (ribo-seq) and proteomic data for Salmonella enterica serovar Typhimurium to identify unannotated proteins or alternative protein forms. This data analysis encompasses the searching of cofragmenting peptides and postprocessing with extended peptide-to-spectrum quality features, including comparison to predicted fragment ion intensities. When this strategy is applied, an enhanced proteome depth is achieved, as well as greater confidence for unannotated peptide hits. We demonstrate the general applicability of our pipeline by reanalyzing public Deinococcus radiodurans data sets. Taken together, our results show that systematic reanalysis using available prokaryotic (proteome) data sets holds great promise to assist in experimentally based genome annotation.

IMPORTANCE Delineation of open reading frames (ORFs) causes persistent inconsistencies in prokaryote genome annotation. We demonstrate that advanced (re)analysis of omics data achieves higher proteome coverage and sensitive detection of unannotated ORFs, which can be exploited for conditional bacterial genome (re)annotation. This is especially relevant for annotating the wealth of prokaryotic genomes sequenced in recent years.

Keywords: Deinococcus radiodurans; Salmonella; alternative translation initiation; bacterial genome (re)annotation; chimeric spectra; riboproteogenomics; spectral re-scoring.

Figures

**FIG 1**
Proteomics pipeline using Percolator (15) postprocessing of 34 features (Table S1). S. Typhimurium protein expression was studied by ribosome profiling (ribo-seq) data (12) and proteomic shotgun analysis. Spectra were searched by MS-GF+ against an *in silico*-digested tryptic peptide database (see the text). The retention times (RT) of the top-scoring 1,000 nonredundant peptides (highest MS-GF+ score) were used to train an RT model with ELUDE (28) and calculate the deviation of empirical and predicted RT (ΔRT). Besides ΔRT, an additional 10 PSM quality features were measured, constituting the auxiliary feature set. The MS-GF+, auxiliary, or combined feature set was used by Percolator (15) for re-scoring of PSMs. Q values were re-estimated in a class-specific manner for annotated and novel peptides. Identified fragment ions were removed from spectra with a significant PSM (Q value < 0.01; combined feature set) and searched iteratively to identify cofragmented peptides. Per search, identical search and postprocessing steps were repeated as for the first search, except that the trained RT model was used from the first search and a wider precursor mass tolerance was applied (as described by Shteynberg et al. [26]).
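
The class-specific Q-value re-estimation mentioned in the caption can be illustrated with a minimal sketch. It assumes a plain target-decoy estimate (FDR = decoys / targets at each score threshold) computed separately for annotated and novel peptides; it is not Percolator's own estimator, and the function names and tuple layout are hypothetical.

```python
from typing import List, Tuple

# Hypothetical record layout: (score, is_decoy, peptide_class),
# where peptide_class is "annotated" or "novel".
PSM = Tuple[float, bool, str]


def q_values(pairs: List[Tuple[float, bool]]) -> List[float]:
    """Target-decoy q-values for (score, is_decoy) pairs, in input order."""
    order = sorted(range(len(pairs)), key=lambda i: pairs[i][0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if pairs[i][1]:
            decoys += 1
        else:
            targets += 1
        # Simple FDR estimate at this score threshold.
        fdrs.append(decoys / max(targets, 1))
    # q-value: the minimum FDR at this threshold or any more permissive one.
    q = [0.0] * len(pairs)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        q[order[rank]] = min(running_min, 1.0)
    return q


def class_specific_q_values(psms: List[PSM]) -> List[float]:
    """Estimate q-values separately within each peptide class."""
    result = [0.0] * len(psms)
    for cls in {p[2] for p in psms}:
        idx = [i for i, p in enumerate(psms) if p[2] == cls]
        qs = q_values([(psms[i][0], psms[i][1]) for i in idx])
        for i, qv in zip(idx, qs):
            result[i] = qv
    return result
```

Estimating within each class keeps the large pool of annotated matches from diluting the error rate reported for the much smaller set of novel peptides.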

**FIG 2**
Annotated peptide identification using a chimeric postprocessing pipeline. (A) Number of nonredundant peptide identifications (y axis) at Percolator peptide Q-value thresholds (x axis) in the first (left), second (middle), and third (right) searches. Percolator was run in parallel using the default MS-GF+ features (blue), the auxiliary features (purple), and the combined feature set (orange). (B) Scatterplot of MS-GF+ RawScore and Pearson correlation (spec_pears_norm by reScore [22]) for PSMs in the three iterative search rounds. Only features for the PSM with the highest Percolator-recalibrated score were displayed. (C) Overlap between peptides identified by MaxQuant and the three search rounds of the proteomics pipeline (combined feature set; peptide Q value ≤ 0.01).
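
As an illustration of the Pearson correlation feature plotted in panel B, the sketch below correlates observed fragment ion intensities with predicted ones for the same matched ions. It is a plain Pearson coefficient; how reScore normalizes intensities for spec_pears_norm is not described here, so the inputs are assumed to be already normalized and paired, and the function name is hypothetical.

```python
import math
from typing import Sequence


def pearson(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation between paired observed and predicted intensities."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        # Assumption: a constant vector carries no correlation signal.
        return 0.0
    return cov / math.sqrt(var_o * var_p)
```

A PSM whose observed fragment pattern follows the prediction closely scores near 1, which is why this feature can separate correct from incorrect matches that have similar search engine scores.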

**FIG 3**
Chimeric searches improve detection of low-abundance proteins. (A) Pearson correlation (r = 0.49) of protein abundance (MaxQuant log₂ protein intensity [x axis]) and ribo-seq translation levels (log₂ FPKM + 1 [y axis]). A total of 2,573 proteins were plotted (Table S5). (B) Ribo-seq translation levels for proteins matched by at least one unique peptide in the first search (including ambiguous peptide-to-protein assignments) (left) and for proteins exclusively identified in the chimeric searches by at least one unique peptide in at least two samples (right). The low-abundance AstD protein is indicated in orange (ribo-seq FPKM, 0.80). (C) Annotated MS/MS scan from the doubly charged RVVVGLLLGEVIR peptide identified in the replicate 1 sample at an OD of 0.8 in the first search round (left) and the doubly charged AGLPAGVLNLVQGGR peptide in the second search round (right). Matched b/y ions are indicated in blue and red, respectively.

**FIG 4**
Unannotated peptide identification using a chimeric postprocessing pipeline. (A) Number of nonredundant peptides (y axis) at Percolator peptide Q-value thresholds (x axis) in the first and second searches. Percolator was run in parallel using the default MS-GF+ features (blue), the auxiliary features (purple), and the combined feature set (orange). (B) Scatterplot of MS-GF+ RawScore and Pearson correlation (spec_pears_norm by reScore [22]) (Table S1) for PSMs in the two search rounds. Only features for the PSM with the highest Percolator-recalibrated score after postprocessing using the combined feature set are shown. (C) Distributions of MS-GF+ score, Pearson correlation, and logged explained ion current (lnExplainedIonCurrent) for PSMs with Q values below 1% (combined feature set) for annotated peptides (green) or below 5% for unannotated peptides (orange). (D) Ribo-seq coverage for annotated and novel peptides identified in the first and second searches using different feature sets for combined FDR or class-specific FDR estimation. Ribo-seq reads per kilobase of transcript per million reads mapped (RPKM) were calculated for genomic regions encoding the respective peptide, distinguishing highly translated regions (RPKM > 10), low-translated regions (RPKM < 10), and peptide genomic regions without ribosome footprints (RPKM = 0).
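
The RPKM binning in panel D can be sketched as follows, using the standard RPKM definition (reads per kilobase of region per million mapped reads). The function names are hypothetical, and placing a value of exactly 10 in the low-translated bin is an assumption, since the caption specifies only RPKM > 10, RPKM < 10, and RPKM = 0.

```python
def rpkm(read_count: int, region_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of region per million mapped reads."""
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and total mapped reads must be positive")
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)


def translation_class(value: float) -> str:
    """Bin a peptide-encoding region by its ribo-seq RPKM."""
    if value == 0:
        return "none"  # no ribosome footprints
    if value > 10:
        return "high"  # highly translated
    return "low"       # assumption: exactly 10 is binned as low


# Example: 30 footprint reads over a 30-bp peptide-encoding region,
# with 10 million mapped reads in the library.
example = translation_class(rpkm(30, 30, 10_000_000))
```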

**FIG 5**
S. Typhimurium unannotated protein-coding regions. (A) (Top) Venn diagram of unannotated peptides identified at a peptide Q value of ≤0.05 after Percolator processing using the MS-GF+, auxiliary, or combined feature sets. (Bottom) Peptide-to-ORF assignment, resulting in 66 high-confidence ORFs after manual inspection (see Materials and Methods). (B) (Top) Annotated MS/MS spectrum of the doubly charged SSLLSTHK; (bottom) MS²PIP-predicted MS/MS spectrum. Features used for Percolator postprocessing are displayed. (C) Overview of 66 high-confidence unannotated protein-coding regions. Bars are indicative of protein size; gray indicates Ensembl-annotated regions, whereas orange indicates unannotated protein regions. The corresponding Ensembl annotations are indicated on the left, whereas for intergenic ORFs, chromosomal locations and identical proteins identified by protein BLAST are displayed. In addition, whether ORF delineation corresponds to *de novo* predictions of REPARATION (12) (*), DeepRibo (13) (†), ranSEP-predicted ORFs (ranSEP score ≥ 0.5 [6]) (¶), or matched S. Typhimurium str. LT2 annotation (§) is indicated. Eight translation products identified only by peptides due to re-scoring and/or iterative searching are indicated in bold. Nt, N-terminal.

**FIG 6**
Integrative Genomics Viewer (IGV) (58) genome view of ribo-seq read density and identified unannotated peptides for Chr:3,708,198-3,708,527 (A) and Chr:2,819,729-2,820,325 (B).

**FIG 7**
Proteomic shotgun data sets across the bacterial phylogeny available in the PRIDE repository (36). The bacterial phylogeny was adapted from the work of Hug et al. (59), omitting the *Archaea* and *Eukarya* for ease of visualization. PRIDE accession records of bacteria were retrieved and catalogued using NCBI taxonomy identifiers. Proteomic data set identifiers plotted are given in Table S8 with node sizes corresponding to the number of available data sets and plotted using FigTree (version 1.4.3. [2009]; http://tree.bio.ed.ac.uk/software/figtree/).

**FIG 8**
Putative ORFs in Deinococcus radiodurans strain R1 with matching proteogenomic evidence. (A) Categorization of 59 putative ORFs with at least two matching peptides not present in Ensembl or NCBI annotation (assembly ASM856v1). (B to F) Genome view by IGV (58) showing identified annotated (dark gray) or novel (orange) peptides and their matching (longest) ORFs. Putative novel translation start sites are indicated by arrowheads labeled with the respective start codon. In addition, stranded mRNA-seq coverage was displayed from unstressed wild-type D. radiodurans strain R1 (41). Genome region coordinates are 1:1,396,647-1,396,869 (B), 1:2,419,754-2,421,036 (C), 1:264,080-266,689 (D), 1:2,162,416-2,163,711 (E), and 1:2,518,399-2,519,021 (F).

References

1. Poptsova MS, Gogarten JP. 2010. Using comparative genome analysis to identify problems in annotated microbial genomes. Microbiology (Reading) 156:1909–1917. doi: 10.1099/mic.0.033811-0.
2. Warren AS, Archuleta J, Feng WC, Setubal JC. 2010. Missing genes in the annotation of prokaryotic genomes. BMC Bioinformatics 11:131. doi: 10.1186/1471-2105-11-131.
3. Wood DE, Lin H, Levy-Moonshine A, Swaminathan R, Chang YC, Anton BP, Osmani L, Steffen M, Kasif S, Salzberg SL. 2012. Thousands of missed genes found in bacterial genomes and their analysis with COMBREX. Biol Direct 7:37. doi: 10.1186/1745-6150-7-37.
4. Fijalkowska D, Fijalkowski I, Willems P, Van Damme P. 2020. Bacterial riboproteogenomics: the era of N-terminal proteoform existence revealed. FEMS Microbiol Rev 44:418–431. doi: 10.1093/femsre/fuaa013.
5. Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O’Neill K, Li W, Chitsaz F, Derbyshire MK, Gonzales NR, Gwadz M, Lu F, Marchler GH, Song JS, Thanki N, Yamashita RA, Zheng C, Thibaud-Nissen F, Geer LY, Marchler-Bauer A, Pruitt KD. 2018. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res 46:D851–D860. doi: 10.1093/nar/gkx1068.
