Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome

Markus Brosch¹, Gary I Saunders, Adam Frankish, Mark O Collins, Lu Yu, James Wright, Ruth Verstraten, David J Adams, Jennifer Harrow, Jyoti S Choudhary, Tim Hubbard

PMID: 21460061
PMCID: PMC3083093
DOI: 10.1101/gr.114272.110

Markus Brosch et al. Genome Res. 2011 May;21(5):756-67. doi: 10.1101/gr.114272.110. Epub 2011 Apr 1.

Affiliation

¹ The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

Abstract

Recent advances in proteomic mass spectrometry (MS) offer the chance to marry high-throughput peptide sequencing to transcript models, allowing the validation, refinement, and identification of new protein-coding loci. We present a novel pipeline that integrates highly sensitive and statistically robust peptide spectrum matching with genome-wide protein-coding predictions to perform large-scale gene validation and discovery in the mouse genome for the first time. By searching more than 10 million spectra, we have been able to validate 32%, 17%, and 7% of all protein-coding genes, exons, and splice boundaries, respectively. Moreover, we present strong evidence for the identification of multiple alternatively spliced translations from 53 genes and have uncovered 10 entirely novel protein-coding genes that are not covered by any mouse annotation data source. One such novel protein-coding gene is a fusion protein that spans the Ins2 and Igf2 loci to produce a transcript encoding the insulin II and the insulin-like growth factor 2-derived peptides. We also report nine processed pseudogenes that have unique peptide hits, demonstrating, for the first time, that they are not just transcribed but are translated and are therefore resurrected into new coding loci. This work not only highlights an important utility for MS data in genome annotation but also provides unique insights into gene structure and propagation in the mouse genome. All these data have subsequently been used to improve the publicly available mouse annotation in both the Vega and Ensembl genome browsers (http://vega.sanger.ac.uk).

Figures

**Figure 1.**
Genome annotation pipeline. The database at the core of the system, GenoMS-DB, is built by integrating all peptides that are derived from an in silico digestion of available data sources (Ensembl, Vega, Augustus). Each peptide derived from these data sources is associated with its genomic locus and context (such as gene, transcript, exon, or splice site information). Peptides from FASTA protein databases can optionally be integrated but would lack genome mapping. A set of nonredundant in silico digested peptides is exported from GenoMS-DB to create the Mascot search database. Tandem MS spectra are searched with Mascot and post-processed with Mascot Percolator to derive accurate probabilities on a per PSM basis. A series of steps removes common contaminant sequences and low-scoring PSMs from the results, prior to storing the remaining identifications into the GenoMS-DB database. This integration of peptide-genome mapping together with peptide identifications enables streamlined analysis with standard SQL or visualization as a track in a genome browser via a DAS feature server. This is a flexible pipeline where alternative gene prediction tools could be used to provide source peptides, and alternative search engines and probability assessment algorithms could be integrated.
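
The digestion and export steps described in this caption can be sketched roughly as follows. This is a minimal illustration, not the GenoMS-DB implementation: the function names, the locus identifiers, the "cleave after K or R unless followed by P" rule, and the absence of missed cleavages are all assumptions that the caption does not specify.

```python
from collections import defaultdict


def tryptic_digest(protein):
    """Cut a protein after K or R, except when the next residue is P.

    Assumed cleavage rule; missed cleavages are not modeled.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Trailing peptide that does not end in K or R.
        peptides.append(protein[start:])
    return peptides


def build_peptide_index(proteins_by_locus):
    """Map every in silico peptide to the set of loci that can produce it."""
    index = defaultdict(set)
    for locus, protein in proteins_by_locus.items():
        for peptide in tryptic_digest(protein):
            index[peptide].add(locus)
    return index


# Hypothetical toy input: locus identifier -> translated sequence.
example_proteins = {"geneA": "AAKCCR", "geneB": "AAKDDR"}
example_index = build_peptide_index(example_proteins)

# The nonredundant search database lists each peptide once,
# while the index retains every locus the peptide maps to.
nonredundant_peptides = sorted(example_index)
```

Keeping the peptide-to-locus mapping separate from the nonredundant export is what lets an identified peptide be traced back to its genomic context after the search.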

**Figure 2.**
Four-way Venn diagram showing distribution of origin of all identified peptides (A) and of all candidate peptides in the search database (B).

**Figure 3.**
(A) Cumulative gene identification rate as a function of the number of potentially identifiable (hypothetical) peptides per protein-coding gene. (B) As in A, but for protein-coding exons. Note that the peptides considered were fully tryptic, ranged from eight to 30 residues, and were unique to a genomic locus. (C) Inverse cumulative validation rate of all protein-coding genes as a function of the number of peptides identified per gene. (D) As in C, but for protein-coding exons.
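
One way to read the peptide criteria and the inverse cumulative rate in this caption is sketched below. It assumes a peptide-to-loci mapping produced by an upstream fully tryptic digestion, and it interprets "inverse cumulative validation rate" as the fraction of all genes with at least k identified peptides; both the interpretation and the function names are assumptions, not taken from the paper.

```python
def identifiable_peptides(peptide_to_loci, min_len=8, max_len=30):
    """Keep peptides of 8 to 30 residues that map to exactly one locus.

    peptide_to_loci: dict of peptide sequence -> set of locus identifiers.
    Returns a dict of peptide sequence -> its single locus.
    """
    return {
        peptide: next(iter(loci))
        for peptide, loci in peptide_to_loci.items()
        if min_len <= len(peptide) <= max_len and len(loci) == 1
    }


def inverse_cumulative_rate(identified_per_gene, total_genes, max_k):
    """Fraction of all genes with at least k identified peptides, k = 1..max_k.

    identified_per_gene: dict of gene -> number of identified peptides;
    genes without any identification may simply be absent.
    """
    return [
        sum(1 for count in identified_per_gene.values() if count >= k) / total_genes
        for k in range(1, max_k + 1)
    ]


# Hypothetical toy data for illustration only.
toy_index = {
    "ACDEFGHIK": {"g1"},            # 9 residues, one locus: kept
    "SHORTK": {"g1"},               # 6 residues: too short
    "LMNPQSTVWYK": {"g1", "g2"},    # maps to two loci: not unique
}
kept = identifiable_peptides(toy_index)
rates = inverse_cumulative_rate({"g1": 3, "g2": 1}, total_genes=4, max_k=3)
```

With these toy numbers, two of four genes have at least one peptide and one of four has at least two or three, so the rate falls as k grows.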

**Figure 4.**
Correlation analysis between the number of identified peptides and the number of potentially identifiable peptides per gene. Since many data points share the same x–y values, the number of overlaying data points (genes) is encoded by the color gradient shown in the legend.

**Figure 5.**
MS PSMs confirm the protein-coding potential of five alternatively translated products of the UDP-glucuronosyltransferase 1 family, polypeptide A6 (highlighted in bold). Ambiguous PSMs are shown for the two alternatively spliced transcripts of the Ugt1a6a and Ugt1a6b genes, respectively; and as clusters for each of the 3′ exons.

**Figure 6.**
(A) Species-specific EST and peptide evidence supports the annotation of a protein-coding object at this locus. (B) Clustal alignment of the polypeptide sequences of this novel single-exon object with the orthologous objects from the rat and human genomes, respectively. There is no transcriptional support for either the rat or human model; however, the CDS frame is highly conserved and intact. The residues of the mouse translation that are covered by PSMs are colored yellow.

**Figure 7.**
(A) Mouse *Ins2-Igf2* fusion object contains a valid CDS, supported by human cDNA and species-specific peptide evidence. (B) Clustal alignment of the translation of the mouse and human fusion transcripts. This human translation would be a target for the NMD pathway due to a frame-shift mutation, caused by the inclusion of an additional exon not present in the mouse transcript. The residues of the *Ins2* and *Igf2* polypeptides are colored blue and pink, respectively, with the known domains within each highlighted in gray.

**Figure 8.**
(A) MS PSMs allow the annotation of OTTMUST00000089966, for which there is no full-length transcriptional support. (B) Focus on the 2664-bp exon of this transcript. Exons of this length are uncommon and are problematic for manual annotation. Translation of this exon is shown, with the positions of four PSMs that cover this exon highlighted in yellow.

**Figure 9.**
(A) Two canonically splicing MS PSMs support the annotation of coding isoforms of a mouse *Ppia* processed pseudogene locus (OTTMUST00000018507). (B) Clustal alignment of parent PPIA protein (SWISS-PROT P17742), translations of both coding isoforms of the mouse *Ppia* processed pseudogene, and the putative translation of the pseudogene object. Residues of the coding isoforms that are covered by MS evidence are highlighted yellow. Residues of the parent polypeptide that are part of known domains are shown by colored boxes *above* the alignment.
