Genome Res. 2011 May;21(5):756-67.
doi: 10.1101/gr.114272.110. Epub 2011 Apr 1.

Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome

Markus Brosch et al. Genome Res. 2011 May.

Abstract

Recent advances in proteomic mass spectrometry (MS) offer the chance to marry high-throughput peptide sequencing to transcript models, allowing the validation, refinement, and identification of new protein-coding loci. We present a novel pipeline that integrates highly sensitive and statistically robust peptide spectrum matching with genome-wide protein-coding predictions to perform large-scale gene validation and discovery in the mouse genome for the first time. In searching more than 10 million spectra, we have been able to validate 32%, 17%, and 7% of all protein-coding genes, exons, and splice boundaries, respectively. Moreover, we present strong evidence for the identification of multiple alternatively spliced translations from 53 genes and have uncovered 10 entirely novel protein-coding genes, which are not covered in any mouse annotation data sources. One such novel protein-coding gene is a fusion protein that spans the Ins2 and Igf2 loci to produce a transcript encoding the insulin II and the insulin-like growth factor 2-derived peptides. We also report nine processed pseudogenes that have unique peptide hits, demonstrating, for the first time, that they are not just transcribed but are translated and are therefore resurrected into new coding loci. This work not only highlights an important utility for MS data in genome annotation but also provides unique insights into gene structure and propagation in the mouse genome. All these data have subsequently been used to improve the publicly available mouse annotation in both the Vega and Ensembl genome browsers (http://vega.sanger.ac.uk).

Figures

Figure 1.
Genome annotation pipeline. The database at the core of the system, GenoMS-DB, is built by integrating all peptides that are derived from an in silico digestion of available data sources (Ensembl, Vega, Augustus). Each peptide derived from these data sources is associated with its genomic locus and context (such as gene, transcript, exon, or splice site information). Peptides from FASTA protein databases can optionally be integrated but would lack genome mapping. A set of nonredundant in silico digested peptides is exported from GenoMS-DB to create the Mascot search database. Tandem MS spectra are searched with Mascot and post-processed with Mascot Percolator to derive accurate probabilities for each peptide-spectrum match (PSM). A series of steps removes common contaminant sequences and low-scoring PSMs from the results before the remaining identifications are stored in GenoMS-DB. This integration of peptide-genome mapping with peptide identifications enables streamlined analysis with standard SQL or visualization as a track in a genome browser via a DAS feature server. The pipeline is flexible: alternative gene prediction tools could be used to provide source peptides, and alternative search engines and probability assessment algorithms could be integrated.
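The digestion and locus-mapping step described in this caption can be illustrated with a minimal sketch. It is not the GenoMS-DB implementation: the function names, the toy sequences, and the cleavage rule (cut after K or R unless followed by P, no missed cleavages) are assumptions made for illustration, and the 8 to 30 residue length window is borrowed from the Figure 3 caption.

```python
from collections import defaultdict


def tryptic_digest(protein, min_len=8, max_len=30):
    """Return fully tryptic peptides with no missed cleavages.

    Assumed rule (not stated in the source): cleave after K or R
    unless the next residue is P.
    """
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        at_end = i == len(protein) - 1
        cleave = aa in "KR" and not at_end and protein[i + 1] != "P"
        if cleave or at_end:
            pep = protein[start:i + 1]
            if min_len <= len(pep) <= max_len:
                peptides.append(pep)
            start = i + 1
    return peptides


def build_peptide_index(proteins_by_locus):
    """Map each distinct peptide to the set of loci it can derive from.

    The keys form the nonredundant peptide set; the values keep the
    peptide-to-genome association.
    """
    index = defaultdict(set)
    for locus, proteins in proteins_by_locus.items():
        for protein in proteins:
            for pep in tryptic_digest(protein):
                index[pep].add(locus)
    return dict(index)


# Hypothetical toy input: locus name -> predicted protein sequences.
proteins_by_locus = {
    "geneA": ["MAAAAAAAKPEPTIDEKGGGGGGGGR"],
    "geneB": ["GGGGGGGGRLLLLLLLLK"],
}
peptide_index = build_peptide_index(proteins_by_locus)
```

Peptides whose locus set has exactly one member are the ones that are unique to a genomic locus; peptides shared between loci stay in the nonredundant search set but are ambiguous as evidence for any single gene.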
Figure 2.
Four-way Venn diagram showing distribution of origin of all identified peptides (A) and of all candidate peptides in the search database (B).
Figure 3.
(A) Cumulative gene identification rate as a function of the number of potential identifiable (hypothetical) peptides per protein-coding gene. (B) As before, but analysis for protein-coding exons. Note that considered peptides were fully tryptic, ranged from eight to 30 residues and were unique to a genomic locus. (C) Inverse cumulative validation rate of all protein-coding genes as a function of the number of peptides identified per gene. (D) As before, but for protein-coding exons.
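As a rough illustration of the quantity plotted in panels C and D, the sketch below computes, for each threshold k, the fraction of genes (or exons) with at least k identified peptides. Reading "inverse cumulative validation rate" this way is an assumption, and the function name and counts are hypothetical.

```python
def inverse_cumulative_rate(peptides_per_feature, max_k):
    """Fraction of features with at least k identified peptides, for k = 1..max_k."""
    total = len(peptides_per_feature)
    return [
        sum(1 for n in peptides_per_feature.values() if n >= k) / total
        for k in range(1, max_k + 1)
    ]


# Hypothetical counts of identified peptides per gene.
counts = {"g1": 0, "g2": 1, "g3": 3, "g4": 2}
rates = inverse_cumulative_rate(counts, max_k=3)  # [0.75, 0.5, 0.25]
```

The value at k = 1 corresponds to the share of features validated by any peptide at all; higher k values show how quickly support thins out when more than one peptide is required.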
Figure 4.
Correlation analysis between the number of identified peptides and the number of potential identifiable peptides per gene. Since many data points have the same xy-values, the number of overlaying data points (genes) is encoded with the color gradient available from the legend.
Figure 5.
MS PSMs confirm the protein-coding potential of five alternatively translated products of the UDP-glucuronosyltransferase 1 family, polypeptide A6 (highlighted in bold). Ambiguous PSMs are shown for the two alternatively spliced transcripts of the Ugt1a6a and Ugt1a6b genes, respectively, and as clusters for each of the 3′ exons.
Figure 6.
(A) Species-specific EST and peptide evidence supports the annotation of a protein-coding object at this locus. (B) Clustal alignment of the polypeptide sequences of this novel single-exon object with the orthologous objects from the rat and human genomes, respectively. There is no transcriptional support for either the rat or human model; however, the CDS frame is highly conserved and intact. The residues of the mouse translation that are covered by PSMs are colored yellow.
Figure 7.
(A) Mouse Ins2-Igf2 fusion object contains a valid CDS, supported by human cDNA and species-specific peptide evidence. (B) Clustal alignment of the translation of the mouse and human fusion transcripts. This human translation would be a target for the NMD pathway due to a frame-shift mutation, caused by the inclusion of an additional exon not present in the mouse transcript. The residues of the Ins2 and Igf2 polypeptides are colored blue and pink, with the known domains within each highlighted in gray.
Figure 8.
(A) MS PSMs allow the annotation of OTTMUST00000089966, for which there is no full-length transcriptional support. (B) Focus on the 2664-bp exon of this transcript. Exons of this length are uncommon and are problematic for manual annotation. Translation of this exon is shown, with the positions of four PSMs that cover this exon highlighted in yellow.
Figure 9.
(A) Two canonically splicing MS PSMs support the annotation of coding isoforms of a mouse Ppia processed pseudogene locus (OTTMUST00000018507). (B) Clustal alignment of parent PPIA protein (SWISS-PROT P17742), translations of both coding isoforms of the mouse Ppia processed pseudogene, and the putative translation of the pseudogene object. Residues of the coding isoforms that are covered by MS evidence are highlighted yellow. Residues of the parent polypeptide that are part of known domains are shown by colored boxes above the alignment.


Publication types