Discovery and revision of Arabidopsis genes by proteogenomics
- PMID: 19098097
- PMCID: PMC2605632
- DOI: 10.1073/pnas.0811066106
Abstract
Gene annotation underpins genome science. Most often protein coding sequence is inferred from the genome based on transcript evidence and computational predictions. While generally correct, gene models suffer from errors in reading frame, exon border definition, and exon identification. To ascertain the error rate of Arabidopsis thaliana gene models, we isolated proteins from a sample of Arabidopsis tissues and determined the amino acid sequences of 144,079 distinct peptides by tandem mass spectrometry. The peptides corresponded to 1 or more of 3 different translations of the genome: a 6-frame translation, an exon splice-graph, and the currently annotated proteome. The majority of the peptides (126,055) resided in existing gene models (12,769 confirmed proteins), comprising 40% of annotated genes. Surprisingly, 18,024 novel peptides were found that do not correspond to annotated genes. Using the gene finding program AUGUSTUS and 5,426 novel peptides that occurred in clusters, we discovered 778 new protein-coding genes and refined the annotation of an additional 695 gene models. The remaining 13,449 novel peptides provide high quality annotation (>99% correct) for thousands of additional genes. Our observation that 18,024 of 144,079 peptides did not match current gene models suggests that 13% of the Arabidopsis proteome was incomplete due to approximately equal numbers of missing and incorrect gene models.
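The 6-frame translation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the study matched tandem mass spectra against translated databases, whereas the example below only performs an exact substring lookup of a known peptide. All function names (`reverse_complement`, `translate`, `six_frame_translation`, `find_peptide`) and the toy DNA sequence are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code (assumed), with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map (strand, offset) to the translation of that reading frame."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def find_peptide(peptide, dna):
    """List the reading frames whose translation contains the peptide."""
    return [
        frame
        for frame, protein in six_frame_translation(dna).items()
        if peptide in protein
    ]


# Hypothetical toy sequence: ATG GCC AAA TAA encodes M-A-K-stop on the + strand.
toy_dna = "ATGGCCAAATAA"
frames_with_mak = find_peptide("MAK", toy_dna)
```

A peptide that appears in a frame with no annotated gene would correspond to what the abstract calls a novel peptide; a real analysis would additionally need splice-aware matching (the exon splice-graph) and statistical control of false identifications.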
Conflict of interest statement
The authors declare no conflict of interest.