Front Plant Sci. 2013 Jun 20;4:208.
doi: 10.3389/fpls.2013.00208. eCollection 2013.

Detailed analysis of putative genes encoding small proteins in legume genomes

Gabriel Guillén et al. Front Plant Sci.

Abstract

Diverse plant genome sequencing projects, coupled with powerful bioinformatics tools, have facilitated massive data analysis and the construction of specialized databases classified according to cellular function. However, a considerable number of genes still encode proteins whose function has not been characterized. This category includes small proteins (SPs, 30-150 amino acids) encoded by short open reading frames (sORFs). SPs play important roles in plant physiology, growth, and development. Unfortunately, protocols for the genome-wide identification and characterization of sORFs are scarce or poorly implemented. As a result, these genes are underrepresented in many genome annotations. In this work, we exploited publicly available genome sequences of Phaseolus vulgaris, Medicago truncatula, Glycine max, and Lotus japonicus to analyze the abundance of annotated SPs in legumes. Our strategy to uncover bona fide sORFs at the genome level centered on bioinformatics analysis of characteristics such as evidence of expression (transcription), presence of known protein regions or domains, and identification of orthologous genes in the genomes explored. We collected 6170, 10,461, 30,521, and 23,599 putative sORFs from the P. vulgaris, G. max, M. truncatula, and L. japonicus genomes, respectively. Expressed sequence tags (ESTs) available in the DFCI Gene Index database provided evidence that approximately one-third of the predicted legume sORFs are expressed. Most potential SPs have a counterpart in a different plant species and counterpart regions or domains in larger proteins. Potential functional sORFs were also classified according to a reduced set of GO categories, and the expression of 13 of them during P. vulgaris nodule ontogeny was confirmed by qPCR. This analysis provides a collection of sORFs that potentially encode meaningful SPs and offers the possibility of their further functional evaluation.

Keywords: gene annotation; legume genomes; short open reading frames.
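
As an illustration of the size criterion stated in the abstract (SPs of 30-150 amino acids), the following minimal sketch filters an annotated protein set by length. It is not the authors' pipeline: the function names, the FASTA input, and the handling of a trailing `*` stop symbol are assumptions made for this example only.

```python
from typing import Dict, Iterator, Tuple

MIN_AA = 30   # lower size bound for SPs, taken from the abstract
MAX_AA = 150  # upper size bound for SPs, taken from the abstract


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def small_proteins(fasta_text: str,
                   min_aa: int = MIN_AA,
                   max_aa: int = MAX_AA) -> Dict[str, int]:
    """Return {identifier: length} for proteins within the size range.

    Hypothetical helper: a trailing '*' (stop symbol used in some
    annotation files) is not counted as an amino acid.
    """
    selected = {}
    for name, seq in parse_fasta(fasta_text):
        length = len(seq.rstrip("*"))
        if min_aa <= length <= max_aa:
            selected[name] = length
    return selected


# Made-up records: p1 is 40 aa, p2 is 200 aa, p3 is 2 aa.
EXAMPLE = (
    ">p1 hypothetical small protein\n" + "M" + "A" * 39 + "*\n"
    ">p2\n" + "M" * 200 + "\n"
    ">p3\n" + "MK\n"
)

if __name__ == "__main__":
    print(small_proteins(EXAMPLE))
```

Only `p1` passes the filter in this made-up input. A length filter alone yields candidates; the abstract's additional evidence (expression, shared domains, orthologs) is what separates putative from bona fide sORFs.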

Figures

Figure 1
Sequence similarity of 3′-non-coding sequences of putative P. vulgaris sORFs to the Arabidopsis thaliana protein collection. Non-coding sequences of 1 kb (sense and antisense strands) downstream of the stop codons of putative P. vulgaris sORFs are plotted as a function of similarity to A. thaliana proteins (e-10 value).
Figure 2
Length distribution of predicted protein sequences in legume and non-legume plant genomes. Pv, Phaseolus vulgaris protein sizes in P. vulgaris v0.9 annotation; Gm, Glycine max protein sizes in G. max v1.0 annotation; Mt, M. truncatula protein sizes in Mt3.0 annotation; Lj, L. japonicus protein sizes in Lj1.0 annotation; At, Arabidopsis thaliana protein sizes in genome release 9; and Zm, Zea mays protein sizes in Maize Golden Path B73 RefGen_v2.
Figure 3
RNA sizes for different ranges of protein size represented in a box and whisker plot. The center lines indicate the medians, the bottom and top of each box indicate the first and third quartiles, and the whiskers extend to the most extreme data points.
Figure 4
Legume sORFs display common aa regions or domains with larger polypeptides of the same genome. The identity level of P. vulgaris, G. max, M. truncatula, and L. japonicus predicted sORFs (peptide sequence coverage) is spread across several homologous proteins of variable size (protein sequence coverage) of the respective genome. As an example, (A) illustrates the distribution pattern of sORFs in P. vulgaris that are identical in sequence to other small proteins (slightly larger than 120 aa); in (B) sORFs that share a domain with larger polypeptides are included; and in (C) sORFs that are completely equivalent to regions or domains found in larger proteins are indicated.
Figure 5
Venn diagram representing the distribution of GO categories found in each legume genome. Around 15% of all sORFs in legumes were included in “response to stimulus” and close to 20% were related to “localization” GO categories. The Fisher's exact test (Routledge, 1998) was applied to determine which GO categories were statistically over-represented compared to all proteins of the genome (p < 0.05, corrected by Benjamini adjustment).
Figure 6
Some sORFs in P. vulgaris are shared with other plants. The graph shows the number of P. vulgaris sORFs found exclusively in this plant (Pv) compared to those that are also present in G. max (Gm), M. truncatula (Mt), L. japonicus (Lj), A. thaliana (At), and Z. mays (Zm). It also shows the number of sORFs of legumes that form determinate (LegDN) or indeterminate (Leg) nodules. The "All plants" bar represents the number of sORFs that are common to all plant species evaluated.
Figure 7
sORF expression during nodule ontogeny. Relative expression levels of a selected group of sORFs (Table 7) were determined by qPCR in nodules and nodule-stripped roots at the indicated times, confirming their expression. Total RNA was isolated from each biological sample. First-strand cDNA was synthesized and subjected to qPCR as described in Materials and Methods. Expression levels were normalized against Elongation factor 1-alpha (Ef1-α) values. Ratios of expression in nodule-stripped roots to nodules are graphed. Values represent the mean and SD of triplicate experiments.
Figure 8
Evidence of functional SPs in P. vulgaris. Out of 6170 annotated sORFs in the genome of P. vulgaris, 2336 had expression evidence (DFCI Gene Index database), 2929 shared common regions or domains with other proteins (larger than 120 aa) of P. vulgaris and 3274 were homologous to SPs found in different plant species. According to the Phytozome annotation, 4970 belong to one or more protein families. 2553 sORFs in P. vulgaris have at least one of these types of evidence of functionality, whereas 2321 have two of them and a total of 776 sORFs have all of them.
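
The over-representation analysis named in the Figure 5 caption (Fisher's exact test with a Benjamini adjustment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a one-sided test for enrichment and the Benjamini-Hochberg procedure, and all function names and counts are hypothetical.

```python
from math import comb
from typing import List


def fisher_over_representation(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact p-value for over-representation.

    k: sORFs annotated with the GO category
    n: all sORFs tested
    K: genome proteins annotated with the GO category
    N: all genome proteins (the sORFs are assumed to be a subset)
    Returns P(X >= k) for X drawn from the hypergeometric distribution.
    """
    upper = min(n, K)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible tables contribute nothing to the tail sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical counts (k, n, K, N) for three GO categories.
    tables = [(30, 100, 150, 1000), (12, 100, 110, 1000), (5, 100, 90, 1000)]
    raw = [fisher_over_representation(*t) for t in tables]
    for table, p_adj in zip(tables, benjamini_hochberg(raw)):
        print(table, "over-represented" if p_adj < 0.05 else "not significant")
```

Exact integer arithmetic with `math.comb` keeps the sketch dependency-free. For genome-scale counts, a statistics library's hypergeometric routines would be the more practical choice.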

References

    Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403–410. doi: 10.1016/S0022-2836(05)80360-2
    Arabidopsis Genome Initiative. (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815. doi: 10.1038/35048692
    Ashburner M., Ball C. A., Blake J. A., Butler H., Cherry J. M., Corradi J., et al. (2001). Creating the gene ontology resource: design and implementation. Genome Res. 11, 1425–1433. doi: 10.1101/gr.180801
    Cannon S. B., May G. D., Jackson S. A. (2009). Three sequenced legume genomes and many crop species: rich opportunities for translational genomics. Plant Physiol. 151, 970–977. doi: 10.1104/pp.109.144659
    Cannon S. B., Sterck L., Rombauts S., Sato S., Cheung F., Gouzy J., et al. (2006). Legume genome evolution viewed through the Medicago truncatula and Lotus japonicus genomes. Proc. Natl. Acad. Sci. U.S.A. 103, 14959–14964. doi: 10.1073/pnas.0603228103