Computational identification and characterization of novel genes from legumes

Michelle A Graham¹, Kevin A T Silverstein, Steven B Cannon, Kathryn A VandenBosch

PMID: 15266052
PMCID: PMC519039
DOI: 10.1104/pp.104.037531

Michelle A Graham et al. Plant Physiol. 2004 Jul;135(3):1179-97. doi: 10.1104/pp.104.037531.

Affiliation

¹ Department of Plant Biology, University of Minnesota, St. Paul, Minnesota 55108, USA.

Abstract

The Fabaceae, the third largest family of plants and the source of many crops, has been the target of many genomic studies. Currently, only the grasses surpass the legumes for the number of publicly available expressed sequence tags (ESTs). The quantity of sequences from diverse plants enables the use of computational approaches to identify novel genes in specific taxa. We used BLAST algorithms to compare unigene sets from Medicago truncatula, Lotus japonicus, and soybean (Glycine max and Glycine soja) to nonlegume unigene sets, to GenBank's nonredundant and EST databases, and to the genomic sequences of rice (Oryza sativa) and Arabidopsis. As a working definition, putatively legume-specific genes had no sequence homology, below a specified threshold, to publicly available sequences of nonlegumes. Using this approach, 2,525 legume-specific EST contigs were identified, of which less than three percent had clear homology to previously characterized legume genes. As a first step toward predicting function, related sequences were clustered to build motifs that could be searched against protein databases. Three families of interest were more deeply characterized: F-box related proteins, Pro-rich proteins, and Cys cluster proteins (CCPs). Of particular interest were the >300 CCPs, primarily from nodules or seeds, with predicted similarity to defensins. Motif searching also identified several previously unknown CCP-like open reading frames in Arabidopsis. Evolutionary analyses of the genomic sequences of several CCPs in M. truncatula suggest that this family has evolved by local duplications and divergent selection.
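The working definition in the abstract (a legume unigene is putatively legume-specific if it has no nonlegume hit below a threshold) can be illustrated with a minimal Python sketch. It assumes the BLAST results are available as 12-column tabular rows (query ID in the first column, E-value in the eleventh). The function name, the toy identifiers, and the cutoff of `1e-4` are placeholders for illustration; the abstract does not state the threshold the authors used.

```python
def legume_specific_candidates(unigene_ids, blast_rows, evalue_cutoff):
    """Return unigene IDs that have no hit with an E-value below the cutoff.

    blast_rows: iterable of tab-separated lines in 12-column tabular BLAST
    format, from searches of legume unigenes against nonlegume databases.
    """
    queries_with_hit = set()
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comment or malformed lines
        query_id = fields[0]
        evalue = float(fields[10])
        if evalue < evalue_cutoff:
            queries_with_hit.add(query_id)
    # Preserve the input order of the unigene list.
    return [uid for uid in unigene_ids if uid not in queries_with_hit]


# Toy data (hypothetical IDs and values, not taken from the paper).
unigenes = ["TC_a", "TC_b", "TC_c"]
rows = [
    "TC_a\tnonlegume_1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-40\t150",
    "TC_b\tnonlegume_2\t40.0\t50\t30\t0\t1\t50\t1\t50\t2.5\t20",
]
candidates = legume_specific_candidates(unigenes, rows, evalue_cutoff=1e-4)
print(candidates)
```

In this toy run `TC_a` is excluded because it has a strong nonlegume hit, while `TC_b` (only a weak hit) and `TC_c` (no hit at all) remain as candidates. A real pipeline would repeat the filter for each database listed in the abstract and keep only the sequences that pass all of them.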

Figures

**Figure 1.**
Motif analysis of clustered sequences reveals similarity to F-box-associated domains and Pro-rich proteins. Residues that are identical throughout the proteins are shown in red. Residues conserved in more than 50% of the proteins are shown in yellow. Similar amino acid residues are shown in blue. Gaps (-) were introduced to optimize the alignments. Given the large number of sequences in the alignments, not all sequences are shown. All sequence names are preceded by a two-letter abbreviation representing the species name: At (Arabidopsis), Pd (*Prunus dulcis*), Ah (*Antirrhinum hispanicum*), Ll (*L. luteus*), La (*L. albus*), Ms (*M. sativa*), Ps (pea), Ha (sunflower), and Dc (carrot). Mt (GenBank accession) and Gm (GenBank accession) refer to EST singletons from *M. truncatula* and *G. max/soja*, respectively. MtTC and GmTC refer to TCs from *M. truncatula* and *G. max/soja*, respectively. Mtg (GenBank accession) refers to *M. truncatula* genomic sequence. Atchr (number) refers to unannotated Arabidopsis chromosomal sequence. The approximate position of the predicted start site follows the underscore. This number is based on the analysis of the Arabidopsis genome sequence (TIGR 3.0). A, Motif analysis of group 640 revealed similarity to a core 20 amino acid region within the larger F-box-associated domain. This region has been underlined. The alignment demonstrates the variability found outside this core domain. Sequences PdQ84KK3, AtQ9SFC7, and AhQ9AQW0 were identified from Swiss-Prot/TrEMBL. B, Regular expression pattern analysis of group 5 revealed similarities to Pro-rich cell wall proteins. LlPRPContig1, LaPRPContig1, GmPRPContig1, and MtPRPContig5 represent contigs assembled using the Sequencher software. GmP15642, GmP13993, GmP08012, MtQ43564, MtQ40375, MtQ40376, MsQ40358, PsQ9SC42, DcP06600, DcQ39686, DcP93705, and AtQ9LIE8 were identified from Swiss-Prot/TrEMBL.
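The column classes behind the coloring in Figure 1 can be sketched in Python as follows. This is an illustrative assumption about how such classes might be computed, not the authors' procedure. It covers only the "identical" and "conserved in more than 50%" classes; the "similar residue" class is omitted because it depends on a similarity grouping that the caption does not specify. The function name and the toy alignment are hypothetical.

```python
from collections import Counter


def classify_columns(aligned_seqs):
    """Label each alignment column as identical, majority, variable, or gap.

    aligned_seqs: list of equal-length strings, with '-' marking gaps.
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    labels = []
    for column in zip(*aligned_seqs):
        counts = Counter(res for res in column if res != "-")
        if not counts:
            labels.append("gap")  # column contains only gaps
            continue
        _, top_count = counts.most_common(1)[0]
        if top_count == len(column):
            labels.append("identical")  # same residue in every sequence
        elif top_count > len(column) / 2:
            labels.append("majority")  # one residue in more than 50%
        else:
            labels.append("variable")
    return labels


# Toy alignment (invented sequences).
toy_alignment = ["MKC-P", "MKCAP", "MRCAG", "MKC-S"]
column_labels = classify_columns(toy_alignment)
print(column_labels)
```

Gapped positions count against conservation here, so a column in which half of the sequences carry a gap is never labeled "majority". That is a design choice of the sketch.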

**Figure 2.**
Motif analysis of different groups of Cys cluster proteins. Small figures below each of the alignments depict the pattern of conserved residues. Residue coloring and sequence nomenclature are the same as used in Figure 1. Gaps (-) were introduced to optimize the alignments. Given the large number of sequences in the alignments, not all sequences are shown. A, Motif analysis of nodule-specific CCP group 31.01 identified no homologous sequences from other species. B, Motif analysis of nodule-specific CCP group 31.02 identified sequences from unannotated regions of the Arabidopsis genome and from Swiss-Prot/TrEMBL. MmQ9BJX2 is a neurotoxin protein identified from scorpion; GoQ9STB7 is a hypothetical protein from *G. orientalis*. C, Prior to motif analysis, seed-specific CCP groups 645 and 40 were merged into a single alignment. The resulting motif identified sequences from the Arabidopsis genome and from Swiss-Prot/TrEMBL. Hits to Arabidopsis came from both annotated (At5g63660) and unannotated regions of the genome. Hits from Swiss-Prot/TrEMBL included a putative gamma-thionin from *Eutrema wasabi* (EwQ9FS38) and putative self-incompatibility factor LCR46 from Arabidopsis (AtP82761).
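One simple way to summarize the pattern of conserved Cys residues in a Cys cluster protein is the spacing between consecutive cysteines. The Python sketch below is an illustration under that assumption, not the motif-building method used in the paper. The function name and the example peptide are hypothetical, and the output uses a PROSITE-like `x(n)` notation for runs of non-Cys residues.

```python
def cys_spacing_pattern(peptide):
    """Return a spacing signature such as 'C-x(4)-C-x(2)-C-C' for a peptide.

    Only the region from the first to the last cysteine is described;
    residues before the first Cys and after the last Cys are ignored.
    """
    positions = [i for i, aa in enumerate(peptide.upper()) if aa == "C"]
    if not positions:
        return ""
    parts = ["C"]
    for previous, current in zip(positions, positions[1:]):
        gap = current - previous - 1  # residues between two cysteines
        if gap:
            parts.append(f"x({gap})")
        parts.append("C")
    return "-".join(parts)


# Invented peptide for illustration only.
example_pattern = cys_spacing_pattern("MKCAAAACGGCCP")
print(example_pattern)
```

Peptides that share a signature can then be grouped before a more careful alignment, which is where residue identities outside the cysteines start to matter.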

**Figure 3.**
Genomic organization of CCPs on BAC Mth2-34P9. A, Regions of BAC Mth2-34P9 with BLASTX homology less than 10⁻¹⁰ to GenBank sequences are as shown: 1, Ta11 non-Long Terminal Repeat retroelement (Arabidopsis, gi|15226160); 2, Putative retrotransposon gag protein (pea, gi|31126675); 3, POLX_TOBAC retrovirus-related Pol polyprotein (*Nicotiana tabacum*, gi|130582); 4, Vesicle-associated membrane protein (Arabidopsis, gi|15225415); 5, Putative polyprotein (*N. tabacum*, gi|20161451); 6, Albumin 1 (*M. truncatula*, gi|3238736); 7, Protein T31J12.4 (Arabidopsis, gi|25402534); 8, Mariner transposase (*G. max*, gi|7488706); 9, Expressed protein (Arabidopsis, gi|18410357); 10, Scarecrow-like transcription factor (Arabidopsis, gi|15236725); and 11, Transposable element Tnp2 (*Antirrhinum majus*, gi|1345502). B, Organization of CCPs, repeats (R), and MRs. Each thick black line represents a repeat and is named accordingly. MRs and CCPs are shown as boxes above each repeat. MRs are shaded similarly to show that MRs in conserved positions between repeats are more similar to each other than are MRs within a repeat. The only exception is MR3-1, which is shaded differently. Expressed sequences can be identified by the asterisk following the name.

**Figure 4.**
Analysis of MRs from BAC Mth2-34P9. A, Phylogenetic analysis was performed on the aligned nucleotide sequences of MRs. Bootstrap scores are provided at each branch of the tree. In general, MRs in conserved positions across the larger repeats share greater similarity to each other than to other MRs in the same repeat. The only exception is MR3-1, which appeared to have combined features of MR1-1, MR2-1, MR1-2, and MR2-2 (shown in bold). B, The sequences of MR1-1 and MR2-1 were compared to the sequences of MR1-2 and MR2-2 to identify conserved polymorphic sites. The position and sequence of the polymorphism in each group of sequences is indicated. Dashes indicate gaps in the alignment. The sequence of MR3-1 is shown at each of these positions. From polymorphic positions 30 through 157, MR3-1 matches the sequences of MR1-1 and MR2-1. However, from polymorphic positions 215 through 550, MR3-1 matches the sequences of MR1-2 and MR2-2. This indicates that an unequal recombination event somewhere between nucleotides 158 and 214 fused two progenitor MRs together to form MR3-1.
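The breakpoint reasoning in Figure 4B can be sketched in Python as follows. This is a minimal illustration, not the authors' analysis. Only the four positions (30, 157, 215, 550) come from the caption; the alleles at those positions are invented placeholders, and the function name is hypothetical. Each site records the allele shared by group A (MR1-1 and MR2-1), the allele shared by group B (MR1-2 and MR2-2), and the allele seen in the putative recombinant (MR3-1).

```python
def breakpoint_interval(sites):
    """Infer the interval containing a single A-to-B recombination breakpoint.

    sites: list of (position, allele_a, allele_b, allele_recombinant).
    Returns (first_possible, last_possible) nucleotide positions, or None if
    the recombinant does not show exactly one switch from group A to group B.
    """
    assignments = []
    for position, allele_a, allele_b, allele_r in sorted(sites):
        if allele_r == allele_a and allele_r != allele_b:
            assignments.append((position, "A"))
        elif allele_r == allele_b and allele_r != allele_a:
            assignments.append((position, "B"))
        # Sites matching both or neither group are uninformative; skip them.

    switch_index = None
    for i in range(1, len(assignments)):
        if assignments[i][1] != assignments[i - 1][1]:
            if switch_index is not None:
                return None  # more than one switch: not a simple fusion
            switch_index = i
    if switch_index is None or assignments[0][1] != "A":
        return None

    last_a_position = assignments[switch_index - 1][0]
    first_b_position = assignments[switch_index][0]
    return last_a_position + 1, first_b_position - 1


# Positions from the caption; alleles are invented placeholders.
toy_sites = [
    (30, "A", "G", "A"),
    (157, "T", "C", "T"),
    (215, "G", "A", "A"),
    (550, "C", "T", "T"),
]
interval = breakpoint_interval(toy_sites)
print(interval)
```

With these positions the sketch returns the interval from nucleotide 158 to 214, which matches the range stated in the caption. The interval is only as narrow as the spacing of informative sites allows.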
