Review
BMC Genomics. 2006 Aug 9:7:204.
doi: 10.1186/1471-2164-7-204.

Identification and analysis of gene families from the duplicated genome of soybean using EST sequences

Rex T Nelson et al. BMC Genomics. 2006.

Abstract

Background: Large-scale gene analysis of most organisms is hampered by incomplete genomic sequences. In many organisms, such as soybean, the best source of sequence information is expressed sequence tag (EST) libraries. Soybean has a large (1115 Mbp) genome that has yet to be fully sequenced. However, it has the sixth-largest EST collection, composed of ESTs from a variety of soybean genotypes. Many EST libraries were constructed from RNA extracted from various genetic backgrounds; gene identification from these sources is therefore complicated by the existence of both gene and allele sequence differences. We used the ESTminer suite of programs to identify potential soybean gene transcripts from a single genetic background, allowing us to observe functional classifications between gene families as well as structural differences between genes and gene paralogs within families. The identification of potential gene sequences (pHaps) from soybean allows us to begin to reconstruct the genomic history of the organism and to observe the evolutionary fates of gene copies in this highly duplicated genome.

Results: We identified approximately 45,000 potential gene sequences (pHaps) from EST sequences of Williams/Williams82, an inbred genotype of soybean (Glycine max (L.) Merr.), using a redundancy criterion to identify reproducible sequence differences between related genes within gene families. Analysis of these sequences revealed that single-base substitutions and single-base indels are the most frequently observed forms of sequence variation between genes within families in the dataset. Genomic sequencing of selected loci indicates that intron-like intervening sequences are numerous and approximately 220 bp in length. Functional annotation of gene sequences indicates that functional classifications are not randomly distributed among gene families containing few or many genes.
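
The abstract does not spell out the redundancy criterion, so the following is only a minimal sketch of one way such a filter could work, not ESTminer's actual procedure. It assumes the reads of one gene family are already aligned to a common width, and it treats a sequence difference as reproducible when at least two different states at a column are each seen in `min_support` or more reads. The function name, the `min_support` parameter, and the example reads are all hypothetical.

```python
from collections import Counter


def reproducible_variant_columns(aligned_reads, min_support=2):
    """Return 0-based alignment columns where at least two different states
    (bases or '-' gaps) are each observed in `min_support` or more reads.

    Illustrative only: the threshold and the handling of 'N' are assumptions.
    """
    if not aligned_reads:
        return []
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all reads must be aligned to the same width")

    columns = []
    for i in range(width):
        # Count each state in this column, ignoring ambiguous 'N' calls.
        counts = Counter(
            read[i].upper() for read in aligned_reads if read[i].upper() != "N"
        )
        supported = [state for state, n in counts.items() if n >= min_support]
        if len(supported) >= 2:
            columns.append(i)
    return columns


# Hypothetical aligned ESTs from one family (not data from the paper).
reads = ["ACGTAC", "ACGTAC", "ATGTAC", "ATGTAC", "ACGTAA"]
print(reproducible_variant_columns(reads))  # [1]
```

In this toy input the C/T difference at column 1 is seen in several reads on each side and is kept, while the lone A at column 5 appears in a single read and is discarded as a possible sequencing error.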

Conclusion: The predominance of single-nucleotide insertion/deletion and substitution events between genes within families (individual genes and gene paralogs) is consistent with a model of gene amplification followed by random single-base mutational events, as expected under the classical model of duplicated gene evolution. Molecular functions of small and large gene families appear to be non-randomly distributed, possibly indicating a difference in retention of duplicates or local expansion.
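
To make the size classes of variation concrete, here is a small sketch that tallies run lengths of indels and of consecutive substitutions in one gapped pairwise alignment. It is an illustration under stated assumptions rather than the authors' pipeline: the two sequences are assumed to be already aligned with '-' as the gap character, and the function name and example sequences are hypothetical.

```python
from collections import Counter
from itertools import groupby


def variation_run_lengths(seq_a, seq_b):
    """Tally run lengths of indels and consecutive substitutions between
    two gapped, equal-length aligned sequences.

    Returns (indel_lengths, substitution_lengths) as Counters that map
    run length -> number of events of that length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    def state(a, b):
        if a == "-" and b == "-":
            return "match"  # a column with no information; treated as no difference
        if a == "-":
            return "gap_a"
        if b == "-":
            return "gap_b"
        return "match" if a == b else "sub"

    states = [state(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())]
    indels, subs = Counter(), Counter()
    for kind, run in groupby(states):
        length = sum(1 for _ in run)
        if kind in ("gap_a", "gap_b"):
            indels[length] += 1
        elif kind == "sub":
            subs[length] += 1
    return indels, subs


# Hypothetical aligned pair (not data from the paper).
seq_a = "ACGTACGTACGTTAG-C"
seq_b = "ACCTACG---GTCCGAC"
indels, subs = variation_run_lengths(seq_a, seq_b)
print(dict(indels))  # {3: 1, 1: 1}
print(dict(subs))    # {1: 1, 2: 1}
```

Summing such tallies over all gene pairs in a family would give length distributions of the kind summarized in Figure 1.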

Figures

Figure 1
Distribution of insertion lengths and consecutive substitutions within gene families. A) Consecutive base substitutions: single-base substitutions are the primary size class, making up 90% of all substitutions, and the number of substitutions falls rapidly with run length. The largest consecutive stretch of substitutions was 12. B) Insertion lengths as a percentage of the total number of insertion events. There is an excess of insertions of lengths 3, 6, and 9 bases; however, the largest size class is single-base insertions, which compose 58% of all insertion events. The data shown are for insertions less than 16 bases in length.
Figure 2
Distribution of GOslim terms among gene families. Histogram of GOslim terms associated with all gene families. Red bars indicate gene families with multiple genes and blue bars represent gene families composed of a single gene. A single asterisk indicates a significant departure from independence in a Chi-square test (1 df, p ≤ 0.05) and a double asterisk indicates a probability level of p ≤ 0.01. In general, families composed of few genes (single) made up the majority of all family types in each category, with the exception of the categories of structural molecule activity, RNA binding, and lipid binding, where multiple gene families appear to be in the majority.
Figure 3
Distribution of GOslim terms among individual genes. Histogram of GOslim terms associated with all genes. Red bars indicate genes from multiple gene families (multiple) and blue bars represent genes from families with few members (single). Asterisks indicate comparisons where multiple gene families contained more (red) or fewer (blue) members than expected. Significance was judged at the 0.05 probability level (single asterisk) using a Chi-square test in each category. Double asterisks indicate significance at the 0.01 probability level. Genes from multiple member families are the predominant class of genes in each category. The pHaps were not randomly distributed among the GO categories: proteins involved in kinase, hydrolase, oxygen binding, transcription regulator, nuclease, signal transducer, and transcription factor activities appear to contain fewer members than the average multiple gene family, while families in the categories of enzyme regulator, structural molecule, and catalytic activity and receptor, protein, and lipid binding appear to have larger than average multiple gene families.
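
The captions for Figures 2 and 3 refer to a Chi-square test with 1 degree of freedom. As a rough sketch of that kind of test, the code below computes the Pearson statistic for a 2x2 contingency table and its p-value using only the standard library. The table layout (inside versus outside one GOslim category, single versus multiple gene families) is an assumption about how the comparison might be set up, the counts are invented for illustration, and no continuity correction is applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the 2x2 table
    [[a, b], [c, d]], without continuity correction.

    Returns (statistic, p_value). With 1 degree of freedom the upper tail
    probability is erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts (not from the paper):
# rows = in category / not in category, columns = single / multiple families.
stat, p = chi_square_2x2(10, 20, 20, 10)
print(round(stat, 3))  # 6.667
print(p < 0.01)        # True
```

Under the asterisk convention in the captions, a result like this one would be marked with a double asterisk.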
