Plant Mol Biol. 2009 Jan;69(1-2):179-94.
doi: 10.1007/s11103-008-9415-4. Epub 2008 Oct 21.

Insights into corn genes derived from large-scale cDNA sequencing

Nickolai N Alexandrov et al. Plant Mol Biol. 2009 Jan.

Abstract

We present a large portion of the transcriptome of Zea mays, including ESTs representing 484,032 cDNA clones from 53 libraries and 36,565 fully sequenced cDNA clones, out of which 31,552 clones are non-redundant. These and other previously sequenced transcripts have been aligned with available genome sequences and have provided new insights into the characteristics of gene structures and promoters within this major crop species. We found that although the average number of introns per gene is about the same in corn and Arabidopsis, corn genes have more alternatively spliced isoforms. Examination of the nucleotide composition of coding regions reveals that corn genes, as well as genes of other Poaceae (Grass family), can be divided into two classes according to the GC content at the third position in the amino acid encoding codons. Many of the transcripts that have lower GC content at the third position have dicot homologs but the high GC content transcripts tend to be more specific to the grasses. The high GC content class is also enriched with intronless genes. Together this suggests that an identifiable class of genes in plants is associated with the Poaceae divergence. Furthermore, because many of these genes appear to be derived from ancestral genes that do not contain introns, this evolutionary divergence may be the result of horizontal gene transfer from species not only with different codon usage but possibly that did not have introns, perhaps outside of the plant kingdom. By comparing the cDNAs described herein with the non-redundant set of corn mRNAs in GenBank, we estimate that there are about 50,000 different protein coding genes in Zea. All of the sequence data from this study have been submitted to DDBJ/GenBank/EMBL under accession numbers EU940701-EU977132 (FLI cDNA) and FK944382-FL482108 (EST).
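The two-class split described in the abstract rests on the GC content at the third codon position (GC3). The following is a minimal sketch of how such a value could be computed for one coding sequence. It is not the authors' pipeline: the function names `gc3` and `gc12` are hypothetical, the input is assumed to be an in-frame CDS starting at the first codon, and GC content is taken to be the fraction (G+C)/(all bases), which is an assumption consistent with the reported GC3 peaks near 0.5 and 0.9.

```python
def _gc_fraction(bases: str) -> float:
    """Fraction of G or C among the given bases (assumes a non-empty string)."""
    return sum(base in "GC" for base in bases) / len(bases)


def gc3(cds: str) -> float:
    """GC fraction at third codon positions of an in-frame coding sequence.

    Assumption: the sequence starts at codon position 1; any trailing
    partial codon is ignored. Ambiguous bases (e.g. N) count as non-GC.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3      # drop an incomplete final codon
    thirds = seq[2:usable:3]              # every third base: indices 2, 5, 8, ...
    if not thirds:
        raise ValueError("sequence shorter than one codon")
    return _gc_fraction(thirds)


def gc12(cds: str) -> float:
    """GC fraction at first and second codon positions (same assumptions as gc3)."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    firsts_and_seconds = seq[0:usable:3] + seq[1:usable:3]
    if not firsts_and_seconds:
        raise ValueError("sequence shorter than one codon")
    return _gc_fraction(firsts_and_seconds)
```

Computing `gc3` for every CDS in a transcript set and plotting a histogram of the values would be one way to look for the bimodal pattern described for the grasses.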

Figures

Fig. 1
Distribution of the number of clones (5′ ESTs) among clusters represented by a set of 10,084 full-length corn cDNA clones. Relatively few genes have many ESTs whereas many have only one or a few ESTs. The distribution can be approximated by a power function (inset; linear function in log scale)
Fig. 2
Sequence logo for the ATG consensus in corn. The logo is based on 9,920 sequences. The figure was produced using the WebLogo tool (Crooks et al. 2004)
Fig. 3
Distribution of GC in the coding region of corn, Arabidopsis and rice. The GC content in the coding region of corn cDNAs is bimodal and the high GC content can be explained by the abundance of GC in the third position of the codons. A similar result is observed for rice but Arabidopsis is unimodal. GC indicates the ratio of GC versus AT. GC12 represents the ratio of GC versus AT in the 1st and 2nd positions of the codons. GC3 represents the ratio of GC versus AT in the 3rd position of the codons
Fig. 4
Distribution of the GC content in the third codon position of CDSs of different plant species. All grasses have a broad distribution with two peaks whereas the dicots have a unimodal distribution. All CDS sequences except corn (we used sequences described in this paper) and Arabidopsis (we used TAIR annotation) were downloaded from the J. Craig Venter Institute (JCVI, formerly known as TIGR) ftp site ftp://ftp.tigr.org/pub/data/plantta/. The number of unique transcripts for each species is: switchgrass 7,638, Arabidopsis 27,983, poplar 12,687, canola 10,709, Medicago 20,414, cotton 24,797, corn 10,084, rice 49,870, sorghum 20,714 and wheat 62,121
Fig. 5
Distribution of the GC content in the third codon position among different groups of rice genes. Only those genes which have mRNA evidence in TIGR version 5 annotation (total 23,721 genes) were considered. GC3 distribution of these genes has two peaks at about 0.5 and 0.9. Genes without introns (4,452) are more prevalent in the high GC3 peak. Genes sharing similarity with Arabidopsis (blast P-value < 1.e-50, best reciprocal hit, 7,924 genes) are mostly in the lower GC3 peak whereas genes (1,664) similar to Myxobacteria (blast P-value < 1.e-3 and not matching Arabidopsis) are mostly in the high GC3 peak. 17,100 known protein sequences of the order Myxococcales from GenBank were used for comparison
Fig. 6
Distribution of the exon number in corn, Arabidopsis and rice genes. 2,793 of 10,084 full-length corn clones were mapped to corn genomic sequences of >20,000 bps to ascertain the number of exons. This subset is biased towards shorter genes, which may overestimate frequencies of genes with a smaller number of exons and underestimate frequencies of genes with a larger number of exons. This effect can be seen in the distributions for Arabidopsis genes: one was obtained using Arabidopsis cDNA clones produced by a similar technology (Arabidopsis cDNA) and the other was derived from all genes in the TAIR genome annotation having mRNA support (Arabidopsis all). The distribution of exons in rice genes was obtained from 23,721 genes with mRNA support from TIGR rice genome annotation, release 6, and is shown for comparison
Fig. 7
Intron length distribution in corn, Arabidopsis and rice. Introns in both the coding and non-coding parts of the mRNA were used in this analysis. All three species have similar modes for intron length, although corn and rice genes have longer introns on average
Fig. 8
Average gene expression increases with the number of introns in genes. The number of 5′ ESTs in each cluster was used to estimate expression. These 5′ ESTs were derived from primary libraries and so reasonably estimate mRNA abundance in the libraries. The greater the number of exons in a gene the greater its expression, as measured by the number of 5′ ESTs
Fig. 9
Gini index for corn and Arabidopsis introns. 96% of Arabidopsis introns and 92% of corn introns have a Gini index equal to 0 meaning that there are no variants (the data point is not shown). A larger Gini index in corn means that corn transcripts are more variable
Fig. 10
Relative frequencies of different types of alternative splicing events. The frequencies of different alternative splicing events were computed from the alignment of 563,251 transcripts with corn genomic sequences. 289,608 mapped transcripts are from Ceres libraries; the other 273,643 transcript sequences were downloaded from GenBank
Fig. 11
Distribution of nucleotides around the Transcription Start Site of corn based on 5,200 promoters that have TSSs predicted by at least four 5′ ESTs. There is a peak of A/T at position -30 and a peak of C/A just prior to the TSS
Fig. 12
The most significant words in corn (a) and in Arabidopsis (b) promoters. The analysis is performed on a subset of 5,200 corn promoters and 5,050 Arabidopsis promoters that have TSSs predicted by at least four 5′ ESTs. For corn, there is a prominent CA peak at the TSS and a smaller TATA motif at position -30. This is in sharp contrast to Arabidopsis, where TATA is more frequent than CA
Fig. 13
Frequency of a TATA box in promoters of different strengths. Strong and weak promoters have a TATA box more often than genes with average expression. In corn, TATA boxes are more frequent in stronger genes whereas in Arabidopsis TATA boxes are more frequent in weaker promoters
Fig. 14
CG skew plot for corn TSSs calculated as average CG skew in a sliding window of 40 nucleotides. The CG skew observed for corn is similar to what we have previously observed for Arabidopsis
Fig. 15
Corn proteins are more similar to rice than to Arabidopsis. The few exceptions are due to genes missed in the rice annotation, random fluctuations and possible contamination of corn cDNA clones by cDNAs from other organisms. 10,084 corn proteins, TAIR Arabidopsis genome annotation and TIGR rice annotation were used for comparison. Only matches with P-value ≤ 1.e-10, covering at least 70% of the protein length are shown in the plot
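
The Fig. 14 caption describes CG skew averaged in a sliding window of 40 nucleotides around the TSS. A minimal sketch of that kind of calculation follows. The skew formula (C − G)/(C + G) is an assumption, since the record does not state the definition used, and the function names `cg_skew_profile` and `average_profiles` are hypothetical. The sequences are assumed to be promoter regions of equal length, aligned on the TSS.

```python
from typing import List


def cg_skew_profile(seq: str, window: int = 40) -> List[float]:
    """CG skew, assumed here to be (C - G) / (C + G), for each window position.

    Windows that contain neither C nor G are assigned a skew of 0.0
    (an illustrative choice, not taken from the source).
    """
    seq = seq.upper()
    profile = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        c = chunk.count("C")
        g = chunk.count("G")
        profile.append((c - g) / (c + g) if c + g else 0.0)
    return profile


def average_profiles(profiles: List[List[float]]) -> List[float]:
    """Position-wise mean over skew profiles of equal length."""
    return [sum(column) / len(column) for column in zip(*profiles)]
```

Averaging the per-sequence profiles over a set of TSS-aligned promoters gives one curve per species, which is the kind of plot the caption refers to.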

References

    1. Alexandrov NN, Troukhan ME et al (2006) Features of Arabidopsis genes and genome discovered using full-length cDNAs. Plant Mol Biol 60(1):69–85. doi:10.1007/s11103-005-2564-9
    2. Beletskii A, Bhagwat AS (1996) Transcription-induced mutations: increase in C to T mutations in the nontranscribed strand during transcription in Escherichia coli. Proc Natl Acad Sci USA 93(24):13919–13924. doi:10.1073/pnas.93.24.13919
    3. Berardini TZ, Mundodi S et al (2004) Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol 135(2):745–755. doi:10.1104/pp.104.040071
    4. Bull CT, Shetty KG, Subbarao KV (2002) Interactions between Myxobacteria, plant pathogenic fungi, and biocontrol agents. Plant Dis 86:889–896. doi:10.1094/PDIS.2002.86.8.889
    5. Campbell WH, Gowri G (1990) Codon usage in higher plants, green algae, and cyanobacteria. Plant Physiol 92(1):1–11. doi:10.1104/pp.92.1.1
