DNA Res. 2010 Oct;17(5):271-9.
doi: 10.1093/dnares/dsq017. Epub 2010 Jul 28.

Efficient plant gene identification based on interspecies mapping of full-length cDNAs

Naoki Amano et al. DNA Res. 2010 Oct.

Abstract

We present an annotation pipeline that accurately predicts exon-intron structures and protein-coding sequences (CDSs) on the basis of full-length cDNAs (FLcDNAs). This annotation pipeline was used to identify genes in 10 plant genomes. In particular, we show that interspecies mapping of FLcDNAs to genomes is of great value in fully utilizing FLcDNA resources whose availability is limited to several species. Because low sequence conservation at 5'- and 3'-ends of FLcDNAs between different species tends to result in truncated CDSs, we developed an improved algorithm to identify complete CDSs by the extension of both ends of truncated CDSs. Interspecies mapping of 71 801 monocot FLcDNAs to the Oryza sativa genome led to the detection of 22 142 protein-coding regions. Moreover, in comparing two mapping programs and three ab initio prediction programs, we found that our pipeline was more capable of identifying complete CDSs. As demonstrated by monocot interspecies mapping, in which nucleotide identity between FLcDNAs and the genome was ∼80%, the resultant inferred CDSs were sufficiently accurate. Finally, we applied both inter- and intraspecies mapping to 10 monocot and dicot genomes and identified genes in 210 551 loci. Interspecies mapping of FLcDNAs is expected to effectively predict genes and CDSs in newly sequenced genomes.
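The idea of extending both ends of a truncated CDS can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given in the abstract; it assumes a simplified rule in which the CDS is extended in frame upstream to the nearest start codon (stopping at any in-frame stop codon) and downstream to the first in-frame stop codon. The function name `extend_cds` and the use of a spliced, forward-strand sequence with 0-based half-open coordinates are assumptions made for illustration.

```python
# Hypothetical sketch: in-frame extension of a truncated CDS on a spliced,
# forward-strand sequence. Not the pipeline's actual implementation.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def extend_cds(seq, start, end):
    """Extend the CDS seq[start:end] in frame toward both ends.

    Returns (new_start, new_end, complete), where `complete` is True only
    if the resulting CDS begins with ATG and ends with a stop codon.
    """
    seq = seq.upper()
    if not (0 <= start < end <= len(seq)) or (end - start) % 3 != 0:
        raise ValueError("CDS must lie within seq and span whole codons")

    # Upstream: step back one codon at a time until the nearest ATG is found,
    # or give up at an in-frame stop codon or the start of the sequence.
    new_start = start
    has_start = seq[start:start + 3] == START_CODON
    pos = start
    while not has_start and pos - 3 >= 0:
        pos -= 3
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon == START_CODON:
            new_start = pos
            has_start = True

    # Downstream: step forward one codon at a time until the first stop codon,
    # which is included in the extended CDS.
    new_end = end
    has_stop = seq[end - 3:end] in STOP_CODONS
    pos = end
    while not has_stop and pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            new_end = pos + 3
            has_stop = True
        pos += 3

    return new_start, new_end, has_start and has_stop


# Toy sequence: CC ATG GCT AAA GGG TAA CC, with a truncated CDS covering GCT AAA.
print(extend_cds("CCATGGCTAAAGGGTAACC", 5, 11))  # (2, 17, True)
```

In this toy case the truncated CDS at positions 5 to 11 is extended to positions 2 to 17, which spans the start codon through the stop codon. A real pipeline would also need to handle reverse-strand genes, genomic coordinates across introns, and the choice among multiple candidate start codons.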

Figures

Figure 1
An overview of the interspecies mapping algorithm.
Figure 2
Problems with interspecies mapping. Alignment errors between a given FLcDNA and genome sequence pair have three possible causes. (A) Multiple duplicated genes encompassed by a single cDNA. (B) Erroneously short introns. (C) Alignment errors around splice sites.
Figure 3
Relationship between species classification and mapping ratio. The horizontal axis indicates the classification, and the vertical axis indicates the mapping ratio. We mapped FLcDNAs from three monocots and four dicots to the O. sativa (rice) and Z. mays (maize) genomes, and FLcDNAs from three dicots and six monocots to the A. thaliana (Arabidopsis) genome. Bars at the top of the boxes represent the standard deviations.
Figure 4
Correlation between nucleotide identity and SP. The horizontal axis indicates the nucleotide identity, and the vertical axis indicates the SP of all introns in CDSs. Open circles and triangles indicate within-monocot and -dicot mapping, respectively, and filled circles and triangles indicate between-dicot and -monocot mapping, respectively. The straight line shows the linear regression for all data (r = 0.88).
Figure 5
Examples of interspecies mapping to the O. sativa genome. Oryza sativa exon–intron structures (RAP representative) were retrieved from RAP-DB. The two characters before the FLcDNA accession numbers indicate the species names: HV for H. vulgare and ZM for Z. mays. Black, red, and yellow regions represent CDSs, extended CDSs, and UTR regions, respectively. (A) The same exon–intron structures between an O. sativa FLcDNA (INSDC: AK067543) and Z. mays FLcDNAs with extended CDS regions. (B) A truncated FLcDNA of O. sativa (INSDC: AK106806) and a complete structure predicted by an H. vulgare FLcDNA. (C) A new locus identified by an H. vulgare FLcDNA (INSDC: AK248420) in a region between Os08g0206900 and Os08g0207000.


