Genome Res. 2005 Apr;15(4):566-76. doi: 10.1101/gr.3030405.

ECgene: genome-based EST clustering and gene modeling for alternative splicing

Authors

Namshin Kim¹, Seokmin Shin, Sanghyuk Lee

Affiliation

¹ Division of Molecular Life Sciences, Ewha Womans University, Seoul 120-750, Korea.

PMID: 15805497
PMCID: PMC1074371

Abstract

With the availability of the human genome map and fast algorithms for sequence alignment, genome-based EST clustering became a viable method for gene modeling. We developed a novel gene-modeling method, ECgene (Gene modeling by EST Clustering), which combines genome-based EST clustering and the transcript assembly procedure in a coherent and consistent fashion. Specifically, ECgene takes alternative splicing events into consideration. The position of splice sites (i.e., exon-intron boundaries) in the genome map is utilized as the critical information in the whole procedure. Sequences that share any splice sites are grouped together to define an EST cluster in a manner similar to that of the genome-based version of the UniGene algorithm. Transcript assembly is achieved using graph theory that represents the exon connectivity in each cluster as a directed acyclic graph (DAG). Distinct paths along exons correspond to possible gene models encompassing all alternative splicing events. EST sequences in each cluster are subclustered further according to the compatibility with gene structure of each splice variant, and they can be regarded as clone evidence for the corresponding isoform. The reliability of each isoform is assessed from the nature of cluster members and from the minimum number of clones required to reconstruct all exons in the transcript.
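The clustering step described above (sequences sharing any splice site fall into the same cluster) can be sketched with a union-find structure. This is a minimal illustration, not the ECgene implementation: the function name, the tuple encoding of a splice site as `(chrom, strand, position, kind)`, and the toy EST identifiers are all assumptions made for the example.

```python
from collections import defaultdict


def cluster_by_shared_splice_sites(alignments):
    """Group sequence IDs that share at least one splice site.

    alignments maps a sequence ID to a set of splice sites. Each site is a
    hashable tuple; the (chrom, strand, position, kind) layout is hypothetical.
    """
    parent = {seq_id: seq_id for seq_id in alignments}

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first sequence seen at each site and merge later ones into it.
    first_seen = {}
    for seq_id, sites in alignments.items():
        for site in sites:
            if site in first_seen:
                union(first_seen[site], seq_id)
            else:
                first_seen[site] = seq_id

    clusters = defaultdict(set)
    for seq_id in alignments:
        clusters[find(seq_id)].add(seq_id)
    return sorted(sorted(members) for members in clusters.values())


# Toy data (hypothetical): est1-est3 are chained by shared sites, est4 is alone.
example = {
    "est1": {("chr13", "+", 1000, "donor"), ("chr13", "+", 2000, "acceptor")},
    "est2": {("chr13", "+", 2000, "acceptor"), ("chr13", "+", 3000, "donor")},
    "est3": {("chr13", "+", 3000, "donor")},
    "est4": {("chr13", "+", 9000, "donor")},
}
clusters = cluster_by_shared_splice_sites(example)
```

Note that clustering is transitive here: `est1` and `est3` share no site directly but end up together through `est2`.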

Figures

**Figure 1.**
Flowchart of the ECgene algorithm. The primary cluster is a collection of spliced sequences sharing at least one splice site. The alternatively spliced clusters are subclusters of a primary cluster obtained from graph-theoretic analysis of exon connectivity. Unspliced ESTs are added to produce the final ECgene gene models, which are classified into three groups according to the transcript reliability.

**Figure 2.**
Transcript assembly procedure based on graph theory. (A) Example of genomic alignment of multi-exon sequences comprising an ECgene cluster. Exons are marked as A, B, C, ..., and sequences are numbered 1, 2, 3, ... Exons A and B represent an example of alternative transcription start sites. Exons D, E, F, and G show exon-skipping events, whereas exons F and G occupy the same genomic loci with different 3′ splice sites (acceptor splice site variation). Sequence #14 shows an example of intron retention at exon I. PolyA tails are indicated as small red boxes; they do not align onto the genome. (B) Directed acyclic graph (DAG) representation of genomic alignment. Nodes and edges represent exons and introns, respectively. Exons are colored according to the type of node. Source nodes, which have outgoing arrows only, are shown in brown, and terminal nodes, which have incoming arrows only, are shown in blue. Internal nodes are colored green. (C) Transcript models and sequence members. Transcript models in the yellow boxes are the initial solutions from a depth-first search (DFS) that starts from one of the source nodes and ends at one of the terminal nodes. After mapping sequences onto the DFS solution, unsupported exons (indicated in red) are trimmed off and redundant transcript models are removed. This produces the intermediate gene models shown in green boxes. Then we examine sequences with a polyA tail (shown in blue letters) and ascertain that each transcript has only one polyA site. Truncation at the polyA site in sequence #2 creates a new exon, D′. Final transcript models and sequence members are shown with the MinClones. For example, the third transcript model (A-C-D-E-G-H) is a concatenation of ESTs #4 and #11, and the number of MinClones = 2.
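The DFS enumeration and the MinClones count in this caption can be illustrated with a small sketch. It is a simplified reading rather than the published procedure: the toy exon graph and clone chains below are hypothetical (they are not the data in Figure 2), a clone is treated as compatible only if its exon chain is a contiguous sub-path of the transcript, and MinClones is found by brute force as the smallest set of compatible clones whose exons cover the transcript. Exon trimming and polyA handling are left out.

```python
from itertools import combinations


def enumerate_paths(edges):
    """Return every source-to-terminal path in a DAG using depth-first search."""
    successors = {}
    nodes = set()
    has_incoming = set()
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        nodes.update((src, dst))
        has_incoming.add(dst)
    sources = sorted(nodes - has_incoming)  # nodes with outgoing edges only

    paths = []

    def dfs(node, path):
        next_nodes = successors.get(node, [])
        if not next_nodes:  # terminal node: record the finished path
            paths.append(tuple(path))
            return
        for nxt in sorted(next_nodes):
            path.append(nxt)
            dfs(nxt, path)
            path.pop()

    for source in sources:
        dfs(source, [source])
    return paths


def is_contiguous_subpath(clone, transcript):
    """True if the clone's exon chain appears unbroken inside the transcript."""
    n = len(clone)
    return any(
        tuple(transcript[i:i + n]) == tuple(clone)
        for i in range(len(transcript) - n + 1)
    )


def min_clones(transcript, clones):
    """Smallest number of compatible clones covering all exons, or None."""
    compatible = [c for c in clones.values() if is_contiguous_subpath(c, transcript)]
    target = set(transcript)
    for k in range(1, len(compatible) + 1):
        for combo in combinations(compatible, k):
            if set().union(*combo) == target:
                return k
    return None


# Hypothetical exon graph: A and B are alternative first exons, D is skippable.
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
paths = enumerate_paths(edges)

# Hypothetical clones, each given as its ordered exon chain.
clones = {
    "est1": ("A", "C", "D"),
    "est2": ("D", "E", "F"),
    "est3": ("C", "E", "F"),
    "est4": ("B", "C"),
}
supported = min_clones(("A", "C", "D", "E", "F"), clones)
unsupported = min_clones(("A", "C", "E", "F"), clones)
```

In this toy case the path A-C-E-F comes out of the DFS but no clone supports exon A in that context, so `min_clones` returns `None`; this is the kind of model whose unsupported exons the caption says are trimmed off.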

**Figure 3.**
ECgene genome browser. (A) Dense view showing the gene structure. The ECgene ID “H13C1492.1” indicates that the gene is the 1,492nd cluster located on human chromosome 13. The variant number is appended after the ECgene ID. The title line has additional information. “[10/15/53][F][High, 1][mA][no stop codon]” means that this cluster has 53 sequences. The first variant has 15 sequence members, 10 of which are multi-exon clones. The transcript is on the sense (+) strand. It contains mRNA sequence and has polyA evidence. [High, 1] means that the transcript belongs to the ECgene Part A, and the number of MinClones = 1. (B) Expanded view showing sequence alignment. The first variant has a polyA tail on BC047568 mRNA. The third variant belongs to the ECgene Part C (Low reliability) with the number of MinClones = 4. Representative clones belonging to the minimal set are indicated with the “#” sign in front of the accession number. Information on the EST read direction and the presence of mRNA or polyA is appended to the accession number. The browser supports an option of viewing unspliced alignments. If the option of showing EST alignment is unchecked, it will show just the transcript models in a single track. The navigating bars provided in the upper window should be used to make a query to our database. Otherwise, the data in the custom tracks do not change.
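The ID and title-line conventions described in this caption can be unpacked with a short parser. The patterns below are inferred only from the single example given here ("H13C1492.1" and "[10/15/53][F][High, 1][mA][no stop codon]"), so the regular expressions, the field names, and the reading of the flags "m" as mRNA and "A" as polyA evidence are assumptions, not a documented ECgene format.

```python
import re

# Assumed layout: species letter, chromosome, "C", cluster number, ".", variant.
ID_PATTERN = re.compile(
    r"^(?P<species>[A-Za-z])(?P<chrom>\d+|[XY])C(?P<cluster>\d+)\.(?P<variant>\d+)$"
)

# Assumed layout: [multi-exon/variant members/cluster size][strand code]
# [reliability, MinClones][evidence flags][free-text note].
TITLE_PATTERN = re.compile(
    r"^\[(\d+)/(\d+)/(\d+)\]\[([A-Za-z])\]\[(\w+),\s*(\d+)\]\[(\w*)\]\[(.*)\]$"
)


def parse_ecgene_id(ecgene_id):
    """Split an ID such as 'H13C1492.1' into its assumed components."""
    match = ID_PATTERN.match(ecgene_id)
    if match is None:
        raise ValueError(f"unrecognized ECgene ID: {ecgene_id!r}")
    return {
        "species": match.group("species"),
        "chromosome": match.group("chrom"),
        "cluster": int(match.group("cluster")),
        "variant": int(match.group("variant")),
    }


def parse_title_line(title):
    """Split a browser title line into its assumed fields."""
    match = TITLE_PATTERN.match(title)
    if match is None:
        raise ValueError(f"unrecognized title line: {title!r}")
    multi_exon, members, cluster_size, strand, level, minclones, flags, note = match.groups()
    return {
        "multi_exon_clones": int(multi_exon),
        "variant_members": int(members),
        "cluster_size": int(cluster_size),
        "strand_code": strand,
        "reliability": level,
        "min_clones": int(minclones),
        "has_mrna": "m" in flags,   # assumption: 'm' marks mRNA evidence
        "has_polya": "A" in flags,  # assumption: 'A' marks polyA evidence
        "note": note,
    }


parsed_id = parse_ecgene_id("H13C1492.1")
parsed_title = parse_title_line("[10/15/53][F][High, 1][mA][no stop codon]")
```

Both functions raise `ValueError` on anything that does not match, which is safer than guessing given that the format is reconstructed from one example.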

Web site references

1. http://genome.ewha.ac.kr/ECgene; ECgene Web site.
2. ftp://ftp.ncbi.nlm.nih.gov/genbank/; GenBank FTP site.
3. ftp://hgdownload.cse.ucsc.edu/goldenPath/; Genome Browser FTP site at the UCSC Genome Center.
4. ftp://ftp.ncbi.nlm.nih.gov/refseq/release/; RefSeq FTP site.
5. http://genome.ucsc.edu; UCSC Genome Bioinformatics Home.
