Genome Res. 2005 Apr;15(4):566-76.
doi: 10.1101/gr.3030405.

ECgene: genome-based EST clustering and gene modeling for alternative splicing

Namshin Kim et al. Genome Res. 2005 Apr.

Abstract

With the availability of the human genome map and fast algorithms for sequence alignment, genome-based EST clustering became a viable method for gene modeling. We developed a novel gene-modeling method, ECgene (Gene modeling by EST Clustering), which combines genome-based EST clustering and the transcript assembly procedure in a coherent and consistent fashion. Specifically, ECgene takes alternative splicing events into consideration. The position of splice sites (i.e., exon-intron boundaries) in the genome map is utilized as the critical information in the whole procedure. Sequences that share any splice sites are grouped together to define an EST cluster in a manner similar to that of the genome-based version of the UniGene algorithm. Transcript assembly is achieved using graph theory that represents the exon connectivity in each cluster as a directed acyclic graph (DAG). Distinct paths along exons correspond to possible gene models encompassing all alternative splicing events. EST sequences in each cluster are subclustered further according to the compatibility with gene structure of each splice variant, and they can be regarded as clone evidence for the corresponding isoform. The reliability of each isoform is assessed from the nature of cluster members and from the minimum number of clones required to reconstruct all exons in the transcript.
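
The clustering rule described in the abstract (sequences that share any splice site fall into the same cluster) can be sketched with a union-find structure. This is a minimal illustration, not the ECgene implementation: the function name `cluster_by_splice_sites`, the input layout, and the `(chromosome, strand, position)` representation of a splice site are all assumptions made for the example.

```python
from collections import defaultdict


def cluster_by_splice_sites(alignments):
    """Group sequence IDs that share at least one splice site.

    `alignments` maps a sequence ID to a set of splice sites. Each site
    is a hypothetical (chromosome, strand, position) tuple; the real
    ECgene data model is not described in the abstract.
    """
    parent = {seq: seq for seq in alignments}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Merge every sequence with the first sequence seen at each site.
    first_seen = {}
    for seq, sites in alignments.items():
        for site in sites:
            if site in first_seen:
                union(first_seen[site], seq)
            else:
                first_seen[site] = seq

    clusters = defaultdict(set)
    for seq in alignments:
        clusters[find(seq)].add(seq)
    return sorted(sorted(members) for members in clusters.values())


example = {
    "est1": {("chr13", "+", 100), ("chr13", "+", 200)},
    "est2": {("chr13", "+", 200), ("chr13", "+", 300)},
    "mrna1": {("chr13", "+", 300)},
    "est3": {("chr13", "+", 900)},
}
clusters = cluster_by_splice_sites(example)
```

Here `est1` and `mrna1` share no site directly but end up in the same cluster through `est2`, which is the transitive grouping that "share any splice sites" implies. Sequences with no splice sites (unspliced ESTs) stay as singletons in this sketch.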

Figures

Figure 1.
Flowchart of the ECgene algorithm. The primary cluster is a collection of spliced sequences sharing at least one splice site. The alternatively spliced clusters are subclusters of a primary cluster obtained from graph-theoretic analysis of exon connectivity. Unspliced ESTs are added to produce the final ECgene gene models, which are classified into three groups according to the transcript reliability.
Figure 2.
Transcript assembly procedure based on graph theory. (A) Example of genomic alignment of multi-exon sequences comprising an ECgene cluster. Exons are marked as A, B, C, ..., and sequences are numbered 1, 2, 3, ... Exons A and B represent an example of alternative transcription start sites. Exons D, E, F, and G show exon-skipping events, whereas exons F and G occupy the same genomic loci with different 3′ splice sites (acceptor splice site variation). Sequence #14 shows an example of intron retention at exon I. PolyA tails are indicated as small red boxes; they do not align onto the genome. (B) Directed acyclic graph (DAG) representation of the genomic alignment. Nodes and edges represent exons and introns, respectively. Exons are colored according to node type. Source nodes with outgoing arrows only are shown in brown, and terminal nodes with incoming arrows only are shown in blue. Internal nodes are colored green. (C) Transcript models and sequence members. Transcript models in the yellow boxes are the initial solutions from a depth-first search (DFS) that starts from one of the source nodes and ends at one of the terminal nodes. After mapping sequences onto the DFS solutions, unsupported exons (indicated in red) are trimmed off and redundant transcript models are removed. This produces the intermediate gene models shown in green boxes. Then we examine sequences with a polyA tail (shown in blue letters) and ascertain that each transcript has only one polyA site. Truncation at the polyA site in sequence #2 creates a new exon, D′. Final transcript models and sequence members are shown with the MinClones. For example, the third transcript model (A-C-D-E-G-H) is a concatenation of ESTs #4 and #11, and the number of MinClones = 2.
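
The DFS step and the MinClones count in this caption can be illustrated with a short sketch. It rests on stated assumptions rather than the published code: the exon graph below is a small made-up example (not the graph in Figure 2), the names `enumerate_transcripts` and `min_clones` are hypothetical, and MinClones is interpreted here as the smallest set of compatible clones whose exons together cover the transcript, found by brute force.

```python
from itertools import combinations


def enumerate_transcripts(edges):
    """List every source-to-terminal path in an exon DAG via DFS.

    `edges` maps an exon label to the exons that can follow it.
    """
    nodes = set(edges) | {v for targets in edges.values() for v in targets}
    has_incoming = {v for targets in edges.values() for v in targets}
    sources = sorted(nodes - has_incoming)  # exons with outgoing arrows only
    paths = []

    def dfs(node, path):
        successors = edges.get(node, [])
        if not successors:  # terminal node: the path is a candidate model
            paths.append(path)
            return
        for nxt in successors:
            dfs(nxt, path + [nxt])

    for source in sources:
        dfs(source, [source])
    return paths


def min_clones(transcript, clones):
    """Smallest number of clones whose exons jointly cover `transcript`.

    Only clones whose exons all occur in the transcript are considered
    (a simplified stand-in for structural compatibility). Returns
    (count, clone_ids), or None if no combination covers the transcript.
    """
    target = set(transcript)
    compatible = {
        cid: set(exons) for cid, exons in clones.items() if set(exons) <= target
    }
    for k in range(1, len(compatible) + 1):
        for combo in combinations(sorted(compatible), k):
            if set().union(*(compatible[c] for c in combo)) == target:
                return k, combo
    return None


# Hypothetical exon graph: two start exons (A, B) and one skippable exon (E).
exon_graph = {"A": ["C"], "B": ["C"], "C": ["D"], "D": ["E", "H"], "E": ["H"]}
models = enumerate_transcripts(exon_graph)

# Hypothetical clone-to-exon assignments for one model.
clone_exons = {"est4": ["A", "C", "D"], "est11": ["D", "E", "H"], "est7": ["B", "C"]}
support = min_clones(["A", "C", "D", "E", "H"], clone_exons)
```

In this toy case the model A-C-D-E-H needs two clones (`est4` and `est11`), mirroring the MinClones = 2 example in the caption. The sketch omits the trimming of unsupported exons and the polyA-site truncation, and the brute-force search is only practical for small clusters.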
Figure 3.
ECgene genome browser. (A) Dense view showing the gene structure. The ECgene ID “H13C1492.1” indicates that the gene is the 1,492nd cluster located on human chromosome 13. The variant number is appended after the ECgene ID. The title line has additional information. “[10/15/53][F][High, 1][mA][no stop codon]” means that this cluster has 53 sequences. The first variant has 15 sequence members, 10 of which are multi-exon clones. The transcript is on the sense (+) strand. It contains mRNA sequence and has polyA evidence. [High, 1] means that the transcript belongs to the ECgene Part A, and the number of MinClones = 1. (B) Expanded view showing sequence alignment. The first variant has a polyA tail on BC047568 mRNA. The third variant belongs to the ECgene Part C (Low reliability) with the number of MinClones = 4. Representative clones belonging to the minimal set are indicated with the “#” sign in front of the accession number. Information on the EST read direction and the presence of mRNA or polyA is appended to the accession number. The browser supports an option of viewing unspliced alignments. If the option of showing EST alignment is unchecked, it will show just the transcript models in a single track. The navigating bars provided in the upper window should be used to make a query to our database. Otherwise, the data in the custom tracks do not change.
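
The ID and title-line conventions described in this caption can be unpacked with a small parser. This is a sketch inferred from the single example given ("H13C1492.1" and "[10/15/53][F][High, 1][mA][no stop codon]"); the function names, the field names, the reading of the leading letter as a species code, and the handling of non-numeric chromosomes are assumptions, and the strand and evidence flags are kept as raw strings rather than interpreted.

```python
import re

# Assumed layout: species letter, chromosome, "C", cluster number, optional
# ".variant". Accepting X, Y, and M as chromosome names is a guess.
ID_PATTERN = re.compile(
    r"^(?P<species>[A-Z])(?P<chrom>\d+|[XYM])C(?P<cluster>\d+)"
    r"(?:\.(?P<variant>\d+))?$"
)


def parse_ecgene_id(ecgene_id):
    """Split an ECgene ID such as 'H13C1492.1' into its parts."""
    match = ID_PATTERN.match(ecgene_id)
    if match is None:
        raise ValueError(f"unrecognized ECgene ID: {ecgene_id!r}")
    variant = match.group("variant")
    return {
        "species": match.group("species"),
        "chromosome": match.group("chrom"),
        "cluster": int(match.group("cluster")),
        "variant": int(variant) if variant is not None else None,
    }


def parse_title_line(title):
    """Parse the bracketed fields of a browser title line."""
    fields = re.findall(r"\[([^\]]*)\]", title)
    if len(fields) < 3:
        raise ValueError(f"unexpected title line: {title!r}")
    multi_exon, members, cluster_size = (int(n) for n in fields[0].split("/"))
    reliability, minclones = (part.strip() for part in fields[2].split(","))
    return {
        "multi_exon_clones": multi_exon,   # 10 in the caption's example
        "variant_members": members,        # 15
        "cluster_size": cluster_size,      # 53
        "strand_flag": fields[1],          # raw flag, e.g. "F"
        "reliability": reliability,        # e.g. "High"
        "min_clones": int(minclones),
        "other_flags": fields[3:],         # e.g. ["mA", "no stop codon"]
    }


parsed_id = parse_ecgene_id("H13C1492.1")
parsed_title = parse_title_line("[10/15/53][F][High, 1][mA][no stop codon]")
```

Keeping the flags uninterpreted avoids guessing at codes the caption does not define one by one; a fuller parser would need the format documentation from the ECgene site.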

