BMC Bioinformatics. 2005 Feb 10;6:25.
doi: 10.1186/1471-2105-6-25.

Integrating alternative splicing detection into gene prediction


Sylvain Foissac et al. BMC Bioinformatics.

Abstract

Background: Alternative splicing (AS) is now considered a major contributor to transcriptome and proteome diversity, and it cannot be neglected when annotating a new genome. Despite considerable progress in the accuracy of computational gene prediction, reliably predicting AS variants when there is local experimental evidence for them remains an open challenge for gene finders.

Results: We have used a new integrative approach that incorporates AS detection into ab initio gene prediction. This method relies on the analysis of genomically aligned transcript sequences (ESTs and/or cDNAs) and has been implemented in the dynamic programming algorithm of the graph-based gene finder EuGENE. Given a genomic sequence and a set of aligned transcripts, this new version identifies the set of transcripts carrying evidence of alternative splicing events and provides, in addition to the classical optimal gene prediction, alternative optimal predictions (among those consistent with the AS events detected). This allows multiple annotations of a single gene, such that each predicted variant is supported by transcript evidence (though not necessarily with full-length coverage).

Conclusions: This automatic combination of experimental data analysis and ab initio gene finding offers an ideal integration of alternatively spliced gene prediction inside a single annotation pipeline.


Figures

Figure 1
EST/cDNA alignments on the spl7 gene region. Thick lines represent matches and dotted lines represent gaps. The two full-length cDNAs that provide the two correct reference gene structures are shown above the genomic sequence. Arrows indicate the start and stop codons. The ESTs T04465 and AI995153 present inconsistent splicing profiles and are labeled as incompatible.
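One way to make "incompatible" concrete is sketched below in Python. It assumes each alignment has been reduced to a sorted list of half-open exon blocks (start, end) in genomic coordinates, and it treats two alignments as incompatible when an exon block of one overlaps a gap of the other. This criterion and the names `gaps`, `overlaps`, and `incompatible` are illustrative assumptions, not the definition used by the authors.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # half-open genomic interval [start, end)


def gaps(blocks: List[Block]) -> List[Block]:
    """Return the gaps (putative introns) between consecutive exon blocks."""
    return [(left[1], right[0]) for left, right in zip(blocks, blocks[1:])]


def overlaps(a: Block, b: Block) -> bool:
    """True if two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def incompatible(x: List[Block], y: List[Block]) -> bool:
    """Assumed criterion: a match in one alignment falls inside a gap of the other."""
    return (
        any(overlaps(exon, gap) for exon in x for gap in gaps(y))
        or any(overlaps(exon, gap) for exon in y for gap in gaps(x))
    )


# Hypothetical coordinates, for illustration only.
reference = [(0, 100), (200, 300)]
same_splicing = [(50, 100), (200, 250)]
longer_exon = [(50, 150), (200, 250)]

print(incompatible(reference, same_splicing))
print(incompatible(reference, longer_exon))
```

Under this criterion an EST that only partially covers a transcript stays compatible with it as long as its matches and gaps agree on the shared region.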
Figure 2
EuGène's directed acyclic graph for a short example sequence. For simplicity, only the forward strand is considered. The DNA sequence is shown above the graph. Horizontal tracks represent the different possible annotations: intergenic (bottom), 5' and 3' UTR, exon in the 3 frames, and intron in 3 phases (the phase of an intron is defined according to the splicing position in the last codon of the previous exon). On each track, 2 vertices are used to represent each nucleotide. These 2 vertices are linked horizontally by a contents edge and a transition edge (see the text and Figure 3 for details). Dotted arrows show occurrences of biological signals (such as start/stop codons and donor/acceptor splice sites). They produce additional transition edges at the corresponding position. Since this version of EuGène does not include any promoter or polyA site prediction tool, transitions from intergenic to UTR and vice versa are allowed at every nucleotide position. All consistent gene structures can be represented by a path connecting the initial and terminal vertices.
Figure 3
Detail of EuGène's directed acyclic graph and algorithm. The zoomed region contains the first two nucleotides of the example sequence of Figure 2 (C at position i - 1, and A at position i), and two annotation tracks (UTR5' for j and exon in frame 2 for j + 1). The contents edges c connect the l vertices to the following r vertices of the same track. Transition edges t are either horizontal, linking the r vertices to the l vertices of the same track, or transversal, linking the r vertices to all possible l vertices according to the occurrence of a biological signal in the sequence. In this example, the graph allows the transition from the UTR5' track at position i - 1 to the exonic track at position i because the A nucleotide at position i is the first nucleotide of a potential start codon ATG. The dynamic programming algorithm used in EuGène determines, for each vertex r, which vertex precedes r in the optimal path. In this example, at position i on track j, the best path arriving from the left has only one possible origin, so its weight is determined by that origin. On track j + 1, two origins are possible (the horizontal edge on track j + 1 and the transversal edge from track j), and the best path is attributed whichever of the two candidate weights is lower.
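The recurrence described in this caption can be sketched as a minimum-weight path search over tracks. The Python below is an illustration under stated assumptions, not EuGène's implementation: `content_cost[i][j]` is a hypothetical weight for the contents edge of track j at position i, horizontal transitions cost nothing, and `transitions[i]` lists hypothetical transversal edges (source track, destination track) with their weights, available when entering position i.

```python
from typing import Dict, List, Tuple

INF = float("inf")


def best_path(
    content_cost: List[List[float]],
    transitions: Dict[int, Dict[Tuple[int, int], float]],
) -> Tuple[float, List[int]]:
    """Return the lowest total weight and the track chosen at each position."""
    n, k = len(content_cost), len(content_cost[0])
    weight = list(content_cost[0])  # best weight reaching each track at position 0
    back: List[List[int]] = [list(range(k))]  # back[i][j]: track used at i - 1
    for i in range(1, n):
        new_weight = [INF] * k
        pointers = [0] * k
        for j in range(k):
            # Horizontal transition: stay on the same track.
            best, origin = weight[j], j
            # Transversal transitions: only where a signal creates an edge.
            for (src, dst), w in transitions.get(i, {}).items():
                if dst == j and weight[src] + w < best:
                    best, origin = weight[src] + w, src
            new_weight[j] = best + content_cost[i][j]
            pointers[j] = origin
        weight = new_weight
        back.append(pointers)
    # Trace the optimal path back from the cheapest final track.
    j = min(range(k), key=weight.__getitem__)
    total, path = weight[j], [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return total, path


# Track 0 stands for UTR5', track 1 for an exon frame; all weights are made up.
costs = [[1, 5], [4, 1], [4, 1]]
start_codon_edge = {1: {(0, 1): 0.5}}  # hypothetical ATG signal entering position 1
print(best_path(costs, start_codon_edge))
```

A forbidden track can be expressed in this sketch by setting its contents weight to infinity at the relevant positions.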
Figure 4
Extension of EuGène's graph by a PGS to incorporate a single alternative transcript alignment. From the main graph (bottom) described in Figure 2, a Parallel Graph Subunit (PGS) is built (above) by duplicating the whole graph section involved in the EST alignment (shown between the graphs). The gene structure evidence provided by the alignment is taken into account in the PGS by forbidding the intergenic track all along the alignment, intronic tracks at match positions (light grey), and exonic tracks at gap positions (dark grey). Dotted arrows represent the two algorithm scans, the forward version from left to right and the backward version from right to left. At the junction point in the PGS, an optimal prediction is obtained. Figure not to scale.
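The three constraints in this caption can be written as a per-position mask. The Python below is a minimal sketch: the track names and the encoding of an alignment as a string of `M` (match) and `-` (gap) characters are assumptions made for illustration, and UTR tracks are left unconstrained because the caption does not mention them.

```python
from typing import List, Set

# Hypothetical track names; the real graph uses 3 exon frames and 3 intron phases.
INTERGENIC = "intergenic"
EXON_TRACKS = {"exon_f1", "exon_f2", "exon_f3"}
INTRON_TRACKS = {"intron_p0", "intron_p1", "intron_p2"}


def forbidden_tracks(alignment: str) -> List[Set[str]]:
    """Return, for each aligned position, the set of tracks forbidden in the PGS."""
    masks: List[Set[str]] = []
    for symbol in alignment:
        if symbol == "M":
            # Match position: the transcript covers it, so no intron here.
            masks.append({INTERGENIC} | INTRON_TRACKS)
        elif symbol == "-":
            # Gap position: the transcript skips it, so no exon here.
            masks.append({INTERGENIC} | EXON_TRACKS)
        else:
            raise ValueError(f"unexpected alignment symbol: {symbol!r}")
    return masks


for position, mask in enumerate(forbidden_tracks("MM-")):
    print(position, sorted(mask))
```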
Figure 5
Integration of several incompatible ESTs in EuGène-M's graph and algorithm. A: EST alignments (plain lines represent exons, dotted lines represent introns) on a genomic sequence (thick line). Each displayed EST is incompatible with at least one other. B: Multiple extensions of EuGène's graph model after these alignments have been processed. Each PGS (Figure 4) contains the information provided by its source EST. The dotted arrows show the algorithm's progression through the resulting graph during the first scan, from left to right.


