Pairagon+N-SCAN_EST: a model-based gene annotation pipeline

Manimozhiyan Arumugam¹, Chaochun Wei, Randall H Brown, Michael R Brent

PMID: 16925839
PMCID: PMC1810554
DOI: 10.1186/gb-2006-7-s1-s5

Manimozhiyan Arumugam et al. Genome Biol. 2006;7 Suppl 1:S5.1-10. doi: 10.1186/gb-2006-7-s1-s5. Epub 2006 Aug 7.

Affiliation

¹ Laboratory for Computational Genomics and Department of Computer Science, Washington University, One Brookings Drive, St. Louis, MO 63130, USA.

Abstract

Background: This paper describes Pairagon+N-SCAN_EST, a gene annotation pipeline that uses only native alignments. For each expressed sequence it chooses the best genomic alignment. Systems like ENSEMBL and ExoGean rely on trans alignments, in which expressed sequences are aligned to the genomic loci of putative homologs. Trans alignments contain a high proportion of mismatches, gaps, and/or apparently unspliceable introns, compared to alignments of cDNA sequences to their native loci. The Pairagon+N-SCAN_EST pipeline's first stage is Pairagon, a cDNA-to-genome alignment program based on a PairHMM probability model. This model relies on prior knowledge, such as the fact that introns must begin with GT, GC, or AT and end with AG or AC. It produces very precise alignments of high-quality cDNA sequences. In the genomic regions between Pairagon's cDNA alignments, the pipeline combines EST alignments with de novo gene prediction by using N-SCAN_EST. N-SCAN_EST is based on a generalized HMM probability model augmented with a phylogenetic conservation model and EST alignments. It can predict complete transcripts by extending or merging EST alignments, but it can also predict genes in regions without EST alignments. Because they are based on probability models, both Pairagon and N-SCAN_EST can be trained automatically for new genomes and data sets.

Results: On the ENCODE regions of the human genome, Pairagon+N-SCAN_EST was as accurate as any other system tested in the EGASP assessment, including ENSEMBL and ExoGean.

Conclusion: With sufficient mRNA/EST evidence, genome annotation without trans alignments can compete successfully with systems like ENSEMBL and ExoGean, which use trans alignments.
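The splice-site constraint mentioned in the Background (introns begin with GT, GC, or AT and end with AG or AC) can be illustrated with a minimal sketch. This is not Pairagon's code: the function name `intron_boundaries_ok` and the half-open coordinate convention are hypothetical, and restricting the dinucleotides to the three pairings GT-AG, GC-AG, and AT-AC is an assumption that goes beyond what the abstract states.

```python
# Donor/acceptor dinucleotide pairs. The abstract lists the donor set
# (GT, GC, AT) and the acceptor set (AG, AC); limiting them to these three
# pairings is an assumption of this sketch.
ALLOWED_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def intron_boundaries_ok(genome: str, start: int, end: int) -> bool:
    """Return True if genome[start:end] begins and ends with an allowed pair.

    Coordinates are 0-based and half-open (hypothetical convention).
    """
    if end - start < 4:
        # Too short to hold both a donor and an acceptor dinucleotide.
        return False
    intron = genome[start:end].upper()
    return (intron[:2], intron[-2:]) in ALLOWED_PAIRS


if __name__ == "__main__":
    print(intron_boundaries_ok("AAGTCCCCAGTT", 2, 10))  # GT...AG
    print(intron_boundaries_ok("AAGTCCCCACTT", 2, 10))  # GT...AC
```

In a probabilistic aligner this kind of prior would typically be encoded in the model's states and transitions rather than as a hard filter; the boolean check here only shows which boundaries the constraint admits.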

Figures

**Figure 1**
PairHMM state diagrams of Pairagon. **(a)** Alignment model and **(b)** Null model. RG1 and RG2 are unaligned genomic sequences in the 5' and 3' ends, respectively; RC1 and RC2 are unaligned cDNA sequences in the 5' and 3' ends, respectively; A, aligned; Entry corresponds to the first two bases of an intron; Exit corresponds to the last two bases of an intron; G, genomic insertion; C, cDNA insertion; RG and RC are random genomic and cDNA sequences, respectively. States that can start an alignment are marked with an asterisk and states that can end an alignment are marked with a dagger.

**Figure 2**
Block diagram of the Pairagon+N-SCAN_EST pipeline. The bold arrows mark the section of the flowchart corresponding to N-SCAN gene prediction.

**Figure 3**
An annotated GC donor site that ENSEMBL misses. There is a GT dinucleotide four nucleotides downstream of the GC donor site (both dinucleotides are marked brown in the sequence). Pairagon identifies the correct donor site. (Screen shot obtained from UCSC Genome Browser web site [23].)
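The situation in Figure 3 can be mimicked with a toy example. The sequence below is invented for illustration (it is not the locus shown in the figure), `candidate_donors` is a hypothetical helper, and "four nucleotides downstream" is interpreted here as a start-to-start offset of four, which is an assumption. The sketch only shows that a scan restricted to GT never proposes the GC site; it does not describe how ENSEMBL or Pairagon actually select donor sites.

```python
def candidate_donors(seq, donors=("GT", "GC")):
    """List (position, dinucleotide) for every candidate donor in seq."""
    seq = seq.upper()
    return [
        (i, seq[i:i + 2])
        for i in range(len(seq) - 1)
        if seq[i:i + 2] in donors
    ]


# Invented sequence: GC at position 5, GT at position 9.
toy = "CCAAGGCAAGTAAGA"

gt_only = candidate_donors(toy, donors=("GT",))
gt_or_gc = candidate_donors(toy)

print(gt_only)   # only the downstream GT is considered
print(gt_or_gc)  # the GC site is now a candidate as well
```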

**Figure 4**
Generating the search subspace given three high-scoring segment pairs (HSPs) in the Stepping Stone algorithm. The three diagonal lines represent the three HSPs. The stars represent alignment pins. The lighter blue areas represent the search subspaces that are actually used in the heuristic method. The optimal algorithm uses the entire rectangle in blue. The block diagram shows the optimal spliced alignment where blue boxes represent an exon and the thin lines represent an intron.
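The idea behind the reduced search space in Figure 4 can be sketched as follows. The caption does not say how pins are placed or how the subspaces are shaped, so this sketch makes two loud assumptions: each HSP contributes one pin at its midpoint, and the subspaces are the rectangles between consecutive pins. All names (`pins_from_hsps`, `subspaces`, `area`) and the HSP coordinates are hypothetical.

```python
from typing import List, Tuple

# One HSP on a diagonal: (cdna_start, genome_start, length). Hypothetical format.
HSP = Tuple[int, int, int]
Point = Tuple[int, int]


def pins_from_hsps(hsps: List[HSP]) -> List[Point]:
    """Place one pin at the midpoint of each HSP (assumption of this sketch)."""
    return [(c + n // 2, g + n // 2) for c, g, n in sorted(hsps)]


def subspaces(pins: List[Point], cdna_len: int, genome_len: int):
    """Rectangles between consecutive pins, from (0, 0) to the sequence ends."""
    corners = [(0, 0)] + list(pins) + [(cdna_len, genome_len)]
    return list(zip(corners, corners[1:]))


def area(rects) -> int:
    """Total number of dynamic-programming cells covered by the rectangles."""
    return sum((c2 - c1) * (g2 - g1) for (c1, g1), (c2, g2) in rects)


if __name__ == "__main__":
    hsps = [(0, 100, 40), (40, 1000, 60), (100, 5000, 50)]
    cdna_len, genome_len = 150, 6000
    restricted = area(subspaces(pins_from_hsps(hsps), cdna_len, genome_len))
    full = cdna_len * genome_len
    print(restricted, full)  # 292000 900000
```

With these made-up numbers the pinned search visits roughly a third of the cells of the full rectangle, which is the kind of saving the heuristic trades against guaranteed optimality.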

**Figure 5**
An incorrect alignment from Pairagon. The seed alignment from BLASTN aligned the 112-base exon at a location about 30 kb upstream (arrow in Pairagon gene prediction) instead of the annotated location (arrows in Gencode reference genes). Both alignments for that exon are 100% identical. (Screen shot obtained from UCSC Genome Browser web site [23].)

**Figure 6**
Initial exon of a gene where N-SCAN correctly discriminates coding region from the 5' UTR. Other gene prediction systems predict longer coding regions due to the high G+C content of the region. (Screen shot obtained from UCSC Genome Browser web site [23].)

**Figure 7**
A gene where N-SCAN_EST correctly predicts three of the four exons. No other program except AceView predicts anything at that locus. N-SCAN_EST missed an exon even though there is EST evidence for it. We believe that lack of conservation overwhelmed the EST evidence for that exon. (Screen shot obtained from UCSC Genome Browser web site [23].)

References

1. The MGC Project Team. The status, quality, and expansion of the NIH full-length cDNA project: The Mammalian Gene Collection (MGC). Genome Res. 2004;14:2121–2127. doi: 10.1101/gr.2596504.
1. Brent MR. Genome annotation past, present and future: How to define an ORF at each locus. Genome Res. 2005;15:1777–1786. doi: 10.1101/gr.3866105.
1. Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome Res. 2004;14:988–995. doi: 10.1101/gr.1865504.
1. Wei C, Brent MR. Integrating EST alignments and de novo gene prediction using TWINSCAN. BMC Bioinformatics. 2006.
1. van Baren MJ, Brent MR. Iterative gene prediction and pseudo-gene removal improves genome annotation. Genome Res. 2006;16:678–685. doi: 10.1101/gr.4766206.
