Nucleic Acids Res. 2009 Apr;37(7):e52.
doi: 10.1093/nar/gkp052. Epub 2009 Mar 5.

Finding genes in Schistosoma japonicum: annotating novel genomes with help of extrinsic evidence

Brona Brejová et al. Nucleic Acids Res. 2009 Apr.

Abstract

We have developed a novel method for estimating the parameters of hidden Markov models for gene finding in newly sequenced species. Our approach does not rely on curated training data sets; instead, it uses extrinsic evidence (including paired-end ditags, which have not previously been used in gene finding) and iterative training. The method is particularly suitable for annotating species at a large evolutionary distance from the closest annotated species. We used our approach to produce an initial annotation of more than 16,000 genes in the newly sequenced Schistosoma japonicum draft genome. We established the high quality of our predictions by comparison to full-length cDNAs (withheld from the extrinsic evidence) and to CEGMA core genes. We also evaluated the effectiveness of the new training procedure on the Caenorhabditis elegans genome. ExonHunter and the newest parametric files for the S. japonicum genome are available for download at www.bioinformatics.uwaterloo.ca/downloads/exonhunter.
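
The general shape of iterative training with evidence filtering can be sketched as follows. This is a minimal illustration under stated assumptions, not ExonHunter's implementation: the function names (`iterative_training`, `supported`, `predict`, `estimate`) are hypothetical, the "parameters" are reduced to a single mean fragment length standing in for the hidden Markov model parameters, and a predicted fragment counts as supported if it merely overlaps an evidence interval.

```python
from typing import Callable, List, Tuple

Fragment = Tuple[int, int]  # half-open genomic interval (start, end)


def overlaps(a: Fragment, b: Fragment) -> bool:
    """True if two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def supported(predicted: List[Fragment], evidence: List[Fragment]) -> List[Fragment]:
    """Keep only predicted fragments that overlap some evidence interval."""
    return [p for p in predicted if any(overlaps(p, e) for e in evidence)]


def iterative_training(
    initial: float,
    predict: Callable[[float], List[Fragment]],
    estimate: Callable[[List[Fragment]], float],
    evidence: List[Fragment],
    max_rounds: int = 10,
) -> float:
    """Alternate prediction and re-estimation until the parameters stop changing."""
    params = initial
    for _ in range(max_rounds):
        # Predict with the current parameters, then filter by evidence.
        training = supported(predict(params), evidence)
        if not training:
            break  # nothing trustworthy to train on
        new_params = estimate(training)
        if new_params == params:
            break  # converged
        params = new_params
    return params


def demo() -> float:
    # Toy data (hypothetical): candidate fragments and evidence alignments.
    candidates = [(0, 100), (200, 320), (400, 1000), (1200, 1310)]
    evidence = [(50, 80), (210, 250), (1250, 1300)]

    def predict(mean_len: float) -> List[Fragment]:
        # Toy "gene finder": accept candidates within a factor of 3 of the mean.
        return [c for c in candidates if mean_len / 3 <= c[1] - c[0] <= mean_len * 3]

    def estimate(frags: List[Fragment]) -> float:
        # Toy "parameter estimation": mean length of the training fragments.
        return sum(end - start for start, end in frags) / len(frags)

    return iterative_training(300.0, predict, estimate, evidence)


print(demo())  # prints 110.0
```

In this toy run the unsupported 600-base candidate never enters the training set, so the estimate settles on the mean of the three supported fragments after one round and the second round confirms convergence.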

Figures

Figure 1.
Evolutionary distance of S. japonicum from well-annotated species. The phylogeny was derived by maximum likelihood from a multiple alignment of small ribosomal subunit RNAs (9) using PHYML (10) and MUSCLE (11).
Figure 2.
Selection of supported gene fragments. ExonHunter integrates several sources of extrinsic evidence (such as sequence repeats, known proteins, ESTs and PETs). The figure shows an example of alignments of individual sources to the genomic sequence at a particular locus. This information is combined into the super advisor (the super advisor score for coding regions is shown). The same super advisor scores are used both to aid gene prediction and to identify supported fragments of the gene for training.
Figure 3.
Evaluation of iterative training on C. elegans (gene and exon sensitivity). Each line in the plots is annotated with the method used for training and the level of extrinsic evidence used for both training and testing. Using supported gene fragments helps iterative training achieve performance close to that of training on a curated training set. The filtering step is especially important when working with weak extrinsic evidence. Specificity comparison leads to the same conclusion; see Supplementary Data. In all experiments on C. elegans, ExonHunter under-predicted the number of genes, as can be seen by comparing gene sensitivities and specificities in Supplementary Table S1.
Figure 4.
Influence of PETs on ExonHunter predictions. The PET mapped to the genome sequence correctly identifies the extent of the transcript supported by full-length cDNA (the transcript includes untranslated regions, shown as shaded areas). The ExonHunter prediction without PETs misidentifies the start of the gene and adds two spurious exons to the transcript. Using PETs not only helps to identify the start site, but also corrects the reading frame of the first exon and the acceptor site of the second exon.
Figure 5.
CEGMA and ExonHunter predictions of the vacuolar proton pump subunit C homolog. ExonHunter combines evidence from S. japonicum and S. mansoni ESTs, as well as SWISSPROT protein, to predict the gene structure. The ExonHunter prediction of this gene is of better quality than the core gene predicted by the CEGMA pipeline.
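
The sensitivity and specificity figures referred to in the Figure 3 caption can be illustrated with a short sketch. In the gene-finding literature, sensitivity is usually the fraction of annotated features that are predicted correctly, and specificity is the fraction of predicted features that are correct (what other fields call precision). The code below assumes exact-match counting on `(start, end, strand)` tuples; the function name and the toy coordinates are hypothetical, and the paper's own evaluation scripts may differ in detail.

```python
from typing import Iterable, Tuple

Exon = Tuple[int, int, str]  # (start, end, strand)


def sensitivity_specificity(
    predicted: Iterable[Exon], annotated: Iterable[Exon]
) -> Tuple[float, float]:
    """Exact-match sensitivity and specificity for a set of features."""
    pred, ref = set(predicted), set(annotated)
    true_positives = len(pred & ref)
    # Sensitivity: share of annotated features recovered exactly.
    sensitivity = true_positives / len(ref) if ref else 0.0
    # Specificity (gene-finding sense): share of predictions that are correct.
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity


# Toy example: four annotated exons, three predicted, two exact matches.
annotated = [(100, 200, "+"), (300, 400, "+"), (500, 600, "+"), (700, 800, "+")]
predicted = [(100, 200, "+"), (300, 400, "+"), (510, 600, "+")]
sn, sp = sensitivity_specificity(predicted, annotated)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

With fewer predictions than annotated features, sensitivity (0.50 here) falls below specificity (about 0.67), which is the pattern one would expect from a gene finder that under-predicts.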

References

    1. World Health Organization Expert Committee. Technical report 830. 1993. The control of schistosomiasis. WHO technical report series.
    2. Korf I. Gene finding in novel genomes. BMC Bioinformatics. 2004;5:59.
    3. Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 2005;33:6494–6496.
    4. Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, et al. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006;7(Suppl. 1):1–31.
    5. Brejova B, Brown DG, Li M, Vinar T. ExonHunter: a comprehensive approach to gene finding. Bioinformatics. 2005;21(Suppl. 1):i57–i65.

Publication types