BMC Bioinformatics. 2014 Jul 1;15:229.
doi: 10.1186/1471-2105-15-229.

SnowyOwl: accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models

Ian Reid et al. BMC Bioinformatics. 2014.

Abstract

Background: Locating the protein-coding genes in novel genomes is essential to understanding and exploiting genomic information, but accurately predicting all the genes remains difficult. The recent availability of detailed transcript-structure information from high-throughput sequencing of messenger RNA (RNA-Seq) delineates many expressed genes and promises increased accuracy in gene prediction. Computational gene predictors have been intensively developed for and tested in well-studied animal genomes. Hundreds of fungal genomes have been or will soon be sequenced. The differences between fungal and animal genomes and the phylogenetic sparsity of well-studied fungi call for gene-prediction tools tailored to fungi.

Results: SnowyOwl is a new gene prediction pipeline that uses RNA-Seq data to train and provide hints for the generation of Hidden Markov Model (HMM)-based gene predictions and to evaluate the resulting models. The pipeline has been developed and streamlined by comparing its predictions to manually curated gene models in three fungal genomes and validated against the high-quality gene annotation of Neurospora crassa; SnowyOwl predicted N. crassa genes with 83% sensitivity and 65% specificity. SnowyOwl gains sensitivity by repeatedly running the HMM gene predictor Augustus with varied input parameters, and gains selectivity by choosing the models with the best homology to known proteins and the best agreement with the RNA-Seq data.
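
The accuracy figures above can be made concrete with a small sketch. It assumes the convention common in gene-prediction benchmarking, where sensitivity is the fraction of reference genes that were predicted and specificity is the fraction of predictions that match a reference gene, TP / (TP + FP); the abstract itself does not spell out the formula. The `GeneModel` tuple, the toy coordinates, and the exact-coordinate matching rule are illustrative assumptions, not code from the SnowyOwl pipeline.

```python
from typing import NamedTuple, Set, Tuple


class GeneModel(NamedTuple):
    """Hypothetical minimal gene model: contig, strand, and exon coordinates."""
    contig: str
    strand: str
    exons: Tuple[Tuple[int, int], ...]  # (start, end) pairs in genomic order


def accuracy(predicted: Set[GeneModel],
             reference: Set[GeneModel]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) under exact-match scoring.

    Assumption: specificity = TP / (TP + FP), as is usual for gene prediction.
    """
    true_positives = len(predicted & reference)
    sensitivity = true_positives / len(reference) if reference else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy data: four reference genes, three predictions, two exact matches.
reference = {
    GeneModel("chr1", "+", ((100, 200), (300, 400))),
    GeneModel("chr1", "-", ((1000, 1500),)),
    GeneModel("chr2", "+", ((50, 90),)),
    GeneModel("chr2", "+", ((500, 700),)),
}
predicted = {
    GeneModel("chr1", "+", ((100, 200), (300, 400))),  # exact match
    GeneModel("chr1", "-", ((1000, 1500),)),           # exact match
    GeneModel("chr2", "+", ((50, 95),)),               # wrong end coordinate
}

sens, spec = accuracy(predicted, reference)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

With the toy data, two of four reference genes are recovered and two of three predictions are correct, which shows how a pipeline can be more sensitive than it is specific, as reported for N. crassa.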

Conclusions: SnowyOwl efficiently uses RNA-Seq data to produce accurate gene models in both well-studied and novel fungal genomes. The source code for the SnowyOwl pipeline (in Python) and a web interface (in PHP) is freely available from http://sourceforge.net/projects/snowyowl/.

Figures

Figure 1
Stages of information flow from DNA and RNA sequences to gene models in the SnowyOwl pipeline.
Figure 2
Model scoring flowchart. The bold line marks the main path. Models that follow paths leading to boxes with pink backgrounds are imperfect; models with paths ending in boxes with green backgrounds are potentially accepted.
Figure 3
Selection of the best scored models in a typical region of the P. chrysosporium genome with overlapping gene predictions. Representative models are the highest scoring models at their locations, and accepted models are representatives that are consistent with all available evidence. In this region, the SnowyOwl accepted models matched the manually curated models, but previous (JGI) models showed small differences (marked with red ovals). Accepted models and the representative and candidate models that match them are outlined in green. The colour intensity in each exon is proportional to its score. Marked introns were verified by detecting spanning spliced reads with tuqueSplit [39]. Orange bars at the top of the read coverage track show regions of coverage depth > 1000. The data were visualized with GBrowse2 [40].
Figure 4
Effect of read coverage depth on sensitivity (A) and specificity (B) of Neurospora crassa gene and exon prediction by SnowyOwl. Solid lines: all coordinates matched exactly; dashed lines: exon start coordinates were not required to match. Read coverage depth is the number of reads mapped to a feature divided by the feature length.
Figure 5
Relationships between the sensitivity and specificity of predicting Neurospora crassa exons and genes by various methods. Prediction sets were from GeneMark-ES; Augustus run with the neurospora_crassa species parameters included in the Augustus distribution, either unhinted or with RNA-Seq hints; the pooled Blat-hinted Augustus models from SnowyOwl; all the candidate models generated by SnowyOwl; and the final SnowyOwl accepted models. Models with read coverage below 0.5 were removed from each set.
Figure 6
Distribution of non-zero scores for A. niger exon models before selection of representatives. The peaks marked by arrows near scores of 2, 3, and 6 arise from gene models with homology to 1, 2, or 3 known genes, respectively. In this sample, 44% of the exons had score 0.
Figure 7
Distribution of the degree of overlap between SnowyOwl and previous gene predictions in A. niger and P. chrysosporium. Many SnowyOwl models, especially in A. niger, are identical to previous gene predictions or differ only at their start position, but there are some unique models in each set with less than 5% overlap in the other set, and numerous model pairs with intermediate levels of overlap. The higher numbers of gene models in the P. chrysosporium SnowyOwl set mainly result from predictions at locations where no gene was previously predicted. The counts for identical models and models differing only by alternative start positions are offset from 100% overlap for visibility.
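
Two definitions in the captions above lend themselves to a short sketch: read coverage depth (reads mapped to a feature divided by the feature length, Figure 4) and representative models (the highest-scoring model at each location, Figure 3). The code below is a minimal illustration under stated assumptions, not the SnowyOwl implementation: the `Candidate` tuple and its field names are hypothetical, a "location" is assumed to be a run of overlapping models on one contig and strand, and the 0.5 coverage cutoff is borrowed from the Figure 5 caption. The further step of accepting only representatives consistent with all available evidence is not modelled.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical candidate gene model on a single contig and strand."""
    name: str
    start: int         # 1-based, inclusive
    end: int           # 1-based, inclusive
    score: float
    mapped_reads: int


def coverage_depth(model: Candidate) -> float:
    """Number of reads mapped to the feature divided by the feature length."""
    length = model.end - model.start + 1
    return model.mapped_reads / length


def pick_representatives(candidates: List[Candidate]) -> List[Candidate]:
    """Group overlapping candidates into loci; keep the top scorer per locus."""
    representatives: List[Candidate] = []
    locus: List[Candidate] = []
    locus_end = 0
    for model in sorted(candidates, key=lambda m: (m.start, m.end)):
        if not locus or model.start > locus_end:
            # No overlap with the current locus: close it and start a new one.
            if locus:
                representatives.append(max(locus, key=lambda m: m.score))
            locus = [model]
            locus_end = model.end
        else:
            locus.append(model)
            locus_end = max(locus_end, model.end)
    if locus:
        representatives.append(max(locus, key=lambda m: m.score))
    return representatives


candidates = [
    Candidate("a1", 100, 500, 3.0, 800),
    Candidate("a2", 120, 500, 6.0, 760),    # overlaps a1, higher score
    Candidate("b1", 900, 1299, 2.0, 100),   # depth 100 / 400 = 0.25
    Candidate("c1", 2000, 2099, 2.5, 200),  # depth 200 / 100 = 2.0
]

reps = pick_representatives(candidates)
kept = [m for m in reps if coverage_depth(m) >= 0.5]  # cutoff from Figure 5
print([m.name for m in reps], [m.name for m in kept])
```

Sorting by start coordinate lets overlapping models be clustered in a single pass; a real pipeline would also need to separate contigs and strands before clustering.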
