NAR Genom Bioinform. 2020 Jun;2(2):lqaa026.
doi: 10.1093/nargab/lqaa026. Epub 2020 May 13.

GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins

Tomáš Brůna et al. NAR Genom Bioinform. 2020 Jun.

Abstract

We have taken several steps toward creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced an automated method for efficient ab initio gene finding, GeneMark-ES, with parameters trained in iterative unsupervised mode. Next, in GeneMark-ET, we proposed a method for integrating unsupervised training with information on intron positions revealed by mapping short RNA reads. Now we describe GeneMark-EP, a tool that utilizes another source of external information, a protein database, readily available prior to the start of a sequencing project. A new specialized pipeline, ProtHint, initiates massive protein mapping to the genome and extracts hints to splice sites and translation start and stop sites of potential genes. GeneMark-EP uses the hints to improve estimation of model parameters as well as to adjust coordinates of predicted genes if they disagree with the most reliable hints (the -EP+ mode). Tests of GeneMark-EP and -EP+ demonstrated improvements in gene prediction accuracy in comparison with GeneMark-ES, while GeneMark-EP+ showed higher accuracy than GeneMark-ET. We have observed that the most pronounced improvements in gene prediction accuracy happened in large eukaryotic genomes.

Figures

Figure 1.
A flowchart of the GeneMark-EP and -EP+ iterative training.
Figure 2.
An overview of the ProtHint pipeline.
Figure 3.
Selection of sequence regions for GeneMark-EP+ training with enforcement of high-confidence (HC) hints.
Figure 4.
ProtHint intron processing in the case of N. crassa. Introns were generated by spliced alignments of target proteins from species outside the Neurospora genus. (A) Distribution of the score vectors (IBA, IMC) of true positive (green) and false positive (purple) introns. The black lines represent cutoffs at IMC = 4 and IBA = 0.25. Total numbers of false and true positives are shown in the upper left corner. (B) Sn and Sp of intron sets selected by thresholds on IBA score and IMC score. IMC score is computed for introns that have IBA score ≥ 0.1 and exon AEE score ≥ 25. The red curve has two branches. The left branch reflects (Sp, Sn) values of the sets of introns selected by using an IMC threshold from 0 to 4. The set selected with IMC threshold = 4 is recorded as set A, which corresponds to the black circle on the red curve. The right branch reflects (Sp, Sn) values of the sets of introns generated by applying to set A an IBA score threshold changing from 0 to 0.25 and up to 1.0. Set B corresponds to the black cross on the red curve; introns in this set have IMC ≥ 4 and IBA ≥ 0.25. Separate curves for IMC score change (dashed blue) and IBA score change (dashed purple) are shown as well.
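The threshold-based selection described in the Figure 4 caption can be illustrated with a minimal Python sketch. This is not ProtHint code: the `Intron` record, the function names and the toy coordinates are hypothetical. The sketch also assumes the convention common in the gene-finding literature, in which Sn is the fraction of annotated introns that are selected and Sp is the fraction of selected introns that are annotated; the caption itself does not define these measures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intron:
    """Hypothetical record for a candidate intron with its two scores."""
    seqid: str
    start: int
    end: int
    strand: str
    imc: int    # IMC score, as named in the caption
    iba: float  # IBA score, as named in the caption


def select_introns(introns, imc_min=4, iba_min=0.25):
    """Keep introns that pass both cutoffs (IMC >= 4 and IBA >= 0.25 gives set B)."""
    return [i for i in introns if i.imc >= imc_min and i.iba >= iba_min]


def sn_sp(selected, annotated):
    """Return (Sn, Sp) under the assumed definitions:
    Sn = TP / number of annotated introns, Sp = TP / number of selected introns.
    `annotated` is a set of (seqid, start, end, strand) tuples."""
    keys = {(i.seqid, i.start, i.end, i.strand) for i in selected}
    tp = len(keys & annotated)
    sn = tp / len(annotated) if annotated else 0.0
    sp = tp / len(keys) if keys else 0.0
    return sn, sp


# Toy data (invented for illustration only).
candidates = [
    Intron("chr1", 100, 200, "+", imc=10, iba=0.9),
    Intron("chr1", 300, 400, "+", imc=5, iba=0.2),
    Intron("chr1", 500, 600, "-", imc=2, iba=0.8),
    Intron("chr1", 700, 800, "+", imc=6, iba=0.5),
]
annotated = {("chr1", 100, 200, "+"), ("chr1", 300, 400, "+"), ("chr1", 900, 1000, "-")}

set_a = select_introns(candidates, imc_min=4, iba_min=0.0)   # IMC cutoff only
set_b = select_introns(candidates, imc_min=4, iba_min=0.25)  # both cutoffs
print(sn_sp(set_a, annotated), sn_sp(set_b, annotated))
```

Sweeping `imc_min` from 0 to 4 and then `iba_min` from 0 to 1.0 over such data would trace (Sp, Sn) points analogous to the two branches of the red curve.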
Figure 5.
Comparison of GeneMark-ES and GeneMark-EP+ accuracy at the gene level. Accuracy of GeneMark-EP+ is shown for cases in which ProtHint works with reference OrthoDB protein sets of different sizes: from the largest (only proteins from the same species are excluded) to the smallest (proteins of the whole phylum are excluded). A gene prediction is considered to be correct if it matches one of the annotated isoforms. For D. rerio, gene-level Sn was computed only with respect to complete genes.
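The gene-level criterion in the Figure 5 caption (a prediction is correct if it matches one of the annotated isoforms) can be sketched in Python as follows. This is an illustration under stated assumptions, not the evaluation code used in the paper: the data layout, the function name `gene_level_accuracy`, and the definitions Sn = annotated genes recovered / annotated genes and Sp = correct predictions / all predictions are assumptions, as is the requirement that predictions are unique and that no isoform structure is shared between annotated genes.

```python
def gene_level_accuracy(predicted, annotated):
    """Gene-level (Sn, Sp) under the assumptions stated above.

    predicted: list of gene structures, each a hashable tuple
               (seqid, strand, ((start, end), ...)) of coding exons.
    annotated: dict mapping gene id -> set of isoform structures
               in the same tuple layout.
    """
    # Map every annotated isoform structure back to its gene.
    isoform_to_gene = {
        iso: gene_id
        for gene_id, isoforms in annotated.items()
        for iso in isoforms
    }
    # A prediction is correct if it exactly matches any annotated isoform.
    correct = [p for p in predicted if p in isoform_to_gene]
    recovered_genes = {isoform_to_gene[p] for p in correct}
    sn = len(recovered_genes) / len(annotated) if annotated else 0.0
    sp = len(correct) / len(predicted) if predicted else 0.0
    return sn, sp


# Toy data (invented for illustration only).
iso_a = ("chr1", "+", ((100, 200), (300, 400)))
iso_b = ("chr1", "+", ((100, 200), (350, 400)))   # alternative isoform of g1
iso_c = ("chr1", "-", ((1000, 1200),))
iso_d = ("chr2", "+", ((50, 90), (150, 300)))
annotated = {"g1": {iso_a, iso_b}, "g2": {iso_c}, "g3": {iso_d}}

predicted = [
    iso_b,                                  # matches an isoform of g1
    iso_d,                                  # matches g3
    ("chr1", "-", ((1000, 1150),)),         # wrong exon end, no match
    ("chr2", "-", ((500, 700),)),           # no annotated counterpart
]
print(gene_level_accuracy(predicted, annotated))
```

Exact matching of the whole exon chain is a deliberately strict choice here; a prediction that gets all but one exon boundary right still counts as incorrect at the gene level.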
