GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins

Tomáš Brůna¹, Alexandre Lomsadze², Mark Borodovsky¹,²,³

Affiliations

¹ School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA.
² Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA.
³ School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA.

PMID: 32440658
PMCID: PMC7222226
DOI: 10.1093/nargab/lqaa026

Tomáš Brůna et al. NAR Genom Bioinform. 2020 Jun;2(2):lqaa026. doi: 10.1093/nargab/lqaa026. Epub 2020 May 13.

Abstract

We have made several steps toward creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced an automated method for efficient ab initio gene finding, GeneMark-ES, with parameters trained in iterative unsupervised mode. Next, in GeneMark-ET, we proposed a method for integrating unsupervised training with information on intron positions revealed by mapping short RNA reads. Now we describe GeneMark-EP, a tool that utilizes another source of external information, a protein database, readily available prior to the start of a sequencing project. A new specialized pipeline, ProtHint, initiates massive mapping of proteins to the genome and extracts hints to splice sites and to translation start and stop sites of potential genes. GeneMark-EP uses the hints to improve estimation of model parameters and, in the -EP+ mode, also to adjust coordinates of predicted genes if they disagree with the most reliable hints. Tests of GeneMark-EP and -EP+ demonstrated improvements in gene prediction accuracy in comparison with GeneMark-ES, while GeneMark-EP+ showed higher accuracy than GeneMark-ET. The most pronounced improvements in gene prediction accuracy were observed in large eukaryotic genomes.

Figures

**Figure 1.**
A flowchart of the GeneMark-EP and -EP+ iterative training.

**Figure 2.**
An overview of the ProtHint pipeline.

**Figure 3.**
Selection of sequence regions for GeneMark-EP+ training with enforcement of high-confidence (HC) hints.

**Figure 4.**
ProtHint intron processing in the case of *N. crassa*. Introns were generated by spliced alignments of target proteins from species outside the *Neurospora* genus. (A) Distribution of the score vectors (IBA, IMC) of true positive (green) and false positive (purple) introns. The black lines represent cutoffs at IMC = 4 and IBA = 0.25. Total numbers of false and true positives are shown in the upper left corner. (B) Sn and Sp of intron sets selected by thresholds on the IBA and IMC scores. The IMC score is computed for introns that have IBA score ≥ 0.1 and exon AEE score ≥ 25. The red curve has two branches. The left branch shows the (Sp, Sn) values of the sets of introns selected with an IMC threshold increasing from 0 to 4; the set with IMC threshold = 4 is recorded as set A and corresponds to the black circle on the red curve. The right branch shows the (Sp, Sn) values of the sets obtained by applying to set A an IBA score threshold increasing from 0 through 0.25 and up to 1.0. Set B corresponds to the black cross on the red curve; introns in this set have IMC ≥ 4 and IBA ≥ 0.25. Separate curves for IMC score change (dashed blue) and IBA score change (dashed purple) are shown as well.
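
The two-step selection described in this caption can be illustrated with a minimal Python sketch. It is not ProtHint code: the `Intron` record, the `select` and `sn_sp` helpers, and the toy scores are all hypothetical. It also assumes the definitions common in gene-finding evaluation, Sn = TP / (number of annotated introns) and Sp = TP / (TP + FP), which the caption itself does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intron:
    """Hypothetical record for one intron candidate mapped from proteins."""
    iba: float      # IBA score
    imc: int        # IMC score
    is_true: bool   # whether the intron matches the reference annotation


def select(introns, imc_min=4, iba_min=0.25):
    """Keep introns that pass both score thresholds (defaults from the caption)."""
    return [i for i in introns if i.imc >= imc_min and i.iba >= iba_min]


def sn_sp(selected, n_annotated):
    """Return (Sn, Sp) under the assumed definitions:
    Sn = TP / annotated introns, Sp = TP / (TP + FP)."""
    tp = sum(1 for i in selected if i.is_true)
    sn = tp / n_annotated if n_annotated else 0.0
    sp = tp / len(selected) if selected else 0.0
    return sn, sp


# Toy data, invented for illustration only.
TOY = [
    Intron(0.90, 10, True),
    Intron(0.30, 4, True),
    Intron(0.20, 6, True),
    Intron(0.50, 2, False),
    Intron(0.15, 5, False),
    Intron(0.40, 4, False),
]

set_a = select(TOY, imc_min=4, iba_min=0.0)  # IMC threshold only (set A)
set_b = select(TOY)                          # IMC >= 4 and IBA >= 0.25 (set B)

print("set A:", sn_sp(set_a, n_annotated=4))
print("set B:", sn_sp(set_b, n_annotated=4))
```

Sweeping `imc_min` from 0 to 4 and then `iba_min` from 0 to 1.0 on set A would trace the two branches of a curve like the red one in panel B. The sketch does not model the pre-filters mentioned in the caption (IBA ≥ 0.1 and exon AEE ≥ 25).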

**Figure 5.**
Comparison of GeneMark-ES and GeneMark-EP+ accuracy at the gene level. Accuracy of GeneMark-EP+ is shown for cases in which ProtHint works with sets of reference OrthoDB proteins of different sizes: from the largest (only proteins from the same species are excluded) to the smallest (proteins of the whole phylum are excluded). A gene prediction is considered correct if it matches one of the annotated isoforms. For *D. rerio*, gene-level Sn was computed only with respect to complete genes.
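
The gene-level criterion in this caption (a prediction is correct if it matches one of the annotated isoforms) can be sketched as follows. This is an illustrative assumption, not the authors' evaluation script: genes are modeled as tuples of coding-exon `(start, end)` pairs, a match means identical exon coordinates, and the function names and toy coordinates are hypothetical.

```python
def is_correct(prediction, annotated_genes):
    """True if the predicted exon chain equals any isoform of any annotated gene."""
    return any(prediction in isoforms for isoforms in annotated_genes.values())


def gene_level_accuracy(predictions, annotated_genes):
    """Return (Sn, Sp) at the gene level.

    Assumed definitions: Sn = annotated genes with at least one exactly
    matching prediction / annotated genes; Sp = correct predictions /
    all predictions.
    """
    matched_genes = {
        gene_id
        for gene_id, isoforms in annotated_genes.items()
        if any(p in isoforms for p in predictions)
    }
    n_correct = sum(1 for p in predictions if is_correct(p, annotated_genes))
    sn = len(matched_genes) / len(annotated_genes) if annotated_genes else 0.0
    sp = n_correct / len(predictions) if predictions else 0.0
    return sn, sp


# Toy annotation: gene g1 has two isoforms, gene g2 has one (invented data).
ANNOTATED = {
    "g1": {((100, 200), (300, 400)), ((100, 200), (350, 400))},
    "g2": {((1000, 1500),)},
}

# Toy predictions: the first matches the second isoform of g1,
# the second misses g2 by one exon boundary, the third is spurious.
PREDICTED = [
    ((100, 200), (350, 400)),
    ((1000, 1400),),
    ((5000, 5100),),
]

gene_sn, gene_sp = gene_level_accuracy(PREDICTED, ANNOTATED)
print(f"gene-level Sn = {gene_sn:.3f}, Sp = {gene_sp:.3f}")
```

Because a single shifted exon boundary makes the whole gene count as wrong, gene-level figures computed this way are stricter than exon-level or intron-level ones.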

Grants and funding

R01 GM128145/GM/NIGMS NIH HHS/United States
