Fishing for a reelGene: evaluating gene models with evolution and machine learning
- PMID: 40983054
- DOI: 10.1111/tpj.70483
Abstract
Assembled genomes and their associated annotations have transformed our study of gene function. However, each new annotated assembly generates new gene models. Inconsistencies between annotations likely arise from biological and technical causes, including pseudogene misclassification, transposon activity, and intron retention from sequencing of unspliced transcripts. To evaluate gene model predictions, we developed reelGene, a pipeline of machine learning models focused on (1) transcription boundaries, (2) mRNA integrity, and (3) protein structure. The first two models leverage sequence characteristics and evolutionary conservation across related taxa to learn the grammar of conserved transcription boundaries and mRNA sequences, while the third uses the conserved evolutionary grammar of protein sequences to predict whether a gene can produce a protein. Evaluating 1.8 million transcript models in Zea mays ssp. mays (maize), reelGene classified 28% as incorrectly annotated or non-functional. We find that reelGene classifies 92.2% of genes in the maize proteome and 99.2% of genes within the maize classical gene list as functional. reelGene also provides a way to further investigate genome biology. For instance, reelGene indicates that 10.3% of dispensable genes in B73 are functional, and within retained duplicate genes, reelGene identifies a 30% bias toward the retention of the M1 subgenome when one copy is functional and the other is non-functional. As an annotation-evaluating tool, reelGene is directly applicable to species of the Andropogoneae tribe, including other important crops like sorghum and miscanthus. As a community resource, reelGene has been integrated into MaizeGDB both as a browser track and as an individual Shiny App, allowing researchers to evaluate gene model accuracy and further investigate genome biology.
Keywords: evolution; gene annotation; gene models; genome biology; machine learning; maize.
© 2025 Society for Experimental Biology and John Wiley & Sons Ltd.