Genome Biol Evol. 2024 Dec 4;16(12):evae252.

doi: 10.1093/gbe/evae252.

TIdeS: A Comprehensive Framework for Accurate Open Reading Frame Identification and Classification in Eukaryotic Transcriptomes

Xyrus X Maurer-Alcalá¹, Eunsoo Kim¹ ²

Affiliations

¹ Division of Invertebrate Zoology and Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA.
² Division of EcoScience, Ewha Womans University, Seoul, South Korea.

PMID: 39570867
PMCID: PMC11631190
DOI: 10.1093/gbe/evae252

Xyrus X Maurer-Alcalá et al. Genome Biol Evol. 2024.

Abstract

Studying fundamental aspects of eukaryotic biology through genetic information can face numerous challenges, including contamination and intricate biotic interactions, which are particularly pronounced when working with uncultured eukaryotes. However, existing tools for predicting open reading frames (ORFs) from transcriptomes are limited in these scenarios. Here we introduce Transcript Identification and Selection (TIdeS), a framework designed to address these nontrivial challenges associated with current 'omics approaches. Using transcriptomes from 32 taxa, representing the breadth of eukaryotic diversity, TIdeS outperforms most conventional ORF-prediction methods (e.g. TransDecoder), identifying a greater proportion of complete and in-frame ORFs. Additionally, TIdeS accurately classifies ORFs using minimal input data, even in the presence of "heavy contamination". This built-in flexibility extends to previously unexplored biological interactions, offering a robust one-stop solution for precise ORF predictions and subsequent decontamination. Beyond applications in phylogenomic-based studies, TIdeS provides a robust means to explore biotic interactions in eukaryotes (e.g. host-symbiont, prey-predator) and for reproducible dataset curation from transcriptomes and genomes.

Keywords: ORF prediction; biotic interactions; contamination; machine learning; phylogenomics.

Figures

**Fig. 1.**
Overview of the TIdeS workflow for ORF-calling and ORF-classification. ORF-calling requires only a FASTA file of transcripts, which can *optionally* be processed to remove short transcripts, rRNA by-catch, and redundant isoforms. Afterwards, TIdeS extracts all ORFs (partial and/or complete). ORF classification requires a FASTA file of predicted ORFs and a table of user-defined ORFs. Composition (k-mer-based) features are extracted for all ORFs, with the data for training ORFs passed to the SVM for training and hyperparameter optimization. Finally, TIdeS classifies the query ORFs and outputs its predictions together with the trained model, which can itself be provided as an optional input.
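
To make the two stages in this caption concrete, the following is a minimal sketch of six-frame ORF extraction and k-mer composition counting. It is not the TIdeS implementation: the function names (`complete_orfs`, `kmer_frequencies`), the `min_codons` threshold, the choice of k = 3, and the use of the standard genetic code's stop codons are all assumptions made for illustration, and only complete (ATG to stop) ORFs are handled, whereas the caption notes that TIdeS also extracts partial ORFs.

```python
from collections import Counter

# Assumption: standard genetic code; organisms with reassigned stop
# codons would need a different set here.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def complete_orfs(seq, min_codons=3):
    """Return (strand, frame, orf) for each ATG...stop ORF in six frames.

    min_codons is a hypothetical length filter counted in codons,
    including the start and stop codons.
    """
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        found.append((strand, frame, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return found


def kmer_frequencies(orf, k=3):
    """Relative frequencies of overlapping k-mers in one ORF."""
    counts = Counter(orf[i:i + k] for i in range(len(orf) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}
```

In a classification setting, the frequency dictionaries for the training ORFs would be converted to fixed-length vectors and passed to a classifier such as an SVM; that step is omitted here to keep the sketch dependency-free.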

**Fig. 2.**
Comparisons of ORF prediction precision and recall for TIdeS and widely used ORF prediction tools and approaches. a) Total ORF prediction precision is comparable across the tools tested, whereas prediction recall by TIdeS and GeneMarkS-T greatly outperforms other approaches. b) TIdeS predicts complete ORFs, for evolutionarily diverse eukaryotic taxa, with greater precision and recall than common prediction tools and approaches.
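
For readers unfamiliar with the two metrics compared in this figure, the sketch below computes precision and recall from sets of ORF identifiers. How the study decided that a predicted ORF matches a reference ORF is not stated in this record, so treating a match as identical identifiers is an assumption, as is the function name `precision_recall`.

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted ORF IDs against a reference set.

    precision = true positives / all predictions
    recall    = true positives / all reference ORFs
    Assumption: a prediction counts as correct when its identifier
    appears in the reference set.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```

A tool can therefore show precision comparable to its competitors while recovering far more of the reference ORFs, which is the pattern panel a) describes.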

**Fig. 3.**
Limited training data are needed for accurate sequence classification. Across different scenarios of ORF composition (distinct, similar, and identical) and proportions of taxon representation, TIdeS can accurately classify ORFs with minimal training data. In scenarios where sequence composition is near identical, larger training datasets dramatically impact sequence classification, especially when taxon representation is near equal (i.e. 50:50). ORF composition is shown in each facet. %GC12: GC-content at codon 1st/2nd positions; %GC3: GC-content at codon 3rd positions.
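
The two composition axes defined at the end of this caption can be computed directly from an in-frame ORF. The following is a minimal sketch, assuming the sequence starts in frame and ignoring any trailing partial codon; the function name `gc12_gc3` is hypothetical.

```python
def gc12_gc3(orf):
    """Return (%GC12, %GC3) for an in-frame ORF.

    %GC12: GC content over 1st and 2nd codon positions combined.
    %GC3:  GC content over 3rd codon positions.
    Trailing bases that do not form a full codon are ignored.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    pos12 = "".join(codon[:2] for codon in codons)
    pos3 = "".join(codon[2] for codon in codons)

    def percent_gc(bases):
        if not bases:
            return 0.0
        return 100.0 * sum(b in "GC" for b in bases) / len(bases)

    return percent_gc(pos12), percent_gc(pos3)
```

Plotting %GC12 against %GC3 for every ORF gives the kind of composition plot referred to in the figures, where ORFs from taxa with different codon usage tend to form separate clouds.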

**Fig. 4.**
TIdeS can accurately classify ORFs in multitaxon scenarios. Left, sequence composition plots for the unclassified predicted ORFs from a) an in silico contaminated transcriptome of a large predatory ciliate (*Stentor coeruleus*), diatom (*Phaeodactylum tricornutum*), and green alga (*Chlamydomonas reinhardtii*) and b) a transcriptome of a red alga (*Lithophyllum stictiforme*) heavily contaminated with an unidentified metazoan and a rhizarian. Right, sequence composition plot following classification with TIdeS, using minimal training data based on ORF composition from the initial Pre-TIdeS plot. %GC12: GC-content at codon 1st/2nd positions; %GC3: GC-content at codon 3rd positions.

**Fig. 5.**
TIdeS accurately classifies ciliate (*Platyophrya*) and contaminant (*Bodo*-like flagellate) sequences with phylogenomic-inferred training data. a) Exemplar single-gene phylogeny shows *Platyophrya* protein sequences (diamonds) nestled among other ciliates (Alveolata) and unexpectedly among kinetoplastids (Discoba) implying a contaminated dataset. b) *Platyophrya*'s composition “fingerprint” suggests strong, but not easily delineated, contamination. c) TIdeS accurately classifies *Platyophrya* sequences from 50 training sequences, enabling easy phylogenomic identification of the contaminant data (*Bodo*-like nanoflagellate). b and c) %GC12: GC-content at codon 1st/2nd positions; %GC3: GC-content at codon 3rd positions.

Grants and funding

876199/Simons Foundation
