Genome Biol Evol. 2024 Dec 4;16(12):evae252.
doi: 10.1093/gbe/evae252.

TIdeS: A Comprehensive Framework for Accurate Open Reading Frame Identification and Classification in Eukaryotic Transcriptomes

Xyrus X Maurer-Alcalá et al. Genome Biol Evol. 2024.

Abstract

Studying fundamental aspects of eukaryotic biology through genetic information can face numerous challenges, including contamination and intricate biotic interactions, which are particularly pronounced when working with uncultured eukaryotes. However, existing tools for predicting open reading frames (ORFs) from transcriptomes are limited in these scenarios. Here we introduce Transcript Identification and Selection (TIdeS), a framework designed to address these nontrivial challenges associated with current 'omics approaches. Using transcriptomes from 32 taxa, representing the breadth of eukaryotic diversity, TIdeS outperforms most conventional ORF-prediction methods (e.g. TransDecoder), identifying a greater proportion of complete and in-frame ORFs. Additionally, TIdeS accurately classifies ORFs using minimal input data, even in the presence of "heavy contamination". This built-in flexibility extends to previously unexplored biological interactions, offering a robust single-stop solution for precise ORF predictions and subsequent decontamination. Beyond applications in phylogenomic-based studies, TIdeS provides a robust means to explore biotic interactions in eukaryotes (e.g. host-symbiont, prey-predator) and to curate datasets reproducibly from transcriptomes and genomes.

Keywords: ORF prediction; biotic interactions; contamination; machine learning; phylogenomics.

Figures

Fig. 1.
Overview of the TIdeS workflow for ORF-calling and ORF-classification. For ORF-calling, only a FASTA file of transcripts is required; it can optionally be processed to remove short transcripts, rRNA by-catch, and redundant isoforms. Afterwards, TIdeS extracts all ORFs (partial and/or complete). For ORF-classification, a FASTA file of predicted ORFs and a table of user-defined ORFs must be provided. Compositional (k-mer-based) features are extracted for all ORFs, and the data for the training ORFs are passed to the SVM for training and hyperparameter optimization. Finally, TIdeS classifies the query ORFs and outputs its predictions together with the trained model, which can itself be provided as an optional input.
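The ORF-extraction and k-mer composition steps described in this caption can be illustrated with a minimal sketch. It is not the TIdeS implementation: it assumes the standard genetic code (ATG start; TAA, TAG, and TGA stops), scans only the three forward frames, returns only complete ORFs, and the function names `find_complete_orfs` and `kmer_frequencies` are hypothetical.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def find_complete_orfs(transcript, min_codons=3):
    """Return complete ORFs (ATG through stop) from the three forward frames."""
    seq = transcript.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) // 3 >= min_codons:
                    orfs.append(orf)
                start = None
    return orfs


def kmer_frequencies(orf, k=3):
    """Relative frequency of every overlapping A/C/G/T k-mer in an ORF."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(orf[i:i + k] for i in range(len(orf) - k + 1))
    # Only unambiguous k-mers contribute; "or 1" avoids division by zero.
    total = sum(counts[kmer] for kmer in kmers) or 1
    return {kmer: counts[kmer] / total for kmer in kmers}
```

A fixed-length vector such as `list(kmer_frequencies(orf).values())` is the kind of feature an SVM could be trained on; the features and k value that TIdeS actually uses are not stated in this record.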
Fig. 2.
Comparisons of ORF prediction precision and recall for TIdeS and widely used ORF prediction tools and approaches. a) Total ORF prediction precision is comparable across the tools tested, whereas prediction recall by TIdeS and GeneMarkS-T greatly outperforms other approaches. b) TIdeS predicts complete ORFs, for evolutionarily diverse eukaryotic taxa, with greater precision and recall than common prediction tools and approaches.
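For readers unfamiliar with the two metrics in this caption, the sketch below computes precision and recall under the standard definitions (true positives over all predictions, and true positives over all reference items). How the study decides that a predicted ORF matches a reference ORF is not stated here, so exact identifier matching is an assumption, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted ORF identifiers against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    # Guard against empty inputs instead of dividing by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```

A tool that reports few ORFs can score high precision but low recall, which is why the caption reports the two metrics separately.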
Fig. 3.
Limited training data are needed for accurate sequence classification. Across different scenarios of ORF composition (distinct, similar, and identical) and proportions of taxon representation, TIdeS can accurately classify ORFs with minimal training data. In scenarios where sequence composition is near identical, larger training datasets dramatically impact sequence classification, especially when taxon representation is near equal (i.e. 50:50). ORF composition is shown in each facet. %GC12: GC-content at codon 1st/2nd positions; %GC3: GC-content at codon 3rd positions.
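The two composition axes defined in this caption, %GC12 and %GC3, can be computed from an in-frame ORF as sketched below. This is an illustrative calculation, not code from TIdeS: the function name `gc_by_codon_position` is hypothetical, and the sketch assumes the sequence starts at codon position 1 and counts every codon, including any terminal stop codon.

```python
def gc_by_codon_position(orf):
    """Return (%GC12, %GC3) for an in-frame coding sequence."""
    seq = orf.upper()
    seq = seq[: len(seq) - len(seq) % 3]  # drop a trailing partial codon
    if not seq:
        raise ValueError("sequence is shorter than one codon")

    def gc_percent(bases):
        return 100.0 * sum(base in "GC" for base in bases) / len(bases)

    first_and_second = seq[0::3] + seq[1::3]  # codon positions 1 and 2
    third = seq[2::3]                         # codon position 3
    return gc_percent(first_and_second), gc_percent(third)
```

Plotting these two values against each other for every ORF gives the kind of composition plot described in the captions; taxa with distinct codon usage tend to form separate clusters.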
Fig. 4.
TIdeS can accurately classify ORFs in multitaxon scenarios. Left, sequence composition plots for the unclassified predicted ORFs from a) an in silico contaminated transcriptome of a large predatory ciliate (Stentor coeruleus), diatom (Phaeodactylum tricornutum), and green alga (Chlamydomonas reinhardtii) and b) a transcriptome of a red alga (Lithophyllum stictiforme) heavily contaminated with an unidentified metazoan and a rhizarian. Right, sequence composition plot following classification with TIdeS, using minimal training data based on ORF composition from the initial Pre-TIdeS plot. %GC12: GC-content at codon 1st/2nd positions; %GC3: GC-content at codon 3rd positions.
Fig. 5.
TIdeS accurately classifies ciliate (Platyophrya) and contaminant (Bodo-like flagellate) sequences with phylogenomic-inferred training data. a) Exemplar single-gene phylogeny shows Platyophrya protein sequences (diamonds) nestled among other ciliates (Alveolata) and unexpectedly among kinetoplastids (Discoba) implying a contaminated dataset. b) Platyophrya's composition “fingerprint” suggests strong, but not easily delineated, contamination. c) TIdeS accurately classifies Platyophrya sequences from 50 training sequences, enabling easy phylogenomic identification of the contaminant data (Bodo-like nanoflagellate). b and c) %GC12: GC-content at codon 1st/2nd positions; %GC3: GC-content at codon 3rd positions.
