AnnotaPipeline: An integrated tool to annotate eukaryotic proteins using multi-omics data

Guilherme Augusto Maia¹, Vilmar Benetti Filho¹, Eric Kazuo Kawagoe¹, Tatiany Aparecida Teixeira Soratto¹, Renato Simões Moreira¹,², Edmundo Carlos Grisard¹,³, Glauber Wagner¹,³

Affiliations

¹ Laboratório de Bioinformática, Universidade Federal de Santa Catarina (UFSC), Campus João David Ferreira Lima, Florianópolis, Brazil.
² Instituto Federal de Santa Catarina (IFSC), Campus Lages, Lages, Brazil.
³ Laboratório de Protozoologia, Universidade Federal de Santa Catarina (UFSC), Campus João David Ferreira Lima, Florianópolis, Brazil.

PMID: 36482896
PMCID: PMC9723129
DOI: 10.3389/fgene.2022.1020100

Front Genet. 2022 Nov 22;13:1020100. doi: 10.3389/fgene.2022.1020100. eCollection 2022.

Abstract

Assigning gene function is a crucial but laborious and time-consuming step in genomics. Because a variety of sequencing platforms generate ever-increasing amounts of data, manual annotation is no longer feasible. There is therefore a pressing need for an integrated, automated pipeline that uses experimental data to validate in silico predictions of gene function. Here, we present AnnotaPipeline, a computational workflow that integrates distinct software and data types in a proteogenomic approach to annotate and validate predicted features in genomic sequences. The pipeline starts from (i) nucleotide or (ii) protein sequences in FASTA format, or (iii) structural annotation files (GFF3). Users can also input RNA-seq data in FASTQ format and MS/MS data in mzXML or similar formats; the pipeline uses this transcriptomic and proteomic information to corroborate annotations and validate gene predictions, providing transcription and expression evidence for functional annotation. We used AnnotaPipeline to reannotate the available Arabidopsis thaliana, Caenorhabditis elegans, Candida albicans, Trypanosoma cruzi, and Trypanosoma rangeli genomes, obtaining a higher proportion of annotated proteins and a lower proportion of hypothetical proteins than in the publicly available annotations for these organisms. AnnotaPipeline is a Unix-based pipeline developed in Python and is available at https://github.com/bioinformatics-ufsc/AnnotaPipeline.

Keywords: functional annotation; genome annotation; hypothetical proteins; proteogenomics; workflow.
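
The comparison reported in the abstract rests on the proportion of hypothetical proteins in an annotation. As a minimal sketch of how such a proportion could be computed from FASTA headers, the Python below counts headers containing the phrase "hypothetical protein". This is not AnnotaPipeline code: the function names, the matching rule, and the example records are assumptions made for illustration only.

```python
from typing import Iterable, Iterator


def fasta_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA header lines without the leading '>'."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


def hypothetical_fraction(headers: Iterable[str]) -> float:
    """Fraction of headers labelled as hypothetical proteins.

    Assumption: a protein counts as hypothetical when its header
    contains the phrase 'hypothetical protein' (case-insensitive).
    """
    total = 0
    hypothetical = 0
    for header in headers:
        total += 1
        if "hypothetical protein" in header.lower():
            hypothetical += 1
    # Avoid division by zero for an empty annotation.
    return hypothetical / total if total else 0.0


# Made-up records, used only to exercise the functions.
example = """>prot1 hypothetical protein
MKT
>prot2 heat shock protein 70
MAA
>prot3 Hypothetical protein, conserved
MSS
>prot4 trans-sialidase
MLL
""".splitlines()

fraction = hypothetical_fraction(fasta_headers(example))
print(f"{fraction:.2f}")
```

Running the same function over the headers of two annotations of one genome would give the two proportions to compare.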

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**FIGURE 1**
Overview of AnnotaPipeline workflow, indicating the optional and the required inputs from the user, the internal processes, and the output layers.
