Gigascience. 2021 Jun 2;10(6):giab042.

doi: 10.1093/gigascience/giab042.

Mantis: flexible and consensus-driven genome annotation

Pedro Queirós¹, Francesco Delogu¹, Oskar Hickl², Patrick May², Paul Wilmes¹

Affiliations

¹ Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg.
² Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 Avenue du Swing, 4367 Esch-sur-Alzette, Luxembourg.

PMID: 34076241
PMCID: PMC8170692
DOI: 10.1093/gigascience/giab042

Pedro Queirós et al. Gigascience. 2021.

Abstract

Background: The rapid development of the (meta-)omics fields has produced an unprecedented amount of high-resolution and high-fidelity data. Using these datasets, we can infer the roles of previously functionally unannotated proteins from single organisms and consortia. In this context, protein function annotation can be described as the identification of regions of interest (i.e., domains) in protein sequences and the assignment of biological functions. Despite the existence of numerous tools, challenges remain in terms of speed, flexibility, and reproducibility. In the big data era, it is also increasingly important to stop limiting findings to a single reference and instead to coalesce knowledge from different data sources, thereby overcoming some of the limitations of relying too heavily on computationally generated data from single sources.

Results: We implemented a protein annotation tool, Mantis, which uses the intersection of database identifiers and text mining to integrate knowledge from multiple reference data sources into a single consensus-driven output. Mantis is flexible, allowing for the customization of reference data and execution parameters, and is reproducible across different research goals and user environments. We implemented a depth-first search algorithm for domain-specific annotation, which significantly improved annotation performance compared to sequence-wide annotation. The parallelized implementation of Mantis results in short runtimes while also producing high-coverage, high-quality protein function annotations.

Conclusions: Mantis is a protein function annotation tool that produces high-quality consensus-driven protein annotations. It is easy to set up, customize, and use, scaling from single genomes to large metagenomes. Mantis is available under the MIT license at https://github.com/PedroMTQ/mantis.

Keywords: HMM; bioinformatics; consensus; homology; protein function annotation.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Figure 1:**
Overview of the Mantis workflow. KOfam [55], Pfam [56], eggNOG [57], NCBI protein family models (NPFM) [58], and TIGRfams [59] are the reference HMMs currently used in Mantis. CustomDB can be any HMM library provided by the user.

**Figure 2:**
Homolog selection for the 3 hit-processing algorithms in Mantis. The selection of the hit(s) depends on the underlying algorithm. In the case of the portrayed protein with 6 hits (A), which overlap to various degrees and have varying significance values (B), the 3 algorithms would behave as follows: (i) BPO would select only the most significant hit (No. 2); (ii) the heuristic algorithm initially selects the most significant hit (No. 2), which then restricts (due to overlapping residues) the hits available for selection (hits 1, 3, and 4 can no longer be selected), leading to the selection of the next most significant hit (No. 6), and finally the selection of hit 5; (iii) the DFS algorithm generates all possible combinations of hits, which are then scored according to the e-value, hit coverage, and total combination coverage (for more details, see “Multiple hits per protein”). According to these parameters, the most likely combination of hits would be hits 1 and 4.
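The three selection strategies described in this caption can be sketched as follows. This is a minimal illustration, not Mantis's implementation: the `Hit` type, the function names, the coordinates, and the e-values are all hypothetical, chosen only so that the outcome mirrors the caption. It also assumes that any shared residue counts as an overlap, and it stops at enumerating the DFS combinations, because the scoring equation is given in the full text and not in this record.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    hit_id: int
    start: int      # first residue covered (inclusive)
    end: int        # last residue covered (inclusive)
    evalue: float   # lower means more significant


# Hypothetical hits, arranged to reproduce the situation in the caption.
HITS = [
    Hit(1, 1, 45, 1e-20),
    Hit(2, 30, 65, 1e-50),
    Hit(3, 35, 60, 1e-15),
    Hit(4, 50, 100, 1e-25),
    Hit(5, 70, 100, 1e-5),
    Hit(6, 1, 25, 1e-10),
]


def overlaps(a: Hit, b: Hit) -> bool:
    """Assumption: hits sharing at least one residue are treated as overlapping."""
    return a.start <= b.end and b.start <= a.end


def best_prediction_only(hits):
    """BPO: keep only the single most significant hit."""
    return min(hits, key=lambda h: h.evalue)


def heuristic(hits):
    """Greedy: take hits by increasing e-value, skipping any that overlap a chosen hit."""
    selected = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if all(not overlaps(hit, chosen) for chosen in selected):
            selected.append(hit)
    return selected


def dfs_combinations(hits):
    """Depth-first enumeration of every non-empty set of mutually non-overlapping hits."""
    ordered = sorted(hits, key=lambda h: h.start)
    combos = []

    def extend(combo, next_index):
        if combo:
            combos.append(tuple(h.hit_id for h in combo))
        for i in range(next_index, len(ordered)):
            if all(not overlaps(ordered[i], h) for h in combo):
                extend(combo + [ordered[i]], i + 1)

    extend([], 0)
    return combos
```

With these made-up hits, BPO returns hit 2, the heuristic returns hits 2, 6, and 5 in that order, and the DFS enumeration includes the combination of hits 1 and 4, which a scoring step could then rank above the greedy choice.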

**Figure 3:**
Annotation F1 score per hit-processing algorithm and sample. Overall, the DFS and heuristic algorithms achieve similar results, outperforming the BPO algorithm.
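For reference, the F1 score is conventionally the harmonic mean of precision and recall. The definition below is the standard one; how true positives (TP), false positives (FP), and false negatives (FN) are counted for annotations is specified in the full text and is not reproduced here.

```latex
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},
\qquad
\mathrm{precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{recall} = \frac{TP}{TP + FN}
```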

**Figure 4:**
F1 score per hit-processing algorithm and organism, with and without using taxonomy information. F1 score was higher for well-studied organisms; TSHMMs also tend to perform better with these organisms.

**Figure 5:**
Annotation F1 score of Mantis, eggNOG-mapper, and Prokka using different reference data. Each slice represents an organism and contains the F1 score obtained between the different conditions.

**Figure 6:**
The impact of the reference data completeness on protein function annotation. A. The functional prediction is facilitated by the query sequence being previously identified and included in the reference HMMs. B. If the query sequence has not been previously annotated, multiple regions in the protein may match with different reference HMMs.

**Figure 7:**
Inter-HMM hit-processing steps. Inter-HMM hit processing starts by pooling all hits [A1, AN] together (regardless of the reference data source) and generating all the possible combinations [c1, cN] of hits with non-overlapping coordinates (A). A metadata consistency graph (B) is also built by connecting all nodes [M1, MN] that have intersecting IDs or highly similar descriptions (e.g., A1’s metadata M1 is consistent with A2’s metadata M2 (shared ID1), and A5’s metadata M5 is consistent with A6’s metadata M6 (similar description "glucose degradation")). With this metadata consistency graph, the hit consistency (HCN) score of each combination is calculated. For c1, for example, a sub-graph containing M1, M5, and all directly connected nodes would be created (only M2 and M6, but not M4, because hit A4 has insufficient residue overlap). The number of nodes in this sub-graph would then be divided by the total number of nodes in the original graph; therefore c1 would have an HCN of (2 + 2)/8 = 0.5. The remaining parameters would then be calculated and the best combination, according to equation 2, would be selected. Finally, if, for example, the best combination is c1, then this combination is expanded by merging all nodes that are directly or indirectly connected to M1 and M5 in the metadata consistency graph (C) and have sufficient residue overlap (i.e., M2, M6, M7, M8). The expanded combination is then merged into the final consensus annotation (D).
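The HCN arithmetic and the expansion step in this caption can be traced with a small graph. The sketch below is illustrative only and is not Mantis's code: the function names are hypothetical, and only the edges M1-M2 and M5-M6 come from the caption. The remaining edges (M1-M4, M2-M7, M6-M8) and the set of nodes with sufficient residue overlap are assumptions chosen to reproduce the caption's outcome.

```python
from collections import deque

NODES = ["M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8"]

# M1-M2 and M5-M6 are stated in the caption; the other edges are assumed.
EDGES = [("M1", "M2"), ("M5", "M6"), ("M1", "M4"), ("M2", "M7"), ("M6", "M8")]

# Assumed: nodes whose hits overlap the combination's residues sufficiently.
# M4 is left out, matching the caption's note about hit A4.
OVERLAP_OK = {"M2", "M6", "M7", "M8"}


def build_graph(nodes, edges):
    """Undirected adjacency sets for the metadata consistency graph."""
    graph = {node: set() for node in nodes}
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def hit_consistency(graph, combo, overlap_ok):
    """Combination nodes plus qualifying direct neighbours, over all graph nodes."""
    members = set(combo)
    for node in combo:
        members.update(n for n in graph[node] if n in overlap_ok)
    return len(members) / len(graph)


def expand_combination(graph, combo, overlap_ok):
    """Breadth-first walk collecting directly or indirectly connected qualifying nodes."""
    seen = set(combo)
    queue = deque(combo)
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen and neighbour in overlap_ok:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


GRAPH = build_graph(NODES, EDGES)
C1 = ["M1", "M5"]
HCN_C1 = hit_consistency(GRAPH, C1, OVERLAP_OK)
EXPANDED_C1 = expand_combination(GRAPH, C1, OVERLAP_OK)
```

Under these assumptions, `HCN_C1` is 4/8 = 0.5, and `EXPANDED_C1` contains M1 and M5 together with M2, M6, M7, and M8, as in panel C.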

Cited by

Challenges, Strategies, and Perspectives for Reference-Independent Longitudinal Multi-Omic Microbiome Studies.
Martínez Arbas S, Busi SB, Queirós P, de Nies L, Herold M, May P, Wilmes P, Muller EEL, Narayanasamy S. Front Genet. 2021 Jun 14;12:666244. doi: 10.3389/fgene.2021.666244. eCollection 2021. PMID: 34194470. Free PMC article.
Functional prediction of proteins from the human gut archaeome.
Novikova PV, Bhanu Busi S, Probst AJ, May P, Wilmes P. ISME Commun. 2024 Jan 10;4(1):ycad014. doi: 10.1093/ismeco/ycad014. eCollection 2024 Jan. PMID: 38486809. Free PMC article.
Microbial communities reveal niche partitioning across the slope and bottom zones of the Challenger Deep.
Hu A, Zhao W, Wang J, Qi Q, Xiao X, Jing H. Environ Microbiol Rep. 2024 Aug;16(4):e13314. doi: 10.1111/1758-2229.13314. PMID: 39086173. Free PMC article.
Phylogenomic Analyses of Snodgrassella Isolates from Honeybees and Bumblebees Reveal Taxonomic and Functional Diversity.
Cornet L, Cleenwerck I, Praet J, Leonard RR, Vereecken NJ, Michez D, Smagghe G, Baurain D, Vandamme P. mSystems. 2022 Jun 28;7(3):e0150021. doi: 10.1128/msystems.01500-21. Epub 2022 May 23. PMID: 35604118. Free PMC article.
Critical Assessment of MetaProteome Investigation (CAMPI): a multi-laboratory comparison of established workflows.
Van Den Bossche T, Kunath BJ, Schallert K, Schäpe SS, Abraham PE, Armengaud J, Arntzen MØ, Bassignani A, Benndorf D, Fuchs S, Giannone RJ, Griffin TJ, Hagen LH, Halder R, Henry C, Hettich RL, Heyer R, Jagtap P, Jehmlich N, Jensen M, Juste C, Kleiner M, Langella O, Lehmann T, Leith E, May P, Mesuere B, Miotello G, Peters SL, Pible O, Queiros PT, Reichl U, Renard BY, Schiebenhoefer H, Sczyrba A, Tanca A, Trappe K, Trezzi JP, Uzzau S, Verschaffelt P, von Bergen M, Wilmes P, Wolf M, Martens L, Muth T. Nat Commun. 2021 Dec 15;12(1):7305. doi: 10.1038/s41467-021-27542-8. PMID: 34911965. Free PMC article.

Grants and funding

863664/ERC_/European Research Council/International
