mSystems. 2023 Apr 27;8(2):e0117822.
doi: 10.1128/msystems.01178-22. Epub 2023 Mar 7.

Comprehensive Functional Annotation of Metagenomes and Microbial Genomes Using a Deep Learning-Based Method

Mary Maranga et al. mSystems.

Abstract

Comprehensive protein function annotation is essential for understanding microbiome-related disease mechanisms in the host organisms. However, a large portion of human gut microbial proteins lack functional annotation. Here, we have developed a new metagenome analysis workflow integrating de novo genome reconstruction, taxonomic profiling, and deep learning-based functional annotations from DeepFRI. This is the first approach to apply deep learning-based functional annotations in metagenomics. We validate DeepFRI functional annotations by comparing them to orthology-based annotations from eggNOG on a set of 1,070 infant metagenomes from the DIABIMMUNE cohort. Using this workflow, we generated a sequence catalogue of 1.9 million nonredundant microbial genes. The functional annotations revealed 70% concordance between Gene Ontology annotations predicted by DeepFRI and eggNOG. DeepFRI improved the annotation coverage, with 99% of the gene catalogue obtaining Gene Ontology molecular function annotations, although they are less specific than those from eggNOG. Additionally, we constructed pangenomes in a reference-free manner using high-quality metagenome-assembled genomes (MAGs) and analyzed the associated annotations. eggNOG annotated more genes on well-studied organisms, such as Escherichia coli, while DeepFRI was less sensitive to taxa. Further, we show that DeepFRI provides additional annotations in comparison to the previous DIABIMMUNE studies. This workflow will contribute to novel understanding of the functional signature of the human gut microbiome in health and disease as well as guiding future metagenomics studies.

IMPORTANCE The past decade has seen advancement in high-throughput sequencing technologies resulting in rapid accumulation of genomic data from microbial communities. While this growth in sequence data and gene discovery is impressive, the majority of microbial gene functions remain uncharacterized. The coverage of functional information coming from either experimental sources or inferences is low. To solve these challenges, we have developed a new workflow to computationally assemble microbial genomes and annotate the genes using a deep learning-based model DeepFRI. This improved microbial gene annotation coverage to 1.9 million metagenome-assembled genes, representing 99% of the assembled genes, which is a significant improvement compared to 12% Gene Ontology term annotation coverage by commonly used orthology-based approaches. Importantly, the workflow supports pangenome reconstruction in a reference-free manner, allowing us to analyze the functional potential of individual bacterial species. We therefore propose this alternative approach combining deep-learning functional predictions with the commonly used orthology-based annotations as one that could help us uncover novel functions observed in metagenomic microbiome studies.
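The coverage and concordance figures quoted in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the abstract does not state how concordance was defined, so the sketch assumes that a gene counts as concordant when the two tools share at least one Gene Ontology term. The function names, gene IDs, and toy annotations are all hypothetical.

```python
from typing import Dict, Set

# Gene ID -> set of GO term IDs assigned by one annotation tool.
Annotations = Dict[str, Set[str]]


def coverage(annotations: Annotations, catalogue_size: int) -> float:
    """Fraction of catalogue genes that received at least one GO term."""
    annotated = sum(1 for terms in annotations.values() if terms)
    return annotated / catalogue_size


def concordance(a: Annotations, b: Annotations) -> float:
    """Fraction of genes annotated by both tools that share >= 1 GO term.

    Assumption (not from the paper): "concordant" means the two GO term
    sets for a gene have a non-empty intersection.
    """
    shared_genes = [g for g in a if a[g] and b.get(g)]
    if not shared_genes:
        return 0.0
    agreeing = sum(1 for g in shared_genes if a[g] & b[g])
    return agreeing / len(shared_genes)


# Toy data, invented for illustration only.
deepfri = {
    "g1": {"GO:0003824"},
    "g2": {"GO:0005515"},
    "g3": {"GO:0016787"},
    "g4": {"GO:0003677"},
}
eggnog = {
    "g1": {"GO:0003824", "GO:0016740"},
    "g2": {"GO:0003677"},
}

print(coverage(deepfri, 4))          # 1.0
print(coverage(eggnog, 4))           # 0.5
print(concordance(deepfri, eggnog))  # 0.5
```

In this toy catalogue, one tool annotates every gene while the other annotates half of them, and the two agree on one of the two genes they both annotate. A stricter definition (for example, requiring agreement after propagating terms up the ontology) would give different numbers.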

Keywords: deep learning; functional annotation; gene function; genome; metagenome; metagenome-assembled genomes; metagenomics; microbiome; orthology; pangenome.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

FIG 1
Schematic workflow overview.
FIG 2
Phylogenetic analysis and taxonomic annotation of high-quality metagenome-assembled genomes. (a) Maximum-likelihood phylogenetic tree of 2,255 high-quality, nearly complete genomes. The taxonomy of the MAGs was assigned by GTDB-Tk. The innermost layer corresponds to 10 bacterial classes. The second and third rings represent the proportions of genes annotated by DeepFRI and eggNOG. Bars in the outermost layer indicate the number of gene families per MAG. (b) Distribution of annotated genes per class (bar plot shows mean proportion of annotated genes, and error bars show 5th and 95th percentiles of proportions within a taxonomical class on the x axis). “N” refers to the number of genomes in each class.
FIG 3
Concordance between DeepFRI, eggNOG, and HUMAnN2 annotations. (a) Information content of gene function predictions by DeepFRI, eggNOG, and HUMAnN2. (b) Percentage of concordant and discordant annotations between DeepFRI and eggNOG per information content level. (c) Percentage of concordant and discordant annotations between DeepFRI and HUMAnN2 per information content level. (d) Consensus between DeepFRI and eggNOG annotations. See Table 1 at https://github.com/bioinf-mcb/metagenome_assembly/tree/master/supplementary_tables for the list of information content values.
FIG 4
Comparison of predictions between DeepFRI, eggNOG, and HUMAnN2. (a) Venn diagram of the number of gene sets annotated by DeepFRI and eggNOG gene ontology (all GO terms). (b) Three-way comparisons of gene sets annotated by DeepFRI, eggNOG gene ontology (all GO terms), and eggNOG free text description. (c) Venn diagram comparisons of gene sets annotated by DeepFRI and eggNOG using only informative gene ontology terms. (d) Abundance of genes (in CPM) annotated by DeepFRI and eggNOG and by HUMAnN2 (informative gene ontology terms). The annotation is weighted by relative abundances normalized to copies per million. Annotation rate as a function of cluster size is shown in Fig. S3.
FIG 5
Pangenome patterns within the infant metagenomes. (a) Pangenome size in relation to the number of genomes (MAGs). (b) Number of unique (not shared with other species) and shared accessory genes per pangenome. See Table S1 for a full pangenome size list. (c) Sizes of the core and accessory genomes per species stratified by the functional annotation of genes using eggNOG and DeepFRI (known versus unknown function). Entries are ordered according to the size of the pangenome (20 of 42 species used to construct the pangenomes). The number of annotated genes was computed using only informative Gene Ontology sets.
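
The captions for FIG 3 and FIG 4 refer to the information content of Gene Ontology terms and to abundances normalized to copies per million (CPM). The sketch below shows one common way to compute both, under stated assumptions: information content is taken as IC(t) = -log2 p(t), where p(t) is the fraction of annotated genes carrying term t, and CPM is each gene's share of the total abundance scaled by one million. The paper's exact definitions may differ; the function names and toy values are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def information_content(annotations: Iterable[Iterable[str]]) -> Dict[str, float]:
    """IC(t) = -log2(p(t)), with p(t) the fraction of genes carrying term t.

    Assumption: a simple frequency-based IC over the supplied gene set;
    no propagation up the ontology graph is performed.
    """
    gene_term_sets = [set(terms) for terms in annotations]
    n_genes = len(gene_term_sets)
    counts = Counter(t for terms in gene_term_sets for t in terms)
    return {t: -math.log2(c / n_genes) for t, c in counts.items()}


def to_cpm(raw_abundance: Dict[str, float]) -> Dict[str, float]:
    """Scale abundances so that they sum to one million."""
    total = sum(raw_abundance.values())
    return {g: v / total * 1_000_000 for g, v in raw_abundance.items()}


# Toy data, invented for illustration only.
toy_annotations = [
    ["GO:0003824", "GO:0016787"],
    ["GO:0003824"],
    ["GO:0003824"],
    ["GO:0003824", "GO:0003677"],
]
ic = information_content(toy_annotations)
print(ic["GO:0016787"])  # 2.0 (term seen in 1 of 4 genes)

cpm = to_cpm({"g1": 3.0, "g2": 1.0})
print(cpm["g1"])  # 750000.0
```

A term carried by every gene has an information content of zero, which is why broad terms are considered uninformative and why the figure legends restrict some comparisons to informative terms.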
