Review
Front Microbiol. 2022 Mar 14:13:811495.
doi: 10.3389/fmicb.2022.811495. eCollection 2022.

Machine Learning and Deep Learning Applications in Metagenomic Taxonomy and Functional Annotation

Alban Mathieu et al. Front Microbiol. 2022.

Abstract

Shotgun sequencing of environmental DNA (i.e., metagenomics) has revolutionized the field of environmental microbiology, allowing the characterization of all microorganisms in a sequencing experiment. To identify the microbes in terms of taxonomy and biological activity, the sequenced reads must be aligned to known microbial genomes or genes. However, current alignment methods are limited in speed and can produce a significant number of false positives when detecting bacterial species, or false negatives in specific cases (virus, plasmid, and gene detection). Moreover, recent advances in metagenomics have enabled the reconstruction of new genomes using de novo binning strategies. These genomes, not yet fully characterized, are not used in classic approaches, whereas machine and deep learning methods can use them as models. In this article, we review the different methods and their efficiency in improving the annotation of metagenomic sequences. Deep learning models have reached the performance of the widely used k-mer alignment-based tools, with better accuracy in certain cases; however, they must still demonstrate their robustness across the variety of environmental samples and across the rapid expansion of accessible genomes in databases.

Keywords: classification; deep learning; functional annotation; machine learning; metagenomic; taxonomic annotation; whole genome shotgun.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Factors that influence the capacity of sequence annotation. Parameters, defined in the sequencing and bioinformatic processes, are tunable by the users. Intrinsic factors are characteristics of the studied environment that influence the rate of annotation; by definition, they are not tunable. The cursors indicate where the annotation rate will be the highest. A low sequence identity cutoff for assignment increases the annotation rate, but the trade-off is a higher rate of false positives. Precision of the annotation refers to the degree of annotation examined: for taxonomic assignment, it corresponds to the taxonomic rank used for the analysis; for functional annotation, to the metabolic/anabolic level (genes, short biosynthetic pathways, and global pathways).
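The trade-off described in this caption between the identity cutoff, the annotation rate, and false positives can be illustrated with a toy calculation. The sketch below is not taken from the article: the hit list, the read count, and the function name `annotate` are all hypothetical, chosen only to show how lowering the cutoff annotates more reads while admitting more incorrect assignments.

```python
# Hypothetical alignment hits: (percent identity, whether the assignment is correct).
# These values are invented for illustration and are not results from the article.
HITS = [
    (99.0, True), (97.5, True), (95.0, True), (92.0, False),
    (90.0, True), (85.0, False), (80.0, False), (75.0, False),
]
TOTAL_READS = 10  # assumption: two further reads produced no hit at all


def annotate(hits, total_reads, cutoff):
    """Return (annotation rate, false-positive count) for a given identity cutoff."""
    # Keep only hits whose identity reaches the cutoff.
    kept = [correct for identity, correct in hits if identity >= cutoff]
    annotation_rate = len(kept) / total_reads
    false_positives = sum(1 for correct in kept if not correct)
    return annotation_rate, false_positives


for cutoff in (95.0, 90.0, 80.0):
    rate, fps = annotate(HITS, TOTAL_READS, cutoff)
    print(f"cutoff={cutoff:.0f}%  annotation rate={rate:.0%}  false positives={fps}")
```

In this toy data set, a strict cutoff annotates few reads but keeps only correct assignments, whereas a permissive cutoff annotates more reads at the cost of incorrect ones.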
FIGURE 2
Schematic of deep learning models. The encoded input represents a metagenomic DNA sequence or k-mer that is transformed by the activation functions in the hidden layers. Each gray circle in the hidden layers represents a cell that passes its output to the other cells. As mentioned in the text, LSTM models possess a “forget” gate that selects relevant information. The final output of the hidden layers is the classification, with a predicted probability for an input to belong to each category. During training, the SoftMax function converts the outputs into probabilities between 0 and 1, whereas for final testing the argMAX function is used, which gives a more readily interpretable result by selecting the most probable category.
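The input encoding and the output step described in this caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the architecture of any tool covered by the review: one-hot encoding is only one common way to encode DNA, the raw scores stand in for the output of hidden layers that are not modeled here, and the class labels are hypothetical.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as one 4-element vector per base, in A, C, G, T order.

    Characters outside A/C/G/T (for example N) become an all-zero vector.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def softmax(scores):
    """Convert raw scores into probabilities between 0 and 1 that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical raw scores standing in for the output of the hidden layers.
scores = [2.0, 0.5, -1.0]
classes = ["taxon_A", "taxon_B", "taxon_C"]  # hypothetical category labels

encoded = one_hot("ACGT")
probs = softmax(scores)             # used during training to compute the loss
predicted = classes[argmax(probs)]  # used at test time to report a single category
print(encoded[0], probs, predicted)
```

The probabilities come from the softmax step; argmax only reports which category has the highest probability.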