Front Genet. 2022 Mar 7;13:814437.
doi: 10.3389/fgene.2022.814437. eCollection 2022.

MEDUSA: A Pipeline for Sensitive Taxonomic Classification and Flexible Functional Annotation of Metagenomic Shotgun Sequences

Diego A A Morais et al. Front Genet. 2022.

Abstract

Metagenomic studies unravel details about the taxonomic composition and the functions performed by microbial communities. As a complete metagenomic analysis requires different tools for different purposes, the selection and setup of these tools remain challenging. Furthermore, the chosen toolset affects the accuracy, the formatting, and the functional identifiers reported in the results, which in turn affects how the results are interpreted and the biological answer obtained. Thus, we surveyed state-of-the-art tools available in the literature, created simulated datasets, and performed benchmarks to design a sensitive and flexible metagenomic analysis pipeline. Here we present MEDUSA, an efficient pipeline to conduct comprehensive metagenomic analyses. It performs preprocessing, assembly, alignment, taxonomic classification, and functional annotation on shotgun data, and supports user-built dictionaries to transfer annotations to any functional identifier. MEDUSA includes several tools, such as fastp, Bowtie2, DIAMOND, Kaiju, and MEGAHIT, as well as a novel tool implemented in Python to transfer annotations to BLAST/DIAMOND alignment results. These tools are installed via Conda, and the workflow is managed by Snakemake, easing setup and execution. Compared with MEGAN 6 Community Edition, MEDUSA correctly identifies more species, especially the less abundant ones, and is better suited for functional analysis using Gene Ontology identifiers.

Keywords: bioinformatics; functional annotation; metagenomics; pipeline; shotgun sequences; taxonomic classification.
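
The dictionary-based annotation transfer mentioned in the abstract can be illustrated with a minimal sketch. It is not MEDUSA's actual implementation: the function name `transfer_annotations`, the example accessions, and the dictionary contents are hypothetical. The only format assumption is that the input is BLAST/DIAMOND tabular output, in which the first two columns are the query and subject identifiers.

```python
import csv
import io


def transfer_annotations(alignment_tsv, dictionary):
    """Map each query to the identifiers of the subjects it aligned to.

    alignment_tsv: file-like object with tab-separated alignment rows
                   (column 1 = query ID, column 2 = subject ID).
    dictionary:    user-built mapping from subject ID to a list of
                   functional identifiers (hypothetical content here).
    """
    annotations = {}
    for row in csv.reader(alignment_tsv, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        query, subject = row[0], row[1]
        # Subjects missing from the dictionary contribute nothing.
        for identifier in dictionary.get(subject, ()):
            annotations.setdefault(query, set()).add(identifier)
    return annotations


# Hypothetical alignment hits and dictionary, for illustration only.
hits = io.StringIO(
    "read1\tP12345\t98.0\n"
    "read1\tQ67890\t91.5\n"
    "read2\tX00000\t88.0\n"
)
protein_to_go = {
    "P12345": ["GO:0008150"],
    "Q67890": ["GO:0003674", "GO:0008150"],
}
result = transfer_annotations(hits, protein_to_go)
```

Because the dictionary is an ordinary mapping, the same function could in principle target any identifier system (for example, swapping Gene Ontology terms for another vocabulary) by supplying a different dictionary.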

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
MEDUSA analysis workflow. Squares highlight the protocol steps, and third-party tools are depicted as cyan capsules. The python icon represents the tool implemented for the functional annotation.
FIGURE 2
Trimming tools benchmark. Single-end (SE) and paired-end (PE) inputs, containing 1, 5, 10, and 40 million reads, were processed by the selected tools. A Phred score threshold of 20 was used for all tools. The “time” Unix command was used to measure the elapsed time, and the times depicted in the panels are the average of three runs. Panels (A,C), respectively, depict the time for SE and PE inputs using only one thread. Panels (B,D), respectively, depict the time for SE and PE inputs using four threads.
FIGURE 3
Decontamination tools benchmark for time and Matthews Correlation Coefficient. Single-end (SE) and paired-end (PE) inputs, composed of 25% (b3h1), 50% (b2h2), and 75% (b1h3) human reads, were processed by the selected tools. The Ensembl Homo sapiens GRCh38 DNA primary assembly version 102 was used as a reference to build the indices. The “time” Unix command was used to measure the elapsed time, and the time depicted in panels (A,C) is the average of three runs. Panels (B,D), respectively, depict the Matthews Correlation Coefficient (MCC) for SE and PE inputs.
FIGURE 4
Decontamination tools misclassification benchmark. Panels (A,C), respectively, depict the false negative (FN) counts for the single-end (SE) and paired-end (PE) inputs. Panels (B,D), respectively, depict the false positive (FP) counts for the SE and PE inputs.
FIGURE 5
Taxonomic tools benchmark. Krona and BASTA require an alignment output to classify the reads, while Kaiju and Kraken accept single-end (SE) or paired-end (PE) inputs. DIAMOND was used to align the D1 reads, and the NCBI-nr was used as the reference to build the indices and databases. The “time” Unix command was used to measure the elapsed time, and the time depicted in panel (A) is the average of three runs, not taking into account the time needed to build the indices and databases. Panel (C) depicts the database size in GB, which is smaller for the annotation transfer tools (Krona and BASTA). Panels (B,D), respectively, depict the Matthews Correlation Coefficient (MCC) at the species and genus level. BASTA is not depicted in panel (D) as the classification took more than 20 days.
FIGURE 6
Reads correctly classified in the taxonomic analyses. True positives compared with the expected counts at the species (A) and genus (C) levels. Proportion of true positives to expected counts at the species (B) and genus (D) levels.
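
The Matthews Correlation Coefficient (MCC) reported in Figures 3 and 5, and the proportion of true positives shown in Figure 6, can be computed from classification counts as in the sketch below. This uses the standard MCC definition and is not taken from the authors' benchmark code. The function names and the convention of returning 0.0 when the denominator is zero are assumptions made for illustration.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns a value in [-1, 1]; 1 is perfect agreement, 0 is no better
    than chance, -1 is total disagreement.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: MCC is undefined here, so report 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


def proportion_recovered(true_positives, expected):
    """Fraction of the expected reads that were correctly classified."""
    if expected == 0:
        raise ValueError("expected must be greater than zero")
    return true_positives / expected


# Hypothetical counts, for illustration only.
example_mcc = mcc(tp=90, tn=85, fp=15, fn=10)
example_proportion = proportion_recovered(true_positives=90, expected=100)
```

Unlike plain accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when the classes are imbalanced, as in the inputs with 25% or 75% human reads.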
