Review
Front Bioinform. 2022 Jan 17:1:826370.
doi: 10.3389/fbinf.2021.826370. eCollection 2021.

Challenges in Bioinformatics Workflows for Processing Microbiome Omics Data at Scale

Bin Hu et al. Front Bioinform.

Abstract

The nascent field of microbiome science is transitioning from a descriptive approach of cataloging taxa and functions present in an environment to applying multi-omics methods to investigate microbiome dynamics and function. A large number of new tools and algorithms have been designed and used for very specific purposes on samples collected by individual investigators or groups. While these developments have been quite instructive, the ability to compare microbiome data generated by many groups of researchers is impeded by the lack of standardized application of bioinformatics methods. Additionally, there are few examples of broad bioinformatics workflows that can process metagenome, metatranscriptome, metaproteome and metabolomic data at scale, and no central hub that allows processing, or provides varied omics data that are findable, accessible, interoperable and reusable (FAIR). Here, we review some of the challenges that exist in analyzing omics data within the microbiome research sphere, and provide context on how the National Microbiome Data Collaborative has adopted a standardized and open access approach to address such challenges.

Keywords: bioinformatics; infrastructure; microbial ecology; microbiome; omics.

Conflict of interest statement

Author DW is employed by Polyneme LLC. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Implementation of a data federation model in the NMDC pilot. The central site implements the NMDC Runtime API, which orchestrates the data flow, together with a database that serves as the data registry. The Runtime validates submitted metadata against the NMDC schema and detects new jobs to be done based on submitted-data annotations. Source sites submit raw experimental data and sample metadata to the central site. Compute sites poll the Runtime for new workflow jobs, claim jobs appropriate for their capabilities, and submit workflow job outputs to the central site. Storage sites store raw workflow outputs. The portal site provides a web-based interface. One site can serve as both a compute site and a storage site. Arrows: 1) the portal site gets a data object from the HTTP server at a storage site; 2) the HTTP server retrieves data from a database; 3) a compute site deposits workflow run result data to a database at a separate storage site; 4) compute sites claim computing jobs and provide job execution updates to the job tracking mechanism at the central site; 5, 6, 7) a compute site can also serve as a storage site at the same time; 8) compute jobs are associated with the sample metadata; 9) a source site submits sample metadata to the central site; 10) the central site validates submitted sample metadata; 11) new jobs are created from the submitted sample metadata and become claimable by compute sites; 12) sample metadata can be queried; 13) a set of rules defines the types of computing jobs that each compute site can claim; 14) the portal site queries metadata.
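The poll-and-claim pattern described in this caption can be illustrated with a small in-memory sketch. This is not the NMDC Runtime API: the class names, method names, metadata fields, workflow labels, and site names below are all hypothetical, and the "schema validation" is reduced to a required-field check. It only shows how submitted sample metadata might create jobs, and how per-site rules might limit which jobs a compute site can claim.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Job:
    job_id: str
    workflow: str                     # hypothetical workflow label
    sample_id: str
    claimed_by: Optional[str] = None  # name of the compute site, once claimed
    done: bool = False


class CentralRegistry:
    """Hypothetical in-memory stand-in for the central site's registry."""

    def __init__(self, required_fields: Set[str], claim_rules: Dict[str, Set[str]]):
        self.required_fields = set(required_fields)
        self.claim_rules = claim_rules  # compute site -> workflows it may claim
        self.samples: Dict[str, dict] = {}
        self.jobs: List[Job] = []

    def submit_sample(self, sample_id: str, metadata: dict, workflows: List[str]) -> None:
        # Stand-in for schema validation: reject metadata missing required fields.
        missing = self.required_fields - set(metadata)
        if missing:
            raise ValueError("missing metadata fields: " + ", ".join(sorted(missing)))
        self.samples[sample_id] = metadata
        # Valid metadata creates one claimable job per requested workflow.
        for workflow in workflows:
            self.jobs.append(Job(f"job-{len(self.jobs) + 1}", workflow, sample_id))

    def poll(self, site: str) -> List[Job]:
        # A site only sees unclaimed jobs that its claim rules allow.
        allowed = self.claim_rules.get(site, set())
        return [j for j in self.jobs if j.claimed_by is None and j.workflow in allowed]

    def claim(self, site: str, job_id: str) -> Job:
        for job in self.poll(site):
            if job.job_id == job_id:
                job.claimed_by = site
                return job
        raise LookupError(f"{job_id} is not claimable by {site}")

    def report_done(self, site: str, job_id: str) -> None:
        # Only the site that claimed a job may report it as finished.
        for job in self.jobs:
            if job.job_id == job_id and job.claimed_by == site:
                job.done = True
                return
        raise LookupError(f"{job_id} is not claimed by {site}")


# Usage with made-up sites and workflows.
registry = CentralRegistry(
    required_fields={"collection_date", "habitat"},
    claim_rules={"site_a": {"assembly"}, "site_b": {"assembly", "annotation"}},
)
registry.submit_sample(
    "sample-1",
    {"collection_date": "2021-01-01", "habitat": "soil"},
    workflows=["assembly", "annotation"],
)
first = registry.poll("site_a")[0]  # site_a can only see the assembly job
registry.claim("site_a", first.job_id)
registry.report_done("site_a", first.job_id)
```

In this sketch the claim rules live at the central site, so a compute site never has to know what other sites can run; it simply polls and sees only the jobs it is allowed to take.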
FIGURE 2
Code snippets from the metagenomic data workflow that illustrate the WDL best practices listed in this paper. 1: example use of the “import” function (best practice point 1); 2–4: examples of using containers in WDL (best practice points 2–4 and 6); 5–8: examples of avoiding site-specific implementation (best practice point 4); 9: workflow metadata information (best practice point 8). The full workflow code is available from https://github.com/microbiomedata/metaG.
FIGURE 3
Example NMDC workflow metadata. The left panel shows an example JSON output snippet of a MAGSAnalysisActivity, which is a record of one execution of the metagenome-assembled genome (MAG) workflow. It includes generic workflow metadata (start/end time, execution resource) as well as MAG-specific metadata and workflow outputs. The full JSON example is available online (https://github.com/microbiomedata/nmdc-metadata/blob/master/examples/MAGs_activity.json). The right panel shows a visual depiction of the MAGSAnalysisActivity class in the NMDC LinkML schema (https://microbiomedata.github.io/nmdc-schema/MAGsAnalysisActivity/).
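To make the idea of a workflow execution record concrete, the sketch below checks a small JSON record for the generic metadata the caption mentions (start/end time, execution resource) plus a list of outputs. The key names, the example values, and the `check_activity` helper are assumptions made for illustration; they are not taken from the NMDC schema, which should be consulted at the URLs above for the real field definitions.

```python
import json
from datetime import datetime

# Hypothetical key names; the real NMDC schema may name these differently.
REQUIRED_KEYS = {"id", "type", "started_at", "ended_at", "execution_resource", "outputs"}


def check_activity(record: dict) -> list:
    """Return a list of problems found in a workflow activity record."""
    missing = REQUIRED_KEYS - set(record)
    if missing:
        return ["missing keys: " + ", ".join(sorted(missing))]
    problems = []
    start = datetime.fromisoformat(record["started_at"])
    end = datetime.fromisoformat(record["ended_at"])
    if end < start:
        problems.append("ended_at precedes started_at")
    if not record["outputs"]:
        problems.append("no workflow outputs listed")
    return problems


# A made-up record in the spirit of the figure, not copied from it.
example = json.loads("""
{
  "id": "example:activity-1",
  "type": "MAGsAnalysisActivity",
  "started_at": "2021-01-10T00:00:00",
  "ended_at": "2021-01-10T06:30:00",
  "execution_resource": "example-cluster",
  "outputs": ["example:bins.zip"]
}
""")

problems = check_activity(example)
```

A check like this is much weaker than validating against a full schema, but it shows why recording the same generic fields for every workflow run makes records from different sites comparable.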
