Comput Struct Biotechnol J. 2022 Apr 20;20:1914-1924.
doi: 10.1016/j.csbj.2022.04.014. eCollection 2022.

Scalable in-memory processing of omics workflows

Vadim Elisseev et al. Comput Struct Biotechnol J. 2022.

Abstract

We present a proof-of-concept implementation of the in-memory computing paradigm that we use to facilitate the analysis of metagenomic sequencing reads. In doing so, we compare the performance of POSIX™ file systems and key-value storage for omics data, and we show the potential for integrating high-performance computing (HPC) and cloud-native technologies. We show that in-memory key-value storage offers possibilities for improved handling of omics data through more flexible and faster data processing. We envision fully containerized workflows deployed as portable micro-pipelines, with multiple instances working concurrently on the same distributed in-memory storage. To highlight the potential use of this technology for event-driven and real-time data processing, we use a biological case study focused on the growing threat of antimicrobial resistance (AMR). We develop a workflow encompassing bioinformatics and explainable machine learning (ML) to predict the life expectancy of a population based on the microbiome of its sewage, while describing the contribution of AMR to the prediction. We propose that, in the future, performing such analyses in 'real time' would allow us to assess the potential risk to the population based on changes in the AMR profile of the community.

Keywords: Bioinformatics; Cloud; HPC; Key-value store; Machine learning; Metagenomics.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Graphical abstract
Fig. 1. Architecture stack of the environment used to study in-memory workflow processing.
Fig. 2. Example bioinformatic workflow to enable analysis of metagenomic samples from untreated sewage.
Fig. 3. Workflow implementation: handling of a paired-read sample in the proposed architecture. File I/O denotes reads and writes from/to a parallel file system. Socket I/O denotes reads and writes to an external in-memory key-value store.
Fig. 4. Performance profile of the baseline pipeline for different input FASTQ file sizes.
Fig. 5. Schematic for in-memory duplicate removal using an external key-value store.
Fig. 6. Performance profile of the external key-value store duplicate removal filter using DBR APIs (blue) and Hiredis APIs (orange). 1,563,892 FASTQ records of 220 bytes each were used.
Fig. 7. Schematic for in-memory duplicate removal using MPI and a local memory approach.
Fig. 8. Performance profile of the local memory duplicate removal filter.
Fig. 9. Comparison of the performance of ML regression analyses: (a) using the full dataset (223 samples × 7126 AMRs); (b) for the best model from (a), Random Forest, with sequentially reduced numbers of features; (c) using the 223 samples × 30 selected AMRs from (b); (d) true versus predicted values for the test dataset when processed through the Neural Network from (c).
Fig. 10. Local ML model explanation for the 3 countries with the lowest life expectancy (a-c) and the 3 countries with the highest life expectancy (d-f). Rows denote the AMR genes, ranked from top to bottom by feature importance, i.e. impact on the model's predictions. The values on the x-axis denote the SHAP-calculated impact of each AMR gene on the model's prediction of life expectancy, in years, for that sample.
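
The duplicate-removal filter shown schematically in Figs. 5 and 6 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `InMemoryKV` class is a hypothetical in-process stand-in for an external key-value store, its `set_if_absent` method is an assumed "store only if the key is new" primitive, and keying each record on a SHA-1 digest of its sequence line alone is an assumption made for illustration.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Tuple

# A FASTQ record: (header, sequence, separator, quality).
FastqRecord = Tuple[str, str, str, str]


class InMemoryKV:
    """Hypothetical stand-in for an external in-memory key-value store."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def set_if_absent(self, key: bytes, value: bytes) -> bool:
        """Store value under key only if key is new; return True if stored."""
        if key in self._data:
            return False
        self._data[key] = value
        return True


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group non-empty lines into 4-line records (assumes unwrapped FASTQ)."""
    stripped = (line.rstrip("\n") for line in lines)
    non_empty = (line for line in stripped if line)
    # Zipping one iterator with itself four times yields consecutive 4-tuples;
    # an incomplete trailing record is silently dropped.
    yield from zip(non_empty, non_empty, non_empty, non_empty)


def deduplicate(records: Iterable[FastqRecord], store: InMemoryKV) -> Iterator[FastqRecord]:
    """Yield only the first record seen for each distinct sequence."""
    for record in records:
        # Assumption: two records are duplicates if their sequences match.
        key = hashlib.sha1(record[1].encode("ascii")).digest()
        if store.set_if_absent(key, record[0].encode("ascii")):
            yield record
```

Because the store decides membership with a single set-if-absent call, several filter instances could in principle share one store and still keep exactly one copy of each sequence, provided the real store makes that call atomic.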
