Microb Genom. 2023 Jul;9(7):mgen001051.
doi: 10.1099/mgen.0.001051.

Accelerating bioinformatics implementation in public health

Kevin G Libuit et al. Microb Genom. 2023 Jul.

Abstract

We have adopted an open bioinformatics ecosystem to address the challenges of bioinformatics implementation in public health laboratories (PHLs). Bioinformatics implementation for public health requires practitioners to undertake standardized bioinformatic analyses and generate reproducible, validated and auditable results. It is essential that data storage and analysis are scalable, portable and secure, and that implementation of bioinformatics fits within the operational constraints of the laboratory. We address these requirements using Terra, a web-based data analysis platform with a graphical user interface connecting users to bioinformatics analyses without the use of code. We have developed bioinformatics workflows for use with Terra that specifically meet the needs of public health practitioners. These Theiagen workflows perform genome assembly, quality control, and characterization, as well as construction of phylogeny for insights into genomic epidemiology. Additionally, these workflows use open-source containerized software and the WDL workflow language to ensure standardization and interoperability with other bioinformatics solutions, whilst being adaptable by the user. They are all open source and publicly available in Dockstore with the version-controlled code available in public GitHub repositories. They have been written to generate outputs in standardized file formats to allow for further downstream analysis and visualization with separate genomic epidemiology software. Testament to this solution meeting the requirements for bioinformatic implementation in public health, Theiagen workflows have collectively been used for over 5 million sample analyses in the last 2 years by over 90 public health laboratories in at least 40 different countries. Continued adoption of technological innovations and development of further workflows will ensure that this ecosystem continues to benefit PHLs.

Keywords: Terra; bioinformatics; epidemiology; genome; public health; sequencing.

Conflict of interest statement

Many authors are employed by Theiagen Genomics, a for-profit private company. J.R.S., K.G.L. and G.L. are owners of Theiagen Global, a for-profit private company with a reseller agreement with Seqera Labs, the developer of Nextflow Tower.

Figures

Fig. 1.
The open bioinformatics ecosystem. Our ecosystem centres around the use of Terra and the Cloud computing platforms. We have labelled the functions of the primary components of this ecosystem using the same vocabulary as Black et al. GitHub – a code hosting platform for version control and collaboration – hosts the code for WDL workflows and dockerfiles to build Docker images. Docker Hub is a hosted library of standardized, validated container images used by WDL workflows. Terra will retrieve versioned images from Docker Hub during WDL workflow execution. Dockstore is an open registry for sharing interoperable tools and workflows. Terra pulls versioned workflows directly from Dockstore into Terra workspaces. Terra is a workflow orchestration platform that enables workflow management, auditability and validation using sharable secure workspaces. Terra also provides a browser-based portal for easy accessibility and GUI functionality for non-bioinformatics scientists. Terra is hosted on the Google Cloud platform and Microsoft Azure. These provide a secure, scalable, distributable and inexpensive computing resource.
Fig. 2.
Separation of genomic assembly and characterization from analytics and genomic epidemiology. Workflow results and output files can readily be exported from the Terra ecosystem into institutionally controlled IT environments for integration with epidemiological metadata for further analysis and visualization.
Fig. 3.
TheiaProk and kSNP3 workflow diagrams. Each workflow consists of inputs, processes and outputs represented by panels in the diagram. Dashed lines or parentheses indicate optional processes.
Fig. 4.
Core-genome phylogeny of S. enterica ser. Bareilly samples. Coloured boxes indicate that the genome was considered part of the outbreak in the original publication [85], the presence of given antibiotic resistance genes (ARGs) and plasmid types, and the number of pairwise SNPs between genomes. The phylogenetic tree in Newick format and CSV files containing the ARG and plasmid information were generated by the Theiagen kSNP3 workflow on Terra. The CSV files were merged in Microsoft Excel and then visualized in Phandango [89].
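The caption states that the ARG and plasmid CSV files were merged in Microsoft Excel. For readers who prefer a scripted, repeatable step, the following is a minimal Python sketch of an equivalent merge; it is not the authors' method. The column name `sample`, the column headers and all values are made-up placeholders, and the real workflow outputs may be laid out differently.

```python
import csv
import io


def merge_by_sample(arg_rows, plasmid_rows, key="sample"):
    """Outer-join two lists of row dicts on a shared sample-ID column."""
    merged = {}
    for row in list(arg_rows) + list(plasmid_rows):
        merged.setdefault(row[key], {key: row[key]}).update(row)
    # Collect every column seen, keeping the key column first.
    columns = [key]
    for row in merged.values():
        for name in row:
            if name not in columns:
                columns.append(name)
    return columns, [merged[sample] for sample in sorted(merged)]


# Made-up toy tables standing in for the two exported CSV files (assumption).
arg_csv = "sample,ARG_example\nS1,1\nS2,0\n"
plasmid_csv = "sample,plasmid_example\nS1,0\nS3,1\n"

arg_rows = list(csv.DictReader(io.StringIO(arg_csv)))
plasmid_rows = list(csv.DictReader(io.StringIO(plasmid_csv)))
columns, rows = merge_by_sample(arg_rows, plasmid_rows)

# Write one combined table; samples missing from a file get empty cells.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(rows)
merged_csv = out.getvalue()
```

An outer join is used here so that a sample present in only one file still appears in the combined table, which makes gaps visible before visualization.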
Fig. 5.
Number of sample analyses and workflows submitted using Theiagen workflows on Terra per month, from February 2021 to January 2023. The workflows have been grouped based on workflow name and/or functional category (Table S10). Sample-level workflows process individual samples, whereas set-level workflows process sets of samples, for example for comparative analyses. Users can launch multiple sample-level workflows simultaneously, creating a single submission for computation. For set-level workflows, a set of samples is processed in a single submission. Here, the top panel shows the number of sample analyses, each undertaken with a sample-level workflow. The central panel shows the number of submissions for sample-level workflows, which may launch analyses of multiple samples simultaneously. The bottom panel shows the number of submissions for set-level workflows, which is equivalent to the number of sets of samples analysed. The same sample may have been analysed multiple times. The raw data underlying these figures are presented in Tables S8–S10, with code to generate the analyses presented at https://github.com/theiagen/MGen-Theiagen-2023.
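The distinction between the three panels can be illustrated with a short Python sketch. This is not the code from the repository linked above; the record layout `(month, level, number_of_samples)` and every value are hypothetical, chosen only to show how one submission can contribute many sample analyses.

```python
from collections import Counter

# Hypothetical submission records: (month, level, number_of_samples).
submissions = [
    ("2021-02", "sample", 12),
    ("2021-02", "sample", 3),
    ("2021-02", "set", 15),
    ("2021-03", "sample", 40),
    ("2021-03", "set", 40),
    ("2021-03", "set", 8),
]

sample_analyses = Counter()     # top panel: analyses run by sample-level workflows
sample_submissions = Counter()  # central panel: submissions of sample-level workflows
set_submissions = Counter()     # bottom panel: submissions of set-level workflows

for month, level, n_samples in submissions:
    if level == "sample":
        # One submission launches one analysis per sample.
        sample_analyses[month] += n_samples
        sample_submissions[month] += 1
    elif level == "set":
        # A whole set is processed in a single submission.
        set_submissions[month] += 1
```

In this toy data, the two sample-level submissions in the first month account for 15 sample analyses, while each set-level submission counts once regardless of how many samples the set contains.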

References

    1. Armstrong GL, MacCannell DR, Taylor J, Carleton HA, Neuhaus EB, et al. Pathogen Genomics in Public Health. Obstet Gynecol Surv. 2020;75:275–276. doi: 10.1097/01.ogx.0000666232.13540.20.
    2. Köser CU, Ellington MJ, Cartwright EJP, Gillespie SH, Brown NM, et al. Routine use of microbial whole genome sequencing in diagnostic and public health microbiology. PLoS Pathog. 2012;8:e1002824. doi: 10.1371/journal.ppat.1002824.
    3. Kwong JC, McCallum N, Sintchenko V, Howden BP. Whole genome sequencing in clinical and public health microbiology. Pathology. 2015;47:199–210. doi: 10.1097/PAT.0000000000000235.
    4. Black A, MacCannell DR, Sibley TR, Bedford T. Ten recommendations for supporting open pathogen genomic analysis in public health. Nat Med. 2020;26:832–841. doi: 10.1038/s41591-020-0935-z.
    5. Inzaule SC, Tessema SK, Kebede Y, Ogwell Ouma AE, Nkengasong JN. Genomic-informed pathogen surveillance in Africa: opportunities and challenges. Lancet Infect Dis. 2021;21:e281–e289. doi: 10.1016/S1473-3099(20)30939-7.
