PAIPline: pathogen identification in metagenomic and clinical next generation sequencing samples

Andreas Andrusch¹, Piotr W Dabrowski², Jeanette Klenner¹, Simon H Tausch¹, Claudia Kohl¹, Abdalla A Osman³, Bernhard Y Renard², Andreas Nitsche¹

Affiliations

¹ Highly Pathogenic Viruses (ZBS1), Robert Koch Institute, Berlin, Germany.
² Bioinformatics Unit (MF1), Robert Koch Institute, Berlin, Germany.
³ National Public Health Laboratory, Khartoum, Sudan.

PMID: 30423069
PMCID: PMC6129269
DOI: 10.1093/bioinformatics/bty595

Andreas Andrusch et al. Bioinformatics. 2018 Sep 1;34(17):i715-i721. doi: 10.1093/bioinformatics/bty595.

Abstract

Motivation: Next generation sequencing (NGS) has provided researchers with a powerful tool to characterize metagenomic and clinical samples in research and diagnostic settings. NGS offers an open view into samples, which allows pathogens to be detected in an unbiased fashion and without a prior hypothesis about possible causative agents. However, NGS datasets for pathogen detection come with several obstacles, such as a very unfavorable ratio of pathogen to host reads. Besides frequently reporting false positives and irrelevant organisms, such as contaminants, tools are often challenged by samples with low pathogen loads and might not report organisms present below a certain threshold. Furthermore, some metagenomic profiling tools focus on only one particular set of pathogens, for example bacteria.

Results: We present PAIPline, a bioinformatics pipeline specifically designed to address problems associated with detecting pathogens in diagnostic samples. PAIPline particularly focuses on user-friendliness and encapsulates all necessary steps, from preprocessing through resolution of ambiguous reads and filtering to visualization, in a single tool. In contrast to existing tools, PAIPline is more specific while maintaining sensitivity. This is shown in a comparative evaluation in which PAIPline was benchmarked alongside other well-known metagenomic profiling tools on previously published, well-characterized datasets. Additionally, as part of an international cooperation project, PAIPline was applied to an outbreak sample of hemorrhagic fevers of then-unknown etiology. The presented results show that PAIPline can serve as a robust, reliable, user-friendly, adaptable and generalizable stand-alone software for diagnostics from NGS samples and as a stepping stone for further downstream analyses.

Availability and implementation: PAIPline is freely available at https://gitlab.com/rki_bioinformatics/paipline.

Figures

**Fig. 1.**
PAIPline standard workflow: The PAIPline for Automatic Identification of Pathogens. Items colored in green indicate user-adjustable parameters or input. First, raw reads are preprocessed, including filters for read length, base quality and read composition complexity. The processed reads are then mapped against user-designated foreground and background databases. The mappings are matched to remove reads originating from background organisms. All remaining read hits are validated by BLAST using the NCBI nt database. Afterwards, ambiguities are resolved and the final read assignment is set. Organisms of low interest (OLIs) are then masked before the final result is presented.
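
The first steps of this workflow (read filtering and removal of background reads) can be illustrated with a minimal Python sketch. It is not PAIPline's implementation: the function names, the thresholds and the use of Shannon entropy as the complexity measure are assumptions chosen for illustration, and background removal is simplified to dropping every read that has any background hit.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the base composition of a read."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def passes_filters(seq, quals, min_len=50, min_mean_qual=20, min_entropy=1.5):
    """Apply length, base quality and complexity filters to one read.

    All thresholds are hypothetical values, not PAIPline defaults.
    """
    if len(seq) < min_len:
        return False
    if sum(quals) / len(quals) < min_mean_qual:
        return False
    return shannon_entropy(seq) >= min_entropy


def subtract_background(foreground_hits, background_read_ids):
    """Keep foreground hits whose read was not also mapped to the background.

    foreground_hits maps read IDs to reference names; background_read_ids is
    the set of read IDs with a background hit (simplifying assumption).
    """
    return {
        read_id: ref
        for read_id, ref in foreground_hits.items()
        if read_id not in background_read_ids
    }
```

For example, a 60-base read of repeated `ACGT` with uniform quality 30 passes all three filters, whereas a 60-base homopolymer is rejected by the complexity filter.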

**Fig. 2.**
The F-scores on family level for all combinations of samples and benchmarked tools are shown. All tools were run with their default parameters. The transparent bars indicate the mean over all samples processed with that program and mode of operation, whereas light gray sample names indicate no recall. A higher bar generally indicates a better compromise between recall and precision, approximating better real-life performance.

**Fig. 3.**
The F-scores on species level for all combinations of samples and benchmarked tools are shown. All tools were run with their default parameters. The transparent bars indicate the mean over all samples processed with that program and mode of operation, whereas light gray sample names indicate no recall. A higher bar generally indicates a better compromise between recall and precision, approximating better real-life performance.
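
As a reminder of how an F-score trades off recall against precision, the following sketch computes it from the set of taxa a tool reports and the set known to be in the sample. It assumes the balanced F1 variant (the harmonic mean of precision and recall); the captions do not state which variant was used, and the function name and the example taxa in the usage note are hypothetical.

```python
def f_score(reported, expected):
    """Balanced F-score (F1) of reported taxa against the expected taxa."""
    reported, expected = set(reported), set(expected)
    true_positives = len(reported & expected)
    if true_positives == 0:
        # No recall: the tool found none of the expected taxa.
        return 0.0
    precision = true_positives / len(reported)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)
```

A tool that reports two families of which only one is truly present, with one family expected, has a precision of 0.5 and a recall of 1.0, giving an F-score of 2/3. A tool that misses every expected taxon scores 0, matching the "no recall" case marked in the figures.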

**Fig. 4.**
The wall clock times needed to complete each analysis of the given datasets by the benchmarked programs are shown. A higher bar indicates a computationally more expensive or less well parallelized process.
