PLoS Comput Biol. 2018 Jun 25;14(6):e1006277.

doi: 10.1371/journal.pcbi.1006277. eCollection 2018 Jun.

Removing contaminants from databases of draft genomes

Jennifer Lu¹ ², Steven L Salzberg¹ ² ³

Affiliations

¹ Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, United States of America.
² Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, United States of America.
³ Departments of Computer Science and Biostatistics, Johns Hopkins University, Baltimore, MD, United States of America.

PMID: 29939994
PMCID: PMC6034898
DOI: 10.1371/journal.pcbi.1006277

Jennifer Lu et al. PLoS Comput Biol. 2018.

Abstract

Metagenomic sequencing of patient samples is a very promising method for the diagnosis of human infections. Sequencing has the ability to capture all the DNA or RNA from pathogenic organisms in a human sample. However, complete and accurate characterization of the sequence, including identification of any pathogens, depends on the availability and quality of genomes for comparison. Thousands of genomes are now available, and as these numbers grow, the power of metagenomic sequencing for diagnosis should increase. However, recent studies have exposed the presence of contamination in published genomes, which when used for diagnosis increases the risk of falsely identifying the wrong pathogen. To address this problem, we have developed a bioinformatics system for eliminating contamination as well as low-complexity genomic sequences in the draft genomes of eukaryotic pathogens. We applied this software to identify and remove human, bacterial, archaeal, and viral sequences present in a comprehensive database of all sequenced eukaryotic pathogen genomes. We also removed low-complexity genomic sequences, another source of false positives. Using this pipeline, we have produced a database of "clean" eukaryotic pathogen genomes for use with bioinformatics classification and analysis tools. We demonstrate that when attempting to find eukaryotic pathogens in metagenomic samples, the new database provides better sensitivity than one using the original genomes while offering a dramatic reduction in false positives.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Masking procedure.**
A) The original genome is split into overlapping 100 bp pseudo-reads. B) The pseudo-reads are first classified using Kraken against the common contaminating vector sequences and the plant, viral, bacterial, archaeal, human, and mouse RefSeq database. The pseudo-reads are also classified using Kraken against non-human and non-mouse vertebrate RefSeq genomes. C) Bowtie2 is then used to align all pseudo-reads against the human genome. D) All pseudo-reads that were classified in the previous steps are masked out of the original genomes. Any remaining non-masked sequence shorter than 100 bp is also masked. E) Finally, Dustmasker is used to mask additional low-complexity sequences.
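Steps A and D of the masking procedure can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the 50 bp step between pseudo-reads, the use of `N` as the masking character, and the function names `pseudo_reads` and `mask_genome` are assumptions made for illustration (the caption says only that the reads overlap). The Kraken, Bowtie2, and Dustmasker steps are external tools and are represented here only by a caller-supplied set of flagged pseudo-read start positions.

```python
from typing import Iterable, Iterator, Tuple

READ_LEN = 100  # pseudo-read length stated in the caption
STEP = 50       # assumed offset between reads; the caption only says "overlapping"
MIN_KEEP = 100  # unmasked stretches shorter than this are also masked


def pseudo_reads(seq: str, read_len: int = READ_LEN,
                 step: int = STEP) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for overlapping windows covering seq."""
    if len(seq) <= read_len:
        yield 0, seq
        return
    last = len(seq) - read_len
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        starts.append(last)  # make sure the tail of the sequence is covered
    for start in starts:
        yield start, seq[start:start + read_len]


def mask_genome(seq: str, flagged_starts: Iterable[int],
                read_len: int = READ_LEN, min_keep: int = MIN_KEEP) -> str:
    """Mask flagged pseudo-reads with 'N', then mask short leftover runs.

    flagged_starts stands in for the pseudo-reads reported by the external
    classification and alignment steps. Existing 'N' bases count as masked.
    """
    masked = list(seq)
    n = len(masked)
    for start in flagged_starts:
        for i in range(start, min(start + read_len, n)):
            masked[i] = "N"

    # Second pass: mask any unmasked run shorter than min_keep.
    i = 0
    while i < n:
        if masked[i] == "N":
            i += 1
            continue
        j = i
        while j < n and masked[j] != "N":
            j += 1
        if j - i < min_keep:
            for k in range(i, j):
                masked[k] = "N"
        i = j
    return "".join(masked)
```

For example, on a 300 bp sequence, flagging the pseudo-reads starting at positions 50 and 150 masks positions 50 to 249 directly, and the two 50 bp flanks are then masked by the short-run rule, so the whole sequence ends up masked.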

**Fig 2. Masking results.**
Fig 2C provides an overview of sequence lengths for each eukaryotic pathogen genome masked in each step and the sequence lengths of the final cleaned genomes. As low-complexity sequences and vertebrate masked sequences are much smaller compared to the final genome length or human/bacterial/viral/plant/vector sequences, these were additionally plotted in Fig 2A and 2B for each eukaryotic pathogen genome. Low-complexity sequences were masked as a final step as well. Masked sequence lengths are also presented as percentages of the original genome length to show the percent of each genome remaining and the percent masked in each step (Fig 2D). Exact numbers are listed in **S2 Table**.

**Fig 3. Pseudo-read Kraken classifications.**
The above plot shows the 20 eukaryotic pathogen genomes with the greatest numbers of pseudo-reads that Kraken identified as matching foreign species when searching against a database containing bacteria, viruses, archaea, and a limited set of vertebrate genomes. Vertebrate classifications are grouped by common categories, such as fish, birds, rodents, or primates. Primate and rodent numbers do not include human and mouse, which are counted and shown separately. **S3 Table** contains pseudo-read classifications for all eukaryotic pathogen genomes.

**Fig 4. Human/Mouse classified pseudo-reads.**
This plot shows the 20 genomes with the greatest number of pseudo-reads classified as either human or mouse. Perhaps not surprisingly, the mouse strain of malaria, *P. yoelii*, contains a substantial number of contaminant reads from mouse. **S3 Table** contains pseudo-read human and mouse classifications for all eukaryotic pathogen genomes.

**Fig 5. Top 10 species identified in corneal samples per database.**
The non-human reads from the 20 corneal samples were classified against four different Kraken databases: the original EuPathDB (A), EuPathDB-clean (B), RefSeq EuPathDB (C), and the final MicrobeDB (D). The plot above shows the 10 species with the most classified reads per megabase in a single corneal sample.

**Fig 6. Number of classified reads per megabase for five true species/genera compared among four databases across all corneal samples.**
The above plot compares the reads per megabase for the true pathogens in the infected samples and also shows the reads per megabase from those pathogens in the remaining corneal samples. The five true species/genera are *Acanthamoeba* (A), *Aspergillus flavus* (B), *Anncaliia algerae* (C), *Candida albicans/dubliniensis* (D), and *Fusarium* (E). **S7 Table** lists classified reads per megabase for each species for each database.

Grants and funding

R01 GM083873/GM/NIGMS NIH HHS/United States
