Microb Genom. 2020 Jul;6(7):mgen000393.

doi: 10.1099/mgen.0.000393.

Evaluation of methods for detecting human reads in microbial sequencing datasets

Stephen J Bush¹, Thomas R Connor²,³, Tim E A Peto¹,⁴,⁵, Derrick W Crook¹,⁴,⁵, A Sarah Walker¹,⁴,⁵

Affiliations

¹ Nuffield Department of Medicine, University of Oxford, Oxford, UK.
² Organisms and Environment Division, School of Biosciences, Cardiff University, Cardiff, Wales, UK.
³ Public Health Wales, University Hospital of Wales, Cardiff, UK.
⁴ National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, UK.
⁵ National Institute for Health Research Oxford Biomedical Research Centre, Oxford, UK.

PMID: 32558637
PMCID: PMC7478626
DOI: 10.1099/mgen.0.000393

Stephen J Bush et al. Microb Genom. 2020 Jul.

Abstract

Sequencing data from host-associated microbes can often be contaminated by the body of the investigator or research subject. Human DNA is typically removed from microbial reads either by subtractive alignment (dropping all reads that map to the human genome) or by using a read classification tool to predict those of human origin, and then discarding them. To inform best practice guidelines, we benchmarked eight alignment-based and two classification-based methods of human read detection using simulated data from 10 clinically prevalent bacteria and three viruses, into which contaminating human reads had been added. While the majority of methods successfully detected >99 % of the human reads, they were distinguishable by variance. The most precise methods, with negligible variance, were Bowtie2 and SNAP, both of which misidentified few, if any, bacterial reads (and no viral reads) as human. While correctly detecting a similar number of human reads, methods based on taxonomic classification, such as Kraken2 and Centrifuge, could misclassify bacterial reads as human, although the extent of this was species-specific. Among the most sensitive methods of human read detection was BWA, although this also made the greatest number of false positive classifications. Across all methods, the set of human reads not identified as such, although often representing <0.1 % of the total reads, were non-randomly distributed along the human genome with many originating from the repeat-rich sex chromosomes. For viral reads and longer (>300 bp) bacterial reads, the highest performing approaches were classification-based, using Kraken2 or Centrifuge. For shorter (c. 150 bp) bacterial reads, combining multiple methods of human read detection maximized the recovery of human reads from contaminated short read datasets without being compromised by false positives. 
A particularly high-performance approach with shorter bacterial reads was a two-stage classification using Bowtie2 followed by SNAP. Using this approach, we re-examined 11 577 publicly archived bacterial read sets for hitherto undetected human contamination. We were able to extract a sufficient number of reads to call known human SNPs, including those with clinical significance, in 6 % of the samples. These results show that phenotypically distinct human sequence is detectable in publicly archived microbial read datasets.
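The two-stage idea can be sketched in terms of read identifiers. This is a minimal illustration under stated assumptions, not the authors' pipeline: each detector is modelled as a hypothetical function that returns the subset of read IDs it calls human, the second detector is assumed to see only the reads the first left unflagged, and the F-score is assumed to be the usual harmonic mean of precision and recall (the abstract does not define it). All names and toy read IDs are invented for the example.

```python
from typing import Callable, Dict, Iterable, Set

# A detector takes a set of read IDs and returns those it calls human.
# In practice this would wrap an aligner or classifier; here it is a stand-in.
Detector = Callable[[Set[str]], Set[str]]


def two_stage_human_reads(
    read_ids: Iterable[str], first: Detector, second: Detector
) -> Set[str]:
    """Return reads flagged as human by the first detector, plus any of the
    remaining reads flagged by the second detector."""
    all_reads = set(read_ids)
    flagged_first = first(all_reads) & all_reads
    remaining = all_reads - flagged_first
    flagged_second = second(remaining) & remaining
    return flagged_first | flagged_second


def f_score(predicted: Set[str], truth: Set[str]) -> Dict[str, float]:
    """Precision, recall and F-score (harmonic mean) for a set of predictions."""
    tp = len(predicted & truth)   # human reads correctly flagged
    fp = len(predicted - truth)   # microbial reads wrongly flagged
    fn = len(truth - predicted)   # human reads missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: three human reads (h*) and two bacterial reads (b*).
reads = {"h1", "h2", "h3", "b1", "b2"}
truth = {"h1", "h2", "h3"}


def stage_one(ids: Set[str]) -> Set[str]:
    # Hypothetical first detector: finds two of the three human reads.
    return ids & {"h1", "h2"}


def stage_two(ids: Set[str]) -> Set[str]:
    # Hypothetical second detector: recovers the read the first one missed.
    return ids & {"h3"}


combined = two_stage_human_reads(reads, stage_one, stage_two)
metrics = f_score(combined, truth)
```

In this toy case the second stage recovers the one human read missed by the first without flagging any bacterial read, which is the behaviour the abstract attributes to combining methods on shorter reads.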

Keywords: contamination; human; read depletion; read removal.

PubMed Disclaimer

Conflict of interest statement

The authors declare that there are no conflicts of interest.

Figures

**Fig. 1.**
Performance of 12 different methods of identifying human reads within a range of microbial read datasets (comprising 10 bacterial and three viral species sequenced at an average base-level coverage of 10- and 100-fold, respectively, each with between 1 and 10% simulated human contamination, using both 150 and 300 bp reads). All reads were simulated from, and where relevant aligned to, human genome version GRCh38.p12. The subfigures show (a) the percentage of reads per method correctly classified as human, (b) the percentage of human reads not classified as human and (c) the F-score. Note that in order to demonstrate the variance between methods in (b), the y-axis does not have the same scale as that for (a) and (c). Data for this figure are available in Table S2.

**Fig. 2.**
The percentage of reads incorrectly classified as human by nine different methods of human read detection within a range of microbial read datasets, partitioned by species. Data for this figure are available in Table S2 and constitute simulated reads at 10-fold coverage from each of 10 species supplemented with 0–10% human contamination, using both 150 and 300 bp reads. Data from three methods (the aligner SMALT and the classifiers Kraken2 and Centrifuge, each using a human-only database) are not shown. This is because these methods have a very high false positive rate across all species (Fig. 1). Data from viral datasets are not shown because in our simulations no viral read was incorrectly classified as human, by any method (Table S3).

**Fig. 3.**
Performance of 10 different methods of identifying human reads within a range of microbial read datasets (comprising 10 bacterial species sequenced at an average base-level coverage of 10-fold, each with 10 % simulated human contamination). All reads were simulated from, and where relevant aligned to, human genome version GRCh38.p12. Each point represents a simulation replicate, coloured according to method. Points are jittered to allow over-plotting. There is considerable overlap between points as many methods perform equivalently highly when using long reads. Data for this figure are available in Table S3.

**Fig. 4.**
Proportion of human reads not classified as human by nine different methods of human read detection, and their genomic location. Data for this figure are available in Table S4.

**Fig. 5.**
Performance of all two-stage combinations of nine independent methods of identifying human reads within a range of microbial read datasets (72 pairwise combinations; the data comprise 10 bacterial species, each with between 1 and 10% simulated human contamination, using both 150 and 300 bp reads). All reads were simulated from, and where relevant aligned to, human genome version GRCh38.p12. The subfigures show (a) the F-score, and (b) the percentage of human reads not classified as human. Bars in both subfigures are ordered from left to right by increasing variance, and in alphabetical order for methods with equal variance. Bowtie2 + SNAP is indicated on the axis. Data for this figure are available in Table S7.

**Fig. 6.**
Relationship between the number of human reads retained within 11 577 publicly archived bacterial read sets and the number of higher-confidence ‘common’ SNPs called using them (i.e. SNPs previously called by the 1000 Genomes Project in at least one of 26 major human populations, with the alternative allele supported by ≥2 uniquely mapped reads). Human reads were identified by aligning all reads to the human genome using Bowtie2 followed by SNAP. Data for this figure are available in Table S8.
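The figure of 72 pairwise combinations quoted for Fig. 5 matches the number of ordered pairs that can be drawn from nine methods (9 × 8). That suggests "A then B" and "B then A" were treated as distinct combinations, although this reading is an inference from the count rather than something the caption states. A short sketch with placeholder method names (hypothetical, since the caption does not list the nine methods):

```python
from itertools import permutations

# Placeholder labels; the nine methods are not named in the caption.
methods = [f"method_{i}" for i in range(1, 10)]

# Ordered pairs without repetition: (first stage, second stage).
ordered_pairs = list(permutations(methods, 2))

print(len(ordered_pairs))
```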

Grants and funding

HPRU-2012-10041/DH_/Department of Health/United Kingdom
