BMC Bioinformatics. 2017 Dec 1;18(1):533.

doi: 10.1186/s12859-017-1941-0.

Decontaminating eukaryotic genome assemblies with machine learning

Janna L Fierst¹, Duncan A Murdock²

Affiliations

¹ Department of Biological Sciences, University of Alabama, Tuscaloosa, 35487, AL, USA. jlfierst@ua.edu.
² Department of Biological Sciences, University of Alabama, Tuscaloosa, 35487, AL, USA.

PMID: 29191179
PMCID: PMC5709863
DOI: 10.1186/s12859-017-1941-0

Janna L Fierst et al. BMC Bioinformatics. 2017.

Abstract

Background: High-throughput sequencing has made it theoretically possible to obtain high-quality de novo assembled genome sequences, but in practice DNA extracts are often contaminated with sequences from other organisms. Few methods exist for rigorously decontaminating eukaryotic assemblies. Those that do exist filter sequences based on nucleotide similarity to contaminants and risk eliminating sequences from the target organism.

Results: We introduce a novel application of an established machine learning method, a decision tree, that can rigorously classify sequences. The major strength of the decision tree is that it can take any measured feature as input and does not require a priori identification of significant descriptors. We use the decision tree to classify de novo assembled sequences and compare the method to published protocols.
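To illustrate how a decision tree can classify sequences from arbitrary measured features, the following is a minimal sketch in pure Python. It is not the authors' implementation: the Gini splitting rule, the function names (`gini`, `best_split`, `build_tree`, `predict`), the two features (GC content and sequencing coverage) and all numeric values are assumptions chosen for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) that most reduces weighted
    Gini impurity, or None if no split improves on the unsplit node."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, max_depth=3):
    """Grow a tree recursively. Internal nodes are tuples
    (feature, threshold, left, right); leaves are majority labels."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    )

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical training scaffolds: [GC fraction, mean coverage].
train_rows = [
    [0.36, 95.0], [0.38, 102.0], [0.40, 110.0], [0.37, 88.0],
    [0.66, 20.0], [0.62, 15.0], [0.68, 25.0], [0.64, 18.0],
]
train_labels = ["target"] * 4 + ["contaminant"] * 4

tree = build_tree(train_rows, train_labels)
print(predict(tree, [0.39, 100.0]))
print(predict(tree, [0.65, 22.0]))
```

Because each split is chosen from whichever feature reduces impurity most, additional descriptors can be appended to each row without deciding in advance which ones matter.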

Conclusions: A decision tree performs better than existing methods when classifying sequences in eukaryotic de novo assemblies. It is efficient, readily implemented, and accurately identifies target and contaminant sequences. Importantly, a decision tree can be used to classify sequences according to measured descriptors and has potentially many uses in distilling biological datasets.
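Classification performance of this kind is commonly summarized with accuracy, sensitivity and specificity, the three quantities reported in Figs. 3, 4 and 6. The sketch below computes them from true and predicted labels. Treating "target" as the positive class is an assumption made here for illustration, as are the function names and the example labels; the paper may define the positive class differently.

```python
def confusion_counts(truth, predicted, positive="target"):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def _ratio(numerator, denominator):
    """Safe division: returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def summarize(truth, predicted, positive="target"):
    """Return accuracy, sensitivity and specificity as a dict."""
    tp, fp, tn, fn = confusion_counts(truth, predicted, positive)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),   # true positive rate
        "specificity": _ratio(tn, tn + fp),   # true negative rate
    }

# Hypothetical labels for ten scaffolds.
truth = ["target"] * 6 + ["contaminant"] * 4
predicted = (["target"] * 5 + ["contaminant"]
             + ["contaminant"] * 3 + ["target"])

metrics = summarize(truth, predicted)
print(metrics)
```

Under this convention, low sensitivity means target scaffolds are being discarded as contaminants, while low specificity means contaminant scaffolds are being retained in the assembly.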

Keywords: Contamination; DNA sequencing; Genome assembly; High-throughput; Sequence filtering.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
The workflow from raw DNA sequence reads to assembled genome sequence for Anvi’o with CONCOCT binning, Busybee, Blobology, Kraken, and the decision tree. Both Blobology and Kraken required pre-assembly, filtering for target and contaminant reads, and final assembly. The decision tree, Anvi’o and Busybee filtered for target and contaminant scaffolds by constructing models and classifying contiguous sequences after assembly

**Fig. 2**
The top 20 organisms identified in BLAST analysis of the empirical genome sequences for (a) *C. remanei*, (b) *C. latens*, and (c) *A. vaga*. For *C. remanei* the most common BLAST hit was *C. remanei*, followed by two likely contaminants and scaffolds that could not be assigned origin with BLAST. For *C. latens* the most common BLAST hit was the microbial contaminant *S. maltophilia* followed by *C. remanei*, a second contaminant *P. protegens*, and scaffolds that could not be assigned origin. For *A. vaga* the majority of scaffolds could not be assigned origin with BLAST, likely due to the low number of rotifer sequences in public databases

**Fig. 3**
Accuracy, sensitivity and specificity for (a) decision tree and (b) bagging decision tree models. Decision tree models achieved high accuracy, sensitivity and specificity but were influenced by variation in the training dataset. The bagging decision tree model achieved high accuracy, sensitivity and specificity with lower variance between models constructed with different training datasets. For the decision tree models, accuracy, sensitivity and specificity plateaued when >25% of the data were used in training, while the performance of the bagging model plateaued when >40% of the data were used in training

**Fig. 4**
Accuracy, sensitivity and specificity for (a) random forest and (b) boosted decision tree models. Both random forest and boosted decision tree models resulted in high accuracy, sensitivity and specificity but showed non-monotonic responses to the training datasets

**Fig. 5**
GC content and the average per-base sequencing coverage for individual scaffolds in the empirical datasets (a) *C. remanei* training; (b) *C. remanei* full dataset; (c) *C. latens* training; (d) *C. latens* full dataset; (e) *A. vaga* training; and (f) *A. vaga* full dataset. Training datasets with BLAST-identified origins are shown on the left and decision tree bagging model predictions for full datasets are shown on the right with model error

**Fig. 6**
Accuracy, sensitivity and specificity for (a) the decision tree bagging model constructed with 2-8 predictors and (b) Anvi’o with CONCOCT binning and Busybee. Accuracy and sensitivity for the decision tree bagging model plateau with 4 predictors but small increases in specificity resulted from additional predictors. Anvi’o had the highest specificity compared to the decision tree bagging model or Busybee while Busybee had the highest sensitivity

**Fig. 7**
Busybee bin 4 (a) contained primarily scaffolds of *Caenorhabditis* or unknown origin with few microbial contaminants while Busybee bin 3 (b) was a heterogeneous mix of sequences with different origins. The scaffolds in Busybee bin 3 separated by taxonomic origin when visualized by scaffold GC content and sequencing coverage

**Fig. 8**
GC content and average per-base sequencing coverage for the simulated datasets contaminated with microbial DNA. Training datasets are shown on the left and bagging decision tree predictions are shown on the right for a-b) *A. thaliana*; c-d) *C. elegans*; e-f) *D. melanogaster*; and g-h) *T. rubripes*. The microbial genomes were GC-rich relative to the target organisms and a simple decision tree based on GC content and sequencing coverage predicted scaffold origin with low error for each dataset

**Fig. 9**
GC content and average per-base sequencing coverage for the simulated datasets contaminated with *C. albicans* DNA. Training datasets and bagging decision tree predictions are shown for a-b) *A. thaliana*; c-d) *C. elegans*; e-f) *D. melanogaster*; and g-h) *T. rubripes*. *C. albicans* and the target organisms had similar GC contents and the bagging decision tree predictions were based on a complex relationship that included multiple predictors and mRNA data
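The two axes used in Figs. 5, 8 and 9, scaffold GC content and average per-base sequencing coverage, can be computed per scaffold as in the sketch below. This is an illustration under stated assumptions rather than the authors' pipeline: excluding ambiguous bases such as `N` from the GC denominator is a choice made here, coverage is assumed to be available as a list of per-base depths, and the scaffold names and values are hypothetical.

```python
def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T).
    Ambiguous bases such as N are ignored; this is an assumed convention."""
    seq = sequence.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / unambiguous

def mean_coverage(per_base_depths):
    """Average sequencing depth over all positions of a scaffold."""
    if not per_base_depths:
        return 0.0
    return sum(per_base_depths) / len(per_base_depths)

# Hypothetical scaffolds: name -> (sequence, per-base read depths).
scaffolds = {
    "scaffold_1": ("ATGCGCNNAT", [10, 12, 14, 12, 10, 12, 14, 12, 10, 14]),
    "scaffold_2": ("GGCCGGCCAT", [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]),
}

# One feature row per scaffold: [GC fraction, mean coverage].
features = {
    name: [gc_content(seq), mean_coverage(depths)]
    for name, (seq, depths) in scaffolds.items()
}
for name, row in features.items():
    print(name, row)
```

Feature rows of this form are the kind of per-scaffold input a classifier can be trained on, and further predictors can be added as extra columns.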
