BMC Bioinformatics. 2017 Dec 1;18(1):533.
doi: 10.1186/s12859-017-1941-0.

Decontaminating eukaryotic genome assemblies with machine learning

Janna L Fierst et al. BMC Bioinformatics. 2017.

Abstract

Background: High-throughput sequencing has made it theoretically possible to obtain high-quality de novo assembled genome sequences, but in practice DNA extracts are often contaminated with sequences from other organisms. Few methods currently exist for rigorously decontaminating eukaryotic assemblies. Those that do exist filter sequences based on nucleotide similarity to contaminants, which risks eliminating sequences from the target organism.

Results: We introduce a novel application of an established machine learning method, a decision tree, that can rigorously classify sequences. The major strength of the decision tree is that it can take any measured feature as input and does not require a priori identification of significant descriptors. We use the decision tree to classify de novo assembled sequences and compare the method to published protocols.

Conclusions: A decision tree performs better than existing methods when classifying sequences in eukaryotic de novo assemblies. It is efficient, readily implemented, and accurately identifies target and contaminant sequences. Importantly, a decision tree can be used to classify sequences according to measured descriptors and has potentially many uses in distilling biological datasets.

Keywords: Contamination; DNA sequencing; Genome assembly; High-throughput; Sequence filtering.
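
As an illustration of the approach described in the abstract, the following is a minimal sketch of a decision tree that classifies scaffolds as target or contaminant. It is not the authors' implementation. It assumes two predictors, scaffold GC content and average sequencing coverage (both appear in the figure captions), and uses a small set of invented training values. The function names (gini, best_split, build_tree, classify) and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, max_depth=3):
    """Recursively build a tree; leaves are class labels, nodes are tuples."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    )


def classify(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Hypothetical training scaffolds: (GC fraction, mean per-base coverage).
train_rows = [
    (0.36, 42.0), (0.38, 55.0), (0.35, 48.0), (0.37, 60.0),
    (0.66, 8.0), (0.64, 250.0), (0.67, 12.0), (0.62, 300.0),
]
train_labels = ["target"] * 4 + ["contaminant"] * 4

tree = build_tree(train_rows, train_labels)
print(classify(tree, (0.37, 50.0)))
print(classify(tree, (0.65, 20.0)))
```

In this toy dataset a single split on GC content separates the two classes. Real assemblies would generally need more predictors and a labeled training subset, for example scaffolds whose origin was assigned with BLAST.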

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
The workflow from raw DNA sequence reads to assembled genome sequence for Anvi’o with CONCOCT binning, Busybee, Blobology, Kraken, and the decision tree. Both Blobology and Kraken required pre-assembly, filtering for target and contaminant reads, and final assembly. The decision tree, Anvi’o and Busybee filtered for target and contaminant scaffolds by constructing models and classifying contiguous sequences after assembly
Fig. 2
The top 20 organisms identified in BLAST analysis of the empirical genome sequences for (a) C. remanei, (b) C. latens, and (c) A. vaga. For C. remanei the most common BLAST hit was C. remanei, followed by two likely contaminants and scaffolds that could not be assigned an origin with BLAST. For C. latens the most common BLAST hit was the microbial contaminant S. maltophilia, followed by C. remanei, a second contaminant P. protegens, and scaffolds that could not be assigned an origin. For A. vaga the majority of scaffolds could not be assigned an origin with BLAST, likely due to the low number of rotifer sequences in public databases
Fig. 3
Accuracy, sensitivity and specificity for (a) decision tree and (b) bagging decision tree models. Decision tree models achieved high accuracy, sensitivity and specificity but were influenced by variation in the training dataset. The bagging decision tree model achieved high accuracy, sensitivity and specificity with lower variance between models constructed with different training datasets. For the decision tree models, accuracy, sensitivity and specificity plateaued with >25% of the data used in training, while the performance of the bagging model plateaued with >40% of the data used in training
Fig. 4
Accuracy, sensitivity and specificity for (a) random forest and (b) boosted decision tree models. Both random forest and boosted decision tree models resulted in high accuracy, sensitivity and specificity but showed non-monotonic responses to the training datasets
Fig. 5
GC content and the average per-base sequencing coverage for individual scaffolds in the empirical datasets (a) C. remanei training; (b) C. remanei full dataset; (c) C. latens training; (d) C. latens full dataset; (e) A. vaga training; and (f) A. vaga full dataset. Training datasets with BLAST-identified origins are shown on the left and decision tree bagging model predictions for full datasets are shown on the right with model error
Fig. 6
Accuracy, sensitivity and specificity for (a) the decision tree bagging model constructed with 2-8 predictors and (b) Anvi’o with CONCOCT binning and Busybee. Accuracy and sensitivity for the decision tree bagging model plateaued with 4 predictors, but small increases in specificity resulted from additional predictors. Anvi’o had the highest specificity compared to the decision tree bagging model or Busybee, while Busybee had the highest sensitivity
Fig. 7
Busybee bin 4 (a) contained primarily scaffolds of Caenorhabditis or unknown origin with few microbial contaminants, while Busybee bin 3 (b) was a heterogeneous mix of sequences with different origins. The scaffolds in Busybee bin 3 separated by taxonomic origin when visualized by scaffold GC content and sequencing coverage
Fig. 8
GC content and average per-base sequencing coverage for the simulated datasets contaminated with microbial DNA. Training datasets are shown on the left and bagging decision tree predictions are shown on the right for a-b) A. thaliana; c-d) C. elegans; e-f) D. melanogaster; and g-h) T. rubripes. The microbial genomes were GC-rich relative to the target organisms and a simple decision tree based on GC content and sequencing coverage predicted scaffold origin with low error for each dataset
Fig. 9
GC content and average per-base sequencing coverage for the simulated datasets contaminated with C. albicans DNA. Training datasets and bagging decision tree predictions are shown for a-b) A. thaliana; c-d) C. elegans; e-f) D. melanogaster; and g-h) T. rubripes. C. albicans and the target organisms had similar GC contents and the bagging decision tree predictions were based on a complex relationship that included multiple predictors and mRNA data
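
The captions for Figs. 3, 4 and 6 report accuracy, sensitivity and specificity. As a minimal sketch of how these three measures can be computed from true and predicted scaffold labels, the example below treats "target" as the positive class. That choice is an assumption, since the captions do not state which class is treated as positive, and the function names and example labels are hypothetical.

```python
def confusion_counts(truth, predicted, positive="target"):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(truth, predicted, positive="target"):
    """Return accuracy, sensitivity and specificity as a dict."""
    tp, fp, tn, fn = confusion_counts(truth, predicted, positive)
    total = tp + fp + tn + fn
    nan = float("nan")
    return {
        # Fraction of all scaffolds classified correctly.
        "accuracy": (tp + tn) / total if total else nan,
        # Fraction of true positives recovered (true positive rate).
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Fraction of true negatives recovered (true negative rate).
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


# Hypothetical labels for eight scaffolds.
truth = ["target"] * 4 + ["contaminant"] * 4
predicted = [
    "target", "target", "target", "contaminant",
    "contaminant", "contaminant", "contaminant", "target",
]

metrics = summarize(truth, predicted)
print(metrics)
```

With "target" as the positive class, sensitivity measures how much of the target genome is retained and specificity measures how much contamination is removed. Swapping the positive class swaps the two measures.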
