Sci Rep. 2020 Jul 28;10(1):12628.
doi: 10.1038/s41598-020-69245-y.

Convolutional neural networks improve fungal classification

Duong Vu et al. Sci Rep.

Abstract

Sequence classification plays an important role in metagenomics studies. We assess the deep neural network approach for fungal sequence classification, as it has emerged as a successful paradigm for big data classification and clustering. Two deep learning-based classifiers, a convolutional neural network (CNN) and a deep belief network (DBN), were trained using our recently released barcode datasets. Experimental results show that the CNN outperformed both the traditional BLAST classification and the Ribosomal Database Project (RDP) classifier, the most accurate machine learning-based classifier, on datasets in which many of the labels were present in the training datasets. When classifying an independent dataset, the "Top 50 Most Wanted Fungi", the CNN and DBN assigned fewer sequences than BLAST but many more than the RDP classifier. In terms of efficiency, the machine learning classifiers took up to 2 s to classify a test dataset, whereas BLAST took 53 s. The results of this study will enable us to speed up the taxonomic assignment of the fungal barcode sequences generated at our institute, as ~70% of them still need to be validated for public release. In addition, they will help to quickly provide a taxonomic profile for metagenomics samples.

Conflict of interest statement

The author declares no competing interests.

Figures

Figure 1
(A) The proportion of yeast sequences at all taxonomic levels. The smallest ring represents the class level, followed by the order, family, genus and species levels. (B) The variation of the median similarity scores of the yeast groups at all taxonomic levels. (C) The optimal thresholds and the associated best F-measures predicted for all yeast training datasets at all taxonomic levels. (D) Predicting optimal thresholds for the yeast training datasets at the family level using a series of thresholds (between 0.5 and 0.9, with a step of 0.001). (E) Predicting optimal thresholds for the yeast training datasets at the order level using the same series of thresholds (between 0.5 and 0.9, with a step of 0.001). (F) The distribution of the yeast dataset. The sequences are colored by order name: the largest order, Saccharomycetales (2,427), is in green, followed by Tremellales (559) in blue, Sporidiobolales (305) in cyan, Trichosporonales (159) in pink, Filobasidiales (122) in yellow, etc. The numbers in brackets are the numbers of sequences in each group. The coordinates of the sequences were generated using fMLC, and the sequences were visualized using the rgl package in R (https://r-forge.r-project.org/projects/rgl/). (G) The sequences are colored as in (F), except that the sequences of the genus Candida (730) are in red.
Figure 2
The MCCs obtained by different classifiers at different taxonomic levels for k = 4, 6, and 8.
Figure 3
The confusion matrices obtained by all the classifiers at the class level.
Figure 4
The average recall, precision, and F-scores obtained by different classifiers of the ten species and genera that have the largest number of sequences in the training datasets (ranging from 62 to 657 at the genus level, and ranging from 39 to 95 at the species level) when k = 6.
Figure 5
The distribution of the sequences of the yeast dataset together with the 2,024 sequences (in black) of the “Top 50 Most Wanted Fungi”. The 730 sequences in red belong to the genus Candida. The remaining sequences of the largest order, Saccharomycetales (1,697), are in green, followed by Tremellales (559) in blue, Sporidiobolales (305) in cyan, Trichosporonales (159) in pink, Filobasidiales (122) in yellow, etc.
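
The figure captions report MCCs, confusion matrices, and per-group recall, precision, and F-scores. As a rough illustration of how such metrics relate to a confusion matrix, the sketch below derives them in plain Python. It assumes that MCC denotes the Matthews correlation coefficient in its standard multiclass generalization; this record does not state how the authors computed it. The function names and the toy matrix are hypothetical and are not taken from the study.

```python
import math


def per_class_scores(cm):
    """Return (precision, recall, F-score) for each class.

    cm[i][j] is the number of sequences whose true class is i and
    whose predicted class is j (a square confusion matrix).
    """
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        actual = sum(cm[k])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f_score))
    return scores


def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    true_counts = [sum(cm[k]) for k in range(n)]
    pred_counts = [sum(cm[i][k] for i in range(n)) for k in range(n)]
    numerator = correct * total - sum(
        p * t for p, t in zip(pred_counts, true_counts)
    )
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # Convention: return 0.0 when the coefficient is undefined.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical two-class confusion matrix, for illustration only.
    toy = [[3, 1],
           [1, 3]]
    print(per_class_scores(toy))
    print(multiclass_mcc(toy))  # 0.5
```

For a two-class matrix this reduces to the familiar binary MCC, which makes a small matrix like the one above a convenient sanity check.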
