Interdiscip Sci. 2019 Dec;11(4):628-635.
doi: 10.1007/s12539-018-0313-4. Epub 2018 Dec 27.

CNN-MGP: Convolutional Neural Networks for Metagenomics Gene Prediction


Amani Al-Ajlan et al. Interdiscip Sci. 2019 Dec.

Abstract

Accurate gene prediction in metagenomics fragments is computationally challenging because the reads are short and the data are incomplete and fragmented. Most gene-prediction programs extract a large number of features and then apply statistical or supervised classification approaches to predict genes. In this study, we introduce CNN-MGP, a convolutional neural network program for metagenomics gene prediction that predicts genes in metagenomics fragments directly from raw DNA sequences, without manual feature extraction or feature selection. CNN-MGP learns the characteristics of coding and non-coding regions and distinguishes coding from non-coding open reading frames (ORFs). We train 10 CNN models on 10 mutually exclusive datasets defined by pre-defined GC content ranges. We extract ORFs from each fragment, encode them numerically, and feed each one into the CNN model that matches the GC content of its fragment. The output of the CNN is the probability that an ORF encodes a gene. Finally, a greedy algorithm selects the final gene list. Overall, CNN-MGP is effective and achieves 91% accuracy on the testing dataset. CNN-MGP demonstrates that deep learning can predict genes in metagenomics fragments, with accuracy higher than or comparable to state-of-the-art gene-prediction programs that use pre-defined features.
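
The routing and selection steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the GC ranges nor the details of the greedy algorithm, so the bin edges (`HYPOTHETICAL_EDGES`), the probability threshold, and the overlap rule below are all assumptions chosen for illustration.

```python
from bisect import bisect_right


def gc_content(fragment: str) -> float:
    """Fraction of G and C bases in a DNA fragment (0.0 for an empty string)."""
    seq = fragment.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical cut points: nine edges give ten GC ranges, one per model.
# The abstract does not state the real ranges.
HYPOTHETICAL_EDGES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def model_index(gc: float, edges=HYPOTHETICAL_EDGES) -> int:
    """Index of the GC range (and hence of the model) that a fragment falls into."""
    return bisect_right(edges, gc)


def greedy_select(orfs, threshold=0.5, max_overlap=0):
    """Pick ORFs in order of decreasing probability.

    orfs: list of (start, end, probability) with end exclusive.
    Assumed rule: keep an ORF if its probability reaches `threshold` and it
    overlaps every already chosen ORF by at most `max_overlap` bases.
    """
    chosen = []
    for start, end, prob in sorted(orfs, key=lambda o: o[2], reverse=True):
        if prob < threshold:
            break  # all remaining candidates score lower still
        compatible = all(
            min(end, c_end) - max(start, c_start) <= max_overlap
            for c_start, c_end, _ in chosen
        )
        if compatible:
            chosen.append((start, end, prob))
    return sorted(chosen)  # report genes in positional order


if __name__ == "__main__":
    fragment = "ATGCGCGGCTAA"
    print(model_index(gc_content(fragment)))
    print(greedy_select([(0, 90, 0.9), (60, 150, 0.8), (100, 200, 0.7)]))
```

In this sketch the probabilities passed to `greedy_select` would come from whichever CNN model `model_index` selects for the fragment; here they are supplied by hand.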

Keywords: Convolutional neural network; Deep learning; Gene prediction; Metagenomics; ORF.

Figures

Fig. 1
One-hot encoding of a DNA sequence. Each nucleotide is represented as a one-hot vector: A = 1000, T = 0001, C = 0100, and G = 0010.
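
As a small illustration of this encoding, the Python sketch below maps a sequence to a list of four-element vectors using the codes given in the caption. Mapping any other character (for example N) to an all-zero vector is an assumption; the caption does not say how ambiguous bases are handled.

```python
# Codes taken from the Fig. 1 caption: A = 1000, C = 0100, G = 0010, T = 0001.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot_encode(sequence: str) -> list:
    """Return one 4-element vector per nucleotide, in sequence order.

    Assumption: characters other than A, C, G, T become [0, 0, 0, 0].
    """
    unknown = [0, 0, 0, 0]
    return [list(ONE_HOT.get(base, unknown)) for base in sequence.upper()]


if __name__ == "__main__":
    for base, vector in zip("ATCG", one_hot_encode("ATCG")):
        print(base, vector)
```

An ORF of length n therefore becomes an n x 4 matrix, which is the form of input described for the CNN in Fig. 2.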
Fig. 2
CNN-MGP architecture. First, an ORF is encoded numerically using one-hot encoding; the resulting matrix is then fed into the CNN-MGP model that matches its fragment GC content. The CNN-MGP model consists of six layers. The first layer is a convolutional layer with 64 filters and a filter window size of 21. The second layer is a max-pooling layer with a pool size of 2. The third layer is a convolutional layer with 200 filters and a filter window size of 21, and the fourth layer is a max-pooling layer with a pool size of 2. The output is then flattened to a 1D vector and passed to a fully connected layer with 128 neurons. Finally, the output layer produces the gene probability.
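
To make the layer sizes concrete, the following Python sketch walks an input through the layers listed in the caption and reports the shape after each one. The caption does not give the padding, stride, or input length, so the sketch assumes no padding, a stride of 1 for the convolutions, non-overlapping pooling windows, and a hypothetical ORF length of 705 nucleotides.

```python
def conv1d_length(length: int, window: int) -> int:
    """Output length of a 1D convolution, assuming no padding and stride 1."""
    return length - window + 1


def pool_length(length: int, pool: int) -> int:
    """Output length of max pooling with non-overlapping windows."""
    return length // pool


def layer_shapes(input_length: int) -> list:
    """Shapes after each layer named in the Fig. 2 caption."""
    shapes = [("input", (input_length, 4))]      # one-hot matrix, 4 channels
    n = conv1d_length(input_length, 21)
    shapes.append(("conv1", (n, 64)))            # 64 filters, window 21
    n = pool_length(n, 2)
    shapes.append(("pool1", (n, 64)))            # pool size 2
    n = conv1d_length(n, 21)
    shapes.append(("conv2", (n, 200)))           # 200 filters, window 21
    n = pool_length(n, 2)
    shapes.append(("pool2", (n, 200)))           # pool size 2
    shapes.append(("flatten", (n * 200,)))       # 1D vector
    shapes.append(("dense", (128,)))             # fully connected, 128 neurons
    shapes.append(("output", (1,)))              # gene probability
    return shapes


if __name__ == "__main__":
    # 705 is a hypothetical input length, not a value from the paper.
    for name, shape in layer_shapes(705):
        print(f"{name:8s} {shape}")
```

Under these assumptions the flattened vector grows with the ORF length, so a fixed input length (for instance by padding or truncating ORFs) would be needed before the fully connected layer; how CNN-MGP handles this is not stated in the abstract.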
