Interdiscip Sci. 2019 Dec;11(4):628-635.
doi: 10.1007/s12539-018-0313-4. Epub 2018 Dec 27.

CNN-MGP: Convolutional Neural Networks for Metagenomics Gene Prediction

Amani Al-Ajlan et al. Interdiscip Sci. 2019 Dec.

Abstract

Accurate gene prediction in metagenomics fragments is computationally challenging because the reads are short and the data are incomplete and fragmented. Most gene-prediction programs extract a large number of features and then apply statistical or supervised classification approaches to predict genes. In our study, we introduce CNN-MGP, a convolutional neural network program for metagenomics gene prediction that predicts genes in metagenomics fragments directly from raw DNA sequences, without manual feature extraction or feature selection stages. CNN-MGP learns the characteristics of coding and non-coding regions and distinguishes coding from non-coding open reading frames (ORFs). We train 10 CNN models on 10 mutually exclusive datasets based on pre-defined GC content ranges. We extract ORFs from each fragment; the ORFs are then encoded numerically and fed into the appropriate CNN model based on the fragment's GC content. The output of the CNN is the probability that an ORF encodes a gene. Finally, a greedy algorithm selects the final gene list. Overall, CNN-MGP is effective and achieves 91% accuracy on the testing dataset. CNN-MGP demonstrates the ability of deep learning to predict genes in metagenomics fragments, and its accuracy is higher than or comparable to that of state-of-the-art gene-prediction programs that use pre-defined features.
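
The routing and selection steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the GC content ranges or the rules of the greedy algorithm, so everything specific here is an assumption made for illustration: equal-width GC bins, a probability threshold of 0.5, a strict no-overlap rule, and the hypothetical function names `gc_content`, `model_index`, and `select_genes`. It is not the authors' implementation.

```python
def gc_content(fragment):
    """Fraction of G and C bases in a DNA fragment (0.0 for an empty string)."""
    fragment = fragment.upper()
    if not fragment:
        return 0.0
    return sum(1 for base in fragment if base in "GC") / len(fragment)


def model_index(gc, n_models=10):
    """Pick which of the n_models CNNs should score a fragment.

    ASSUMPTION: equal-width GC bins. The abstract only says the ranges
    are pre-defined; it does not give their boundaries.
    """
    return min(int(gc * n_models), n_models - 1)


def select_genes(candidates, threshold=0.5):
    """Greedy selection over (start, end, probability) ORF candidates.

    ASSUMPTION: candidates are visited in order of decreasing probability,
    and one is kept only if it reaches the threshold and does not overlap
    an already kept ORF. Coordinates are half-open: [start, end).
    """
    chosen = []
    for start, end, prob in sorted(candidates, key=lambda c: c[2], reverse=True):
        if prob < threshold:
            break  # all remaining candidates score even lower
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, prob))
    return sorted(chosen)


if __name__ == "__main__":
    fragment = "ATGCGCGCTTAA"  # toy fragment, not real data
    print(model_index(gc_content(fragment)))
    # Toy probabilities standing in for CNN outputs.
    orfs = [(0, 90, 0.9), (60, 150, 0.8), (150, 240, 0.7), (300, 360, 0.2)]
    print(select_genes(orfs))
```

In this toy run the second candidate is dropped because it overlaps a higher-scoring ORF, and the last one is dropped because it falls below the assumed threshold. A real implementation might tolerate a small overlap between neighboring genes.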

Keywords: Convolutional neural network; Deep learning; Gene prediction; Metagenomics; ORF.

Figures

Fig. 1
One-hot Encoding for DNA sequence. Each nucleotide is represented as a one-hot vector: A = 1000, T = 0001, C = 0100, and G = 0010
Fig. 2
CNN-MGP architecture. First, an ORF is encoded numerically using one-hot encoding; the resulting matrix is then fed into the appropriate CNN-MGP model based on the GC content of its fragment. The CNN-MGP model consists of six layers. The first layer is a convolutional layer with 64 filters and a filter window size of 21. The second layer is a max-pooling layer with a pool size of 2. The third layer is a convolutional layer with 200 filters and a filter window size of 21, and the fourth layer is a max-pooling layer with a pool size of 2. The output is then flattened to a 1D vector and passed to a fully connected layer with 128 neurons. Finally, the output layer produces the gene probability.
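
The two captions above can be made concrete with a short Python sketch. The one-hot mapping is taken from the Fig. 1 caption, and the filter counts, window sizes, and pool sizes from the Fig. 2 caption. The captions do not state the padding mode, stride, input length, or how ambiguous bases are handled, so the sketch assumes no padding, a convolution stride of 1, a pooling stride equal to the pool size, and an all-zero vector for any base other than A, C, G, or T. The function names and the example length of 700 are hypothetical.

```python
# Mapping from the Fig. 1 caption: A = 1000, T = 0001, C = 0100, G = 0010.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(orf):
    """Encode an ORF as a list of 4-element one-hot tuples.

    ASSUMPTION: bases outside A/C/G/T (for example N) become all zeros.
    """
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in orf.upper()]


def conv1d_len(n, window, stride=1):
    """Output length of a 1D convolution without padding (assumed)."""
    return (n - window) // stride + 1


def pool1d_len(n, pool):
    """Output length of max pooling with stride equal to the pool size (assumed)."""
    return n // pool


def flattened_size(orf_len):
    """Size of the 1D vector that reaches the 128-neuron fully connected layer."""
    n = conv1d_len(orf_len, 21)   # layer 1: 64 filters, window 21
    n = pool1d_len(n, 2)          # layer 2: max pooling, pool size 2
    n = conv1d_len(n, 21)         # layer 3: 200 filters, window 21
    n = pool1d_len(n, 2)          # layer 4: max pooling, pool size 2
    return n * 200                # flatten: positions x filters of layer 3


if __name__ == "__main__":
    print(one_hot_encode("ATCG"))
    print(flattened_size(700))    # 700 is an arbitrary example length
```

Under these assumptions, a 700-base input shrinks to 680, 340, 320, and then 160 positions across the four layers, so the flattened vector has 160 × 200 = 32,000 entries. A different padding mode would change these numbers.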

Cited by

  • Machine learning applications in RNA modification sites prediction.
    El Allali A, Elhamraoui Z, Daoud R. Comput Struct Biotechnol J. 2021 Sep 29;19:5510-5524. doi: 10.1016/j.csbj.2021.09.025. eCollection 2021. PMID: 34712397 Free PMC article. Review.
  • Analysis of metagenomic data.
    Liu S, Rodriguez JS, Munteanu V, Ronkowski C, Sharma NK, Alser M, Andreace F, Blekhman R, Błaszczyk D, Chikhi R, Crandall KA, Della Libera K, Francis D, Frolova A, Gancz AS, Huntley NE, Jaiswal P, Kosciolek T, Łabaj PP, Łabaj W, Luan T, Mason C, Moustafa AM, Muralidharan HS, Mutlu O, Mansouri Ghiasi N, Rahnavard A, Sun F, Tian S, Tierney BT, Van Syoc E, Vicedomini R, Zackular JP, Zelikovsky A, Zielińska K, Ganda E, Davenport ER, Pop M, Koslicki D, Mangul S. Nat Rev Methods Primers. 2025;5:5. doi: 10.1038/s43586-024-00376-6. Epub 2025 Jan 23. PMID: 40688383 Free PMC article.
  • A toolbox of machine learning software to support microbiome analysis.
    Marcos-Zambrano LJ, López-Molina VM, Bakir-Gungor B, Frohme M, Karaduzovic-Hadziabdic K, Klammsteiner T, Ibrahimi E, Lahti L, Loncar-Turukalo T, Dhamo X, Simeon A, Nechyporenko A, Pio G, Przymus P, Sampri A, Trajkovik V, Lacruz-Pleguezuelos B, Aasmets O, Araujo R, Anagnostopoulos I, Aydemir Ö, Berland M, Calle ML, Ceci M, Duman H, Gündoğdu A, Havulinna AS, Kaka Bra KHN, Kalluci E, Karav S, Lode D, Lopes MB, May P, Nap B, Nedyalkova M, Paciência I, Pasic L, Pujolassos M, Shigdel R, Susín A, Thiele I, Truică CO, Wilmes P, Yilmaz E, Yousef M, Claesson MJ, Truu J, Carrillo de Santa Pau E. Front Microbiol. 2023 Nov 22;14:1250806. doi: 10.3389/fmicb.2023.1250806. eCollection 2023. PMID: 38075858 Free PMC article. Review.
  • Application and Comparison of Supervised Learning Strategies to Classify Polarity of Epithelial Cell Spheroids in 3D Culture.
    Soetje B, Fuellekrug J, Haffner D, Ziegler WH. Front Genet. 2020 Mar 27;11:248. doi: 10.3389/fgene.2020.00248. eCollection 2020. PMID: 32292417 Free PMC article.
  • Genomic language models (gLMs) decode bacterial genomes for improved gene prediction and translation initiation site identification.
    Akotenou G, El Allali A. Brief Bioinform. 2025 Jul 2;26(4):bbaf311. doi: 10.1093/bib/bbaf311. PMID: 40605274 Free PMC article.
