Front Genet. 2020 Oct 27:11:568546.
doi: 10.3389/fgene.2020.568546. eCollection 2020.

PENGUINN: Precise Exploration of Nuclear G-Quadruplexes Using Interpretable Neural Networks

Eva Klimentova et al. Front Genet.

Abstract

G-quadruplexes (G4s) are a class of stable nucleic acid secondary structures known to play a role in a wide spectrum of genomic functions, such as DNA replication and transcription. The classical understanding of G4 structure points to four variable-length guanine tracts joined by variable-length nucleotide stretches. G4 immunoprecipitation and sequencing experiments have produced a large number of highly probable G4-forming genomic sequences. The expense and technical difficulty of these experimental techniques highlight the need for computational approaches to G4 identification. Here, we present PENGUINN, a machine learning method based on convolutional neural networks that learns the characteristics of G4 sequences and accurately predicts G4s, outperforming state-of-the-art methods. We provide both a standalone implementation of the trained model and a web application that can be used to evaluate sequences for their G4 potential.

Keywords: G quadruplex; bioinformatics and computational biology; deep neural network; genomic; imbalanced data classification; machine learning; web application.

Figures

FIGURE 1
(A) Schematic of a typical G-quadruplex structure consisting of four G tracts with a minimum length of three, connected by non-specific loops. (B) PENGUINN convolutional neural network model. (C) Identification of G-quadruplex subsequences via randomized mutation.
FIGURE 2
(A) F1 scores for PENGUINNp (precise), PENGUINNs (sensitive), and the regular expression on datasets with different positive:negative ratios. (B) Precision-recall curves comparing PENGUINN with the best-performing state-of-the-art method, G4detector K_PDS, and the regular expression on datasets with different positive:negative ratios.
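
The regular-expression baseline and the effect of the positive:negative ratio on the F1 score can be illustrated with a minimal Python sketch. It is not the authors' implementation. The pattern encodes the motif from Figure 1A (four G tracts of at least three guanines); the loop length of 1 to 7 nucleotides is an assumption, since the caption only says the loops are non-specific. The names `G4_PATTERN`, `predicts_g4`, `f1_score` and `f1_at_ratio` are hypothetical, and the toy sequences and their labels are invented for illustration only.

```python
import re

# Four G tracts (>= 3 guanines each) joined by three loops.
# ASSUMPTION: loops are 1-7 nt long; the figure caption does not fix a length.
G4_PATTERN = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}", re.IGNORECASE)


def predicts_g4(sequence: str) -> bool:
    """Return True if the sequence contains the canonical motif (forward strand only)."""
    return G4_PATTERN.search(sequence) is not None


def f1_score(labels, predictions) -> float:
    """Harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration (not from the paper's datasets).
POSITIVES = [
    "GGGTTAGGGTTAGGGTTAGGG",
    "GGGGACGGGGTCGGGGAAGGGG",
    "GGTTGGTGTGGTTGG",  # labelled positive here, but missed by the pattern
]
NEGATIVES = [
    "ACGTACGTACGTACGTACGT",
    "TTTTAAAACCCCTTTTAAAA",
    "GGGAGGGAGGGAGGGCCTA",  # labelled negative here, but matched by the pattern
]


def f1_at_ratio(negatives_per_positive: int) -> float:
    """F1 of the regex on a toy dataset with a 1:N positive:negative ratio."""
    sequences = POSITIVES + NEGATIVES * negatives_per_positive
    labels = [True] * len(POSITIVES) + [False] * (
        len(NEGATIVES) * negatives_per_positive
    )
    predictions = [predicts_g4(s) for s in sequences]
    return f1_score(labels, predictions)


if __name__ == "__main__":
    for ratio in (1, 10, 100):
        print(f"pos:neg = 1:{ratio}  F1 = {f1_at_ratio(ratio):.3f}")
```

In this toy setup recall stays constant while false positives grow with the number of negatives, so precision and F1 fall as the dataset becomes more imbalanced. This is why a comparison across several positive:negative ratios is more informative than a single balanced benchmark.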
