Bioinformatics. 2023 Aug 1;39(8):btad457.
doi: 10.1093/bioinformatics/btad457.

LegNet: a best-in-class deep learning model for short DNA regulatory regions

Dmitry Penzar et al. Bioinformatics. 2023.

Abstract

Motivation: The increasing volume of data from high-throughput experiments including parallel reporter assays facilitates the development of complex deep-learning approaches for modeling DNA regulatory grammar.

Results: We introduce LegNet, an EfficientNetV2-inspired convolutional network for modeling short gene regulatory regions. By approaching the sequence-to-expression regression problem as a soft classification task, LegNet secured first place for the autosome.org team in the DREAM 2022 challenge of predicting gene expression from gigantic parallel reporter assays. Using published data, we demonstrate that LegNet outperforms existing models and accurately predicts gene expression per se as well as the effects of single-nucleotide variants. Furthermore, we show how LegNet can be used in the manner of a diffusion network for the rational design of promoter sequences that yield a desired expression level.

Availability and implementation: https://github.com/autosome-ru/LegNet. The GitHub repository includes Jupyter Notebook tutorials and Python scripts under the MIT license to reproduce the results presented in the study.
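
The soft-classification reformulation named in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: the number of bins (`N_BINS`), the spread of the soft label (`SIGMA`), and the helper names `soft_label` and `expected_expression` are assumptions chosen for illustration. The idea shown is that a scalar expression value is turned into a probability vector over expression bins, and a predicted probability vector is turned back into a scalar by taking its expected bin index.

```python
import math
import numpy as np

N_BINS = 18   # assumed number of expression bins (illustrative value)
SIGMA = 0.5   # assumed spread of the soft label (illustrative value)


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def soft_label(expression, n_bins=N_BINS, sigma=SIGMA):
    """Spread a scalar expression value over bins as a probability vector.

    Bin k is assumed to cover [k - 0.5, k + 0.5); the first and last bins
    absorb the tails so that the probabilities sum to one.
    """
    edges = np.arange(n_bins + 1) - 0.5
    cdf = np.array([normal_cdf(e, expression, sigma) for e in edges])
    cdf[0], cdf[-1] = 0.0, 1.0
    return np.diff(cdf)


def expected_expression(probs):
    """Convert bin probabilities back to a scalar: the expected bin index."""
    return float(np.dot(probs, np.arange(len(probs))))


if __name__ == "__main__":
    target = soft_label(7.3)
    print(np.round(target, 3))
    print(expected_expression(target))
```

Under these assumptions, a network would be trained to output the bin probabilities (for example, with a cross-entropy or divergence loss against `soft_label`), and `expected_expression` would recover a continuous prediction for evaluation.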


Conflict of interest statement

None declared.

Figures

Figure 1.
Learning and predicting promoter expression and effects of single-nucleotide variants from massively parallel reporter assays with LegNet. (A) An overall pipeline. The regression task is reformulated as a soft-classification problem that mirrors the original experimental setup, where cells were sorted into different bins depending on reporter protein fluorescence. Bottom: sequence encoding and prediction of the expression bin probabilities with LegNet. (B) Variant effect estimation with LegNet. Both original and mutated promoter sequences are passed separately to the trained neural network. The variant effect is estimated as the difference between the corresponding predictions and compared against the ground-truth experimental data.
Figure 2.
A schematic representation of LegNet application to the rational design of promoters in a cold diffusion framework. The hexagonal binning plot shows the correlation between desired (target) and observed (LegNet-predicted) expression for 110 592 designed promoters; the color scale denotes the number of promoters in a bin; Pearson and Spearman correlation coefficients are shown in the top left corner.
Figure 3.
LegNet accurately predicts promoter expression. (A and B) Prediction of native promoter expression for yeast grown in complex medium (YPD, A) and defined medium (SD-Ura, B); hexagonal binning plots, where the color scale denotes the number of promoters in a bin. (C and D) LegNet prediction performance for native yeast promoter sequences compared to the transformer model of Vaishnav et al.: (C) Pearson correlation between predictions and ground truth; (D) Spearman correlation; note the Y-axis lower limit. Violin plots show bootstrap with n = 10 000. *P<0.001, Silver's dependent correlations test (Silver et al. 2004) for the total data.
Figure 4.
LegNet demonstrates better prediction of variant effects for yeast grown in complex (A and B) and defined (C and D) medium compared to the transformer model of Vaishnav et al. (A and C) Pearson correlation between predictions and ground truth; (B and D) Spearman correlation; note the Y-axis lower limit. Violin plots show bootstrap with n = 10 000. *P<0.0001, Silver's dependent correlations test (Silver et al. 2004) for the total data.
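
The evaluation steps described in the captions above (variant effect as a difference between two predictions, and bootstrapped correlation against ground truth) can be sketched as follows. This is a hedged illustration, not the authors' code: `toy_predict` is a hypothetical stand-in for a trained network, the sign convention (mutated minus original) is an assumption, and only the Pearson coefficient is bootstrapped here.

```python
import numpy as np


def toy_predict(seq):
    """Hypothetical stand-in for a trained model: scores a sequence by GC fraction."""
    return sum(base in "GC" for base in seq) / len(seq)


def variant_effect(predict, ref_seq, alt_seq):
    """Variant effect as the difference between two predictions.

    Assumed sign convention: prediction for the mutated sequence minus
    prediction for the original sequence.
    """
    return predict(alt_seq) - predict(ref_seq)


def pearson(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def bootstrap_pearson(pred, truth, n_boot=1000, seed=0):
    """Resample (prediction, truth) pairs with replacement and recompute Pearson r."""
    rng = np.random.default_rng(seed)
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    n = len(pred)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # indices drawn with replacement
        stats[i] = pearson(pred[idx], truth[idx])
    return stats


if __name__ == "__main__":
    print(variant_effect(toy_predict, "AAAA", "AGAA"))

    rng = np.random.default_rng(1)
    truth = np.linspace(0.0, 1.0, 50)
    pred = truth + rng.normal(0.0, 0.05, size=50)
    r = bootstrap_pearson(pred, truth, n_boot=200)
    print(r.mean(), np.percentile(r, [2.5, 97.5]))
```

The resulting array of resampled coefficients is the kind of distribution a violin plot would display; a Spearman version would follow the same pattern with a rank correlation in place of `pearson`.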

