Bioinformatics. 2023 Aug 1;39(8):btad457.

doi: 10.1093/bioinformatics/btad457.

LegNet: a best-in-class deep learning model for short DNA regulatory regions

Dmitry Penzar^{1,2,3}, Daria Nogina⁴, Elizaveta Noskova⁴, Arsenii Zinkevich^{1,4}, Georgy Meshcheryakov², Andrey Lando⁵, Abdul Muntakim Rafi⁶, Carl de Boer⁶, Ivan V Kulakovskiy^{1,2,7}

Affiliations

¹ Vavilov Institute of General Genetics, Moscow 119991, Russia.
² Institute of Protein Research, Pushchino 142290, Russia.
³ Institute of Translational Medicine, Pirogov Russian National Research Medical University, Moscow 117997, Russia.
⁴ Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow 119991, Russia.
⁵ Yandex N.V., Moscow 119021, Russia.
⁶ School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
⁷ Laboratory of Regulatory Genomics, Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan 420008, Russia.

PMID: 37490428
PMCID: PMC10400376
DOI: 10.1093/bioinformatics/btad457

Abstract

Motivation: The increasing volume of data from high-throughput experiments including parallel reporter assays facilitates the development of complex deep-learning approaches for modeling DNA regulatory grammar.

Results: Here, we introduce LegNet, an EfficientNetV2-inspired convolutional network for modeling short gene regulatory regions. By approaching the sequence-to-expression regression problem as a soft classification task, LegNet secured first place for the autosome.org team in the DREAM 2022 challenge of predicting gene expression from gigantic parallel reporter assays. Using published data, we demonstrate that LegNet outperforms existing models and accurately predicts both gene expression itself and the effects of single-nucleotide variants. Furthermore, we show how LegNet can be used in a diffusion-network manner for the rational design of promoter sequences that yield a desired expression level.

Availability and implementation: https://github.com/autosome-ru/LegNet. The GitHub repository includes Jupyter Notebook tutorials and Python scripts under the MIT license to reproduce the results presented in the study.


Conflict of interest statement

None declared.

Figures

**Figure 1.**
Learning and predicting promoter expression and effects of single-nucleotide variants from massively parallel reporter assays with LegNet. (A) The overall pipeline. The regression task is reformulated as a soft-classification problem, mirroring the original experimental setup in which cells were sorted into different bins depending on reporter protein fluorescence. Bottom: sequence encoding and prediction of the expression bin probabilities with LegNet. (B) Variant effect estimation with LegNet. The original and mutated promoter sequences are passed separately to the trained neural network. The variant effect is estimated as the difference between the corresponding predictions and compared against the ground truth experimental data.
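The two ideas in this caption, soft-classification targets and variant effects as prediction differences, can be illustrated with a minimal sketch. It is not the authors' implementation: the bin count (`n_bins=18`), the Gaussian kernel and its width (`std=0.5`), the half-bin offset, the sign convention (mutated minus original), and the `predict` callable are all assumptions made for illustration.

```python
import numpy as np
from math import erf, sqrt


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))


def soft_labels(expression, n_bins=18, std=0.5):
    """Turn a scalar expression value into probabilities over expression bins.

    Assumption: bin i covers [i, i + 1) and the measured value is smeared
    with a Gaussian centred on the middle of its bin (expression + 0.5).
    """
    edges = np.arange(n_bins + 1, dtype=float)
    cdf = np.array([normal_cdf(e, expression + 0.5, std) for e in edges])
    probs = np.diff(cdf)
    # Fold the probability mass outside the bin range into the outer bins
    # so that the result always sums to 1.
    probs[0] += cdf[0]
    probs[-1] += 1.0 - cdf[-1]
    return probs


def expected_expression(probs):
    """Collapse bin probabilities back into a single expression estimate."""
    return float(np.dot(probs, np.arange(len(probs))))


def variant_effect(predict, ref_seq, alt_seq):
    """Variant effect as the difference of two independent predictions.

    `predict` is a hypothetical callable mapping a sequence to a scalar
    expression estimate, e.g. a trained network followed by
    `expected_expression`.
    """
    return predict(alt_seq) - predict(ref_seq)


if __name__ == "__main__":
    target = soft_labels(8.0)
    print(target.round(3))
    print(expected_expression(target))
```

Training against such a target would typically use a divergence between the predicted and target bin distributions, while inference reduces the predicted distribution to its expectation.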

**Figure 2.**
A schematic representation of LegNet application to the rational design of promoters in a cold diffusion framework. The hexagonal binning plot shows the correlation between desired (target) and observed (LegNet-predicted) expression for 110 592 designed promoters; the color scale denotes the number of promoters in a bin; Pearson and Spearman correlation coefficients are shown in the top left corner.

**Figure 3.**
LegNet accurately predicts promoter expression. (A and B) Prediction of native promoter expression for yeast grown in complex medium (YPD, A) and defined medium (SD-Ura, B); hexagonal binning plots, where the color scale denotes the number of promoters in a bin. (C and D) LegNet prediction performance for native yeast promoter sequences compared with the transformer model of Vaishnav *et al.*: (C) Pearson correlation between predictions and ground truth; (D) Spearman correlation; note the Y-axis lower limit. Violin plots show bootstrap with n = 10 000. * $P < 0.001$, Silver's dependent correlations test (Silver *et al.* 2004) for the total data.
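The bootstrap distributions behind such violin plots can be sketched as follows. This is an illustrative reconstruction, assuming that test sequences are resampled with replacement and that both correlation coefficients are recomputed on each resample; the function names and the NumPy-only Spearman implementation are choices made here, not taken from the paper.

```python
import numpy as np


def average_ranks(x):
    """Rank values from 1..n, assigning tied values their average rank."""
    x = np.asarray(x)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def bootstrap_correlations(y_true, y_pred, n_boot=10_000, seed=0):
    """Bootstrap Pearson and Spearman correlations between truth and prediction.

    Returns two arrays of length `n_boot`, one value per resample.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    pearson = np.empty(n_boot)
    spearman = np.empty(n_boot)
    for b in range(n_boot):
        # Resample sequence indices with replacement.
        idx = rng.integers(0, n, size=n)
        t, p = y_true[idx], y_pred[idx]
        pearson[b] = np.corrcoef(t, p)[0, 1]
        # Spearman correlation is the Pearson correlation of the ranks.
        spearman[b] = np.corrcoef(average_ranks(t), average_ranks(p))[0, 1]
    return pearson, spearman


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    truth = np.linspace(0.0, 17.0, 200)
    predicted = truth + rng.normal(0.0, 1.0, size=truth.size)
    r, rho = bootstrap_correlations(truth, predicted, n_boot=1_000)
    print(np.percentile(r, [2.5, 50, 97.5]))
    print(np.percentile(rho, [2.5, 50, 97.5]))
```

The significance test named in the caption (Silver's dependent correlations test) is a separate procedure and is not reproduced here.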

**Figure 4.**
LegNet demonstrates better prediction of variant effects for yeast grown in complex (A and B) and defined (C and D) medium compared to the transformer model of Vaishnav *et al.* (A and C) Pearson correlation between predictions and ground truth; (B and D) Spearman correlation; note the Y-axis lower limit. Violin plots show bootstrap with n = 10 000. * $P < 0.0001$, Silver's dependent correlations test (Silver *et al.* 2004) for the total data.


References

1. Avdeyev P, Shi C, Tan Y et al. Dirichlet diffusion score model for biological sequence generation. arXiv, 2023. 10.48550/arXiv.2305.10699.
2. Avsec Z, Agarwal V, Visentin D et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 2021;18:1196–203.
3. Bansal A, Borgnia E, Chu H-M et al. Cold diffusion: inverting arbitrary image transforms without noise. arXiv, 2022. 10.48550/ARXIV.2208.09392.
4. Bello I, Fedus W, Du X et al. Revisiting ResNets: improved training and scaling strategies. arXiv, 2021. 10.48550/ARXIV.2103.07579.
5. Boeva V. Analysis of genomic sequence motifs for deciphering transcription factor binding and transcriptional regulation in eukaryotic cells. Front Genet 2016;7:24.
