Bioinformatics. 2024 Jun 3;40(6):btae340.
doi: 10.1093/bioinformatics/btae340.

SuPreMo: a computational tool for streamlining in silico perturbation using sequence-based predictive models

Ketrin Gjoni et al. Bioinformatics. 2024.

Abstract

Summary: The increasing development of sequence-based machine learning models has raised the demand for manipulating sequences for this application. However, existing approaches to edit and evaluate genome sequences using models have limitations, such as incompatibility with structural variants, challenges in identifying responsible sequence perturbations, and the need for vcf file inputs and phased data. To address these bottlenecks, we present Sequence Mutator for Predictive Models (SuPreMo), a scalable and comprehensive tool for performing and supporting in silico mutagenesis experiments. We then demonstrate how pairs of reference and perturbed sequences can be used with machine learning models to prioritize pathogenic variants or discover new functional sequences.

Availability and implementation: SuPreMo was written in Python, and can be run using only one line of code to generate both sequences and 3D genome disruption scores. The codebase, instructions for installation and use, and tutorials are on the GitHub page: https://github.com/ketringjoni/SuPreMo.
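The reference and perturbed sequence pairs described in the summary can be illustrated with a minimal sketch. This is not SuPreMo's code or interface: the function names, the 0-based half-open coordinates, and the "DEL"/"DUP"/"INV" labels are all assumptions made for illustration. It only shows how a deletion, tandem duplication, or inversion changes a reference sequence.

```python
# Illustrative only: these helpers are hypothetical and are not part of SuPreMo.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def apply_variant(ref: str, start: int, end: int, kind: str) -> str:
    """Return ref with the 0-based, half-open interval [start, end) perturbed.

    kind is one of "DEL" (deletion), "DUP" (tandem duplication),
    or "INV" (inversion, modeled here as a reverse complement).
    """
    if not 0 <= start < end <= len(ref):
        raise ValueError("interval must lie within the reference sequence")
    segment = ref[start:end]
    if kind == "DEL":
        new_segment = ""
    elif kind == "DUP":
        new_segment = segment * 2
    elif kind == "INV":
        new_segment = reverse_complement(segment)
    else:
        raise ValueError(f"unsupported variant type: {kind}")
    return ref[:start] + new_segment + ref[end:]


# Build (reference, perturbed) pairs for a toy sequence.
ref = "ACGTTGCAAT"
pairs = {kind: (ref, apply_variant(ref, 2, 6, kind)) for kind in ("DEL", "DUP", "INV")}
```

Deletions and duplications change the sequence length, so a pipeline feeding a fixed-length model would also need to pad or trim the perturbed sequence; that step is left out of this sketch.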

Conflict of interest statement

None declared.

Figures

Figure 1.
(A) Schematic representation of SuPreMo. SuPreMo generates sequences by incorporating perturbations into the hg38 human reference genome. SuPreMo-Akita applies Akita to those sequences and generates 3D genome disruption scores (effect size of each perturbation) and, optionally, disruption tracks and predicted contact frequency maps. Parameters and outputs are specified. REF: derived from reference allele; ALT: derived from alternate allele; Log(Obs/Exp): log of observed over expected contacts; MSE: mean squared error. (B) Categorization of SVs based on the ability of SuPreMo and other existing tools to incorporate them into a reference genome. SVs that other tools can already process include small indels (green); SVs that only SuPreMo can process include deletions, duplications, inversions, and chromosomal rearrangements (navy); SVs that no tool can process include insertions and copy number variants (CNVs) because the exact sequence is not provided by upstream variant calling pipelines (gray). Datasets are WGS/WES from healthy and disease individuals: a reference cancer cell line (Talsania et al. 2022), a reference genome (Zook et al. 2020), neurodegenerative disease gene sequences (Kaivola et al. 2023), and the 1K Genome Project (Mahmoud et al. 2019).
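The caption lists MSE (mean squared error) among the terms used for disruption scoring. As a rough sketch of that idea, and assuming the score is an element-wise mean squared error between the predicted contact maps for the REF and ALT sequences, it could be computed as below. The function name and the toy maps are hypothetical; no model is run here.

```python
# Illustrative only: mse_disruption is a hypothetical helper, not SuPreMo's API.
def mse_disruption(ref_map, alt_map):
    """Mean squared error between two equally shaped 2D maps (lists of rows)."""
    same_rows = len(ref_map) == len(alt_map)
    if not same_rows or any(len(r) != len(a) for r, a in zip(ref_map, alt_map)):
        raise ValueError("maps must have the same shape")
    squared = [
        (r - a) ** 2
        for ref_row, alt_row in zip(ref_map, alt_map)
        for r, a in zip(ref_row, alt_row)
    ]
    if not squared:
        raise ValueError("maps must not be empty")
    return sum(squared) / len(squared)


# Toy 2x2 maps standing in for predicted contact frequencies.
ref_map = [[0.0, 0.5], [0.5, 0.0]]
alt_map = [[0.0, 0.0], [0.0, 0.0]]
score = mse_disruption(ref_map, alt_map)
```

A larger score means the perturbed map differs more from the reference map, which matches the caption's description of the score as an effect size for each perturbation.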

Cited by

  • De novo structural variants in autism spectrum disorder disrupt distal regulatory interactions of neuronal genes.
    Gjoni K, Ren X, Everitt A, Shen Y, Pollard KS. bioRxiv [Preprint]. 2024 Nov 7:2024.11.06.621353. doi: 10.1101/2024.11.06.621353. PMID: 39574698. Free PMC article. Preprint.
  • Unveiling the Genetic Landscape of Coronary Artery Disease Through Common and Rare Structural Variants.
    Iyer KR, Clarke SL, Guarischi-Sousa R, Gjoni K, Heath AS, Young EP, Stitziel NO, Laurie C, Broome JG, Khan AT, Lewis JP, Xu H, Montasser ME, Ashley KE, Hasbani NR, Boerwinkle E, Morrison AC, Chami N, Do R, Rocheleau G, Lloyd-Jones DM, Lemaitre RN, Bis JC, Floyd JS, Kinney GL, Bowden DW, Palmer ND, Benjamin EJ, Nayor M, Yanek LR, Kral BG, Becker LC, Kardia SLR, Smith JA, Bielak LF, Norwood AF, Min YI, Carson AP, Post WS, Rich SS, Herrington D, Guo X, Taylor KD, Manson JE, Franceschini N, Pollard KS, Mitchell BD, Loos RJF, Fornage M, Hou L, Psaty BM, Young KA, Regan EA, Freedman BI, Vasan RS, Levy D, Mathias RA, Peyser PA, Raffield LM, Kooperberg C, Reiner AP, Rotter JI, Jun G, de Vries PS, Assimes TL. J Am Heart Assoc. 2025 Feb 18;14(4):e036499. doi: 10.1161/JAHA.124.036499. Epub 2025 Feb 14. PMID: 39950338. Free PMC article.
  • Interpreting the CTCF-mediated sequence grammar of genome folding with AkitaV2.
    Smaruj PN, Kamulegeya F, Kelley DR, Fudenberg G. PLoS Comput Biol. 2025 Feb 4;21(2):e1012824. doi: 10.1371/journal.pcbi.1012824. eCollection 2025 Feb. PMID: 39903776. Free PMC article.
  • An integrated view of the structure and function of the human 4D nucleome.
    4D Nucleome Consortium; Dekker J, Oksuz BA, Zhang Y, Wang Y, Minsk MK, Kuang S, Yang L, Gibcus JH, Krietenstein N, Rando OJ, Xu J, Janssens DH, Henikoff S, Kukalev A, Willemin A, Winick-Ng W, Kempfer R, Pombo A, Yu M, Kumar P, Zhang L, Belmont AS, Sasaki T, van Schaik T, Brueckner L, Peric-Hupkes D, van Steensel B, Wang P, Chai H, Kim M, Ruan Y, Zhang R, Quinodoz SA, Bhat P, Guttman M, Zhao W, Chien S, Liu Y, Venev SV, Plewczynski D, Azcarate II, Szabó D, Thieme CJ, Szczepińska T, Chiliński M, Sengupta K, Conte M, Esposito A, Abraham A, Zhang R, Wang Y, Wen X, Wu Q, Yang Y, Liu J, Boninsegna L, Yildirim A, Zhan Y, Chiariello AM, Bianco S, Lee L, Hu M, Li Y, Barnett RJ, Cook AL, Emerson DJ, Marchal C, Zhao P, Park P, Alver BH, Schroeder A, Navelkar R, Bakker C, Ronchetti W, Ehmsen S, Veit A, Gehlenborg N, Wang T, Li D, Wang X, Nicodemi M, Ren B, Zhong S, Phillips-Cremins JE, Gilbert DM, Pollard KS, Alber F, Ma J, Noble WS, Yue F. bioRxiv [Preprint]. 2024 Oct 27:2024.09.17.613111. doi: 10.1101/2024.09.17.613111. PMID: 39484446. Free PMC article. Preprint.

References

    1. Auton A, Brooks LD, Durbin RM, 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 2015;526:68–74.
    2. Agarwal V, Shendure J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep 2020;31:107663.
    3. Avsec Ž, Agarwal V, Visentin D et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 2021a;18:1196–203.
    4. Avsec Ž, Weilert M, Shrikumar A et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 2021b;53:354–66.
    5. Benegas G, Batra SS, Song YS. DNA language models are powerful predictors of genome-wide variant effects. bioRxiv, 2022.08.22.504706, 2023.