Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Jun 3;40(6):btae340.
doi: 10.1093/bioinformatics/btae340.

SuPreMo: a computational tool for streamlining in silico perturbation using sequence-based predictive models

Affiliations

SuPreMo: a computational tool for streamlining in silico perturbation using sequence-based predictive models

Ketrin Gjoni et al. Bioinformatics. .

Abstract

Summary: The increasing development of sequence-based machine learning models has raised the demand for manipulating sequences for this application. However, existing approaches to edit and evaluate genome sequences using models have limitations, such as incompatibility with structural variants, challenges in identifying responsible sequence perturbations, and the need for vcf file inputs and phased data. To address these bottlenecks, we present Sequence Mutator for Predictive Models (SuPreMo), a scalable and comprehensive tool for performing and supporting in silico mutagenesis experiments. We then demonstrate how pairs of reference and perturbed sequences can be used with machine learning models to prioritize pathogenic variants or discover new functional sequences.

Availability and implementation: SuPreMo was written in Python, and can be run using only one line of code to generate both sequences and 3D genome disruption scores. The codebase, instructions for installation and use, and tutorials are on the GitHub page: https://github.com/ketringjoni/SuPreMo.

PubMed Disclaimer

Conflict of interest statement

None declared.

Figures

Figure 1.
Figure 1.
(A) Schematic representation of SuPreMo. SuPreMo generates sequences by incorporating perturbations into the hg38 human reference genome. SuPreMo-Akita applies Akita to those sequences and generates 3D genome disruption scores (effect size of each perturbation) and, optionally, disruption tracks and predicted contact frequency maps. Parameters and outputs are specified. REF: derived from reference allele; ALT: derived from alternate allele; Log(Obs/Exp): log of observed over expected contacts; MSE: mean squared error. (B) Categorization of SVs based on the ability of SuPreMo and other existing tools to incorporate them into a reference genome. SVs that other tools can already process include small indels (green); SVs that only SuPreMo can process include deletions, duplications, inversions, and chromosomal rearrangements (navy); SVs that no tool can process include insertions and copy number variants (CNVs) because the exact sequence is not provided by upstream variant calling pipelines (gray). Datasets are WGS/WES from healthy and disease individuals: a reference cancer cell line (Talsania et al. 2022), a reference genome (Zook et al. 2020), neurodegenerative disease gene sequences (Kaivola et al. 2023), and the 1K Genome Project (Mahmoud et al. 2019).

Update of

References

    1. Auton A, Brooks LD, Durbin RM, 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 2015;526:68–74. - PMC - PubMed
    1. Agarwal V, Shendure J.. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep 2020;31:107663. - PubMed
    1. Avsec Ž, Agarwal V, Visentin D. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 2021a;18:1196–203. - PMC - PubMed
    1. Avsec Ž, Weilert M, Shrikumar A. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 2021b;53:354–66. - PMC - PubMed
    1. Benegas G, Batra SS, Song YS. DNA language models are powerful predictors of genome-wide variant effects. bioRxiv, 2022.08.22.504706, 2023. - PMC - PubMed