2024 Jun 18;25(1):217.
doi: 10.1186/s12859-024-05842-2.

SATIN: a micro and mini satellite mining tool of total genome and coding regions with analysis of perfect repeats polymorphism in coding regions

Carlos Willian Dias Dantas et al. BMC Bioinformatics. 2024.

Abstract

Background: Tandem repeats (TRs) are DNA sequences repeated in tandem that are present in all organisms. One subcategory of TRs is satellite repeats, which are divided into macrosatellites, minisatellites, and microsatellites. The last two are of particular interest because their instability makes them useful for identifying polymorphisms between organisms. Currently, most mining tools focus on simple sequence repeat (SSR) mining, and only a few can identify SSRs in coding regions.

Results: We developed a microsatellite mining software called SATIN (Micro and Mini SATellite IdentificatioN tool), based on a new sliding window algorithm and written in C and Python. It represents a new approach to SSR mining that addresses the limitations of existing tools, particularly in coding region SSR mining. SATIN is available at https://github.com/labgm/SATIN.git. It was shown to be the second fastest among the tools compared for perfect and compound SSR mining. It can identify SSRs in coding regions as well as SSRs with motif sizes larger than 6. Besides SSR mining, SATIN can also analyze SSR polymorphism in coding regions across pre-determined groups and identify SSRs that are differentially abundant among them on a per-gene basis. To validate the tool, we analyzed SSRs from two groups of Escherichia coli (K12 and O157) and compared the results with 5 known SSRs from coding regions. SATIN identified all 5 SSRs among the 237 genes that contained at least one SSR.

Conclusions: SATIN is a novel microsatellite search software that uses an innovative sliding window technique, based on a numerical list for repeat region search, to identify perfect and composite SSRs while generating comprehensible and analyzable outputs. It accepts files in FASTA or GenBank format as input for microsatellite mining and can also identify SSRs in coding regions when given GenBank files. We expect SATIN to help identify potential SSRs for use as genetic markers.

Keywords: Microsatellite; SATIN; Simple sequence repeats.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Example of the SSR search algorithm used in SATIN. A sequence with AT repetitions is converted into a numerical list (L) and then into a multiples list (LC) with a motif size (m) of 2. A search mechanism is then applied to the LC list, considering two neighboring k-mers at a time that repeat in tandem, to identify the SSRs
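The caption above can be illustrated with a minimal Python sketch. It is not SATIN's implementation: the base-to-integer mapping, the way the LC list is built here (a base-4 code for the k-mer starting at each position), and the names `encode`, `kmer_codes`, and `find_perfect_ssrs` are all assumptions made for illustration. Only the general idea, comparing neighboring k-mers of size m that repeat in tandem, comes from the caption.

```python
# Hypothetical encoding; the actual numerical list used by SATIN may differ.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq):
    """Numerical list L: one integer per base (-1 for anything not A/C/G/T)."""
    return [BASE_CODE.get(base, -1) for base in seq.upper()]

def kmer_codes(L, m):
    """Assumed form of LC: base-4 code of the k-mer of length m at each position."""
    codes = []
    for i in range(len(L) - m + 1):
        window = L[i:i + m]
        if -1 in window:
            codes.append(-1)  # k-mer contains an ambiguous base
        else:
            code = 0
            for value in window:
                code = code * 4 + value
            codes.append(code)
    return codes

def find_perfect_ssrs(seq, m, min_repeats):
    """Return (start, motif, repeats) for perfect tandem repeats of motif size m."""
    LC = kmer_codes(encode(seq), m)
    hits = []
    i = 0
    while i < len(LC):
        if LC[i] == -1:
            i += 1
            continue
        repeats = 1
        j = i + m
        # Compare the k-mer at i with the neighboring k-mer m positions ahead.
        while j < len(LC) and LC[j] == LC[i]:
            repeats += 1
            j += m
        if repeats >= min_repeats:
            hits.append((i, seq[i:i + m].upper(), repeats))
            i = j  # resume right after the repeat region
        else:
            i += 1
    return hits

print(find_perfect_ssrs("GGATATATATCC", m=2, min_repeats=3))
```

For the AT example this reports one repeat starting at 0-based position 2 with four copies of the motif. The sketch does not filter motifs that are themselves repeats of a shorter unit (for example AA with m = 2), which a real tool would need to handle.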
Fig. 2
Diagram illustrating the process of counting SSRs in coding regions to identify potential SSR markers after SSR mining. The figure depicts the abundance calculation process, where the SSRs are counted on a per-gene basis. Subsequently, the output is analyzed by an R script that compares the selected SSRs among the previously selected groups (Group1 compared to Group2 in the example above)
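A rough sketch of per-gene counting, under stated assumptions: SSRs are given as hypothetical `(start, motif, repeats)` tuples, genes as 0-based half-open `(start, end)` intervals, and an SSR is counted only when it lies entirely inside a gene. The function name `count_ssrs_per_gene` and these conventions are illustrative and are not taken from SATIN.

```python
from collections import Counter

def count_ssrs_per_gene(ssrs, genes):
    """Count SSRs per (gene, motif) pair.

    ssrs:  list of (start, motif, repeats) tuples (assumed layout)
    genes: dict mapping gene name -> (start, end), 0-based half-open
    """
    counts = Counter()
    for start, motif, repeats in ssrs:
        end = start + len(motif) * repeats
        for gene, (gene_start, gene_end) in genes.items():
            # Count the SSR only if it is fully contained in the gene.
            if gene_start <= start and end <= gene_end:
                counts[(gene, motif)] += 1
    return counts

# Made-up coordinates, for illustration only.
ssrs = [(2, "AT", 4), (20, "GCA", 3), (50, "AT", 5)]
genes = {"geneA": (0, 30), "geneB": (40, 70)}
for (gene, motif), n in sorted(count_ssrs_per_gene(ssrs, genes).items()):
    print(gene, motif, n)
```

Running this on one genome per group would give the kind of per-gene abundance table that a downstream group comparison could consume.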
Fig. 3
Flowchart of the steps for the perfect SSR analysis of the coding regions among different groups of genomes. The first step is shown in Fig. 3, where SSRs are counted on a per-gene basis and saved to a file called “SSR_couting.txt” (abundance file). This file is then analyzed together with a grouping file by an R script to generate results with statistical analyses such as a test for normality (Shapiro–Wilk), the non-parametric Kruskal–Wallis test, parametric ANOVA, Tukey's post hoc test, and a sum of the SSR counts for each gene and SSR. After an SSR has been selected, the user can extract its flanking regions using a script called “extract_seq_from_ssr_gene.py”
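To make one of the tests named above concrete, here is a small sketch of the Kruskal–Wallis H statistic with the standard tie correction, applied to per-group SSR counts. SATIN performs its statistics in R; this Python version, the function name `kruskal_wallis_h`, and the example counts are assumptions for illustration, and it computes only the statistic, not a p-value.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (tie-corrected) for a list of samples."""
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)

    # Assign each distinct value the mean of the 1-based ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        ties = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += ties ** 3 - ties
        i = j

    if len(rank_of) < 2:
        raise ValueError("need at least two distinct values")

    rank_sq = sum(sum(rank_of[v] for v in group) ** 2 / len(group)
                  for group in groups)
    h = 12.0 / (n_total * (n_total + 1)) * rank_sq - 3 * (n_total + 1)
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    return h / correction

# Hypothetical SSR counts for one gene in two groups of genomes.
group1 = [1, 1, 2]
group2 = [2, 3, 3]
print(round(kruskal_wallis_h([group1, group2]), 4))
```

Count data tend to contain many ties, which is why the tie correction is included here; without it the statistic would be understated.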
Fig. 4
Box plot of the processing time for each of 100 genomes with detection of SSRs under the same parameters. The circles above each box plot represent outliers
Fig. 5
Venn diagram comparing the shared or unique SSR (motif) regions based on the output generated by the programs under the same search conditions. The value at the center indicates the number of motifs identified in common by all three software programs. The three values immediately following, in light blue, brown, and purple, indicate the motifs shared between MISA-IMEX, MISA-SATIN, and IMEX-SATIN, respectively. The remaining values represent the motifs uniquely identified by each software: green for MISA, navy blue for IMEX, and red for SATIN
