Plant Methods. 2025 May 16;21(1):61.
doi: 10.1186/s13007-025-01380-x.

SSR_VibraProfiler: a Python package for accurate classification of varieties using SSRs with intra-variety specificity and inter-variety polymorphism

Chenhao Jiang et al. Plant Methods.

Abstract

Background: Simple sequence repeats (SSRs) are widely used as molecular markers; however, traditional development of SSR markers relies heavily on experimental methods. Modern sequencing technology makes it possible to extract SSR characteristics directly from sequencing data and use them for variety identification.

Results: We developed a computational framework for variety identification that treats the presence or absence of each SSR in sequencing data as a numerical characteristic, ignoring specific loci, flanking sequences, and occurrence counts. Subsequent variety identification therefore does not rely on experimental validation but is performed directly on the numerical characteristic matrix. Using a formula, we measure the variance of these numerical characteristics both within and among varieties and select SSRs that exhibit intra-variety specificity and inter-variety polymorphism, forming a binary (0/1) matrix. We use t-SNE (t-distributed Stochastic Neighbor Embedding) to project the matrix onto a two-dimensional plane, followed by K-means clustering of the individuals. The classification performance of the matrix is preliminarily assessed by comparing the cluster labels with the true labels, providing an initial evaluation of its effectiveness in variety detection. Finally, we construct a recognition model based on the SSR matrix and apply it to variety identification. The process is encapsulated in the package SSR_VibraProfiler, which can serve as a tool for constructing an SSR variety DNA fingerprint database. We tested the package on a Rhododendron dataset of 40 individuals from 8 varieties. The accuracy achieved through t-SNE dimensionality reduction and K-means clustering was 100%. Furthermore, leave-one-out cross-validation confirmed the reliability of the method in predicting varieties. The package is freely available at https://github.com/Olcat35412/SSR_VibraProfiler.
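The presence/absence encoding and the SSR selection step can be illustrated with a small sketch. It is not the SSR_VibraProfiler implementation: the abstract does not give the selection formula, so the rule below (keep an SSR only if all individuals of each variety agree on it and at least two varieties differ) is a simplified stand-in, and the sample names, SSR strings, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy input: individual -> (variety label, SSRs detected in its reads).
samples = {
    "A1": ("A", {"(AT)6", "(GA)8", "(CTT)5"}),
    "A2": ("A", {"(AT)6", "(GA)8"}),
    "B1": ("B", {"(AT)6", "(CTT)5", "(TG)7"}),
    "B2": ("B", {"(AT)6", "(TG)7"}),
}


def build_presence_matrix(samples):
    """Return the sorted SSR list and a 0/1 row per individual (1 = SSR present)."""
    ssrs = sorted(set().union(*(found for _, found in samples.values())))
    matrix = {
        name: [1 if ssr in found else 0 for ssr in ssrs]
        for name, (_, found) in samples.items()
    }
    return ssrs, matrix


def select_ssrs(samples, ssrs, matrix):
    """Assumed stand-in criterion, not the paper's formula:
    uniform within every variety, different between at least two varieties."""
    rows_by_variety = defaultdict(list)
    for name, (variety, _) in samples.items():
        rows_by_variety[variety].append(matrix[name])

    selected = []
    for j, ssr in enumerate(ssrs):
        variety_states = []
        for rows in rows_by_variety.values():
            states = {row[j] for row in rows}
            if len(states) != 1:  # individuals of one variety disagree
                break
            variety_states.append(states.pop())
        else:
            # Reached only if every variety was internally consistent.
            if len(set(variety_states)) > 1:
                selected.append(ssr)
    return selected


ssrs, matrix = build_presence_matrix(samples)
selected = select_ssrs(samples, ssrs, matrix)
print(selected)
```

In this toy data, `(AT)6` is present everywhere and so carries no information, `(CTT)5` is inconsistent within both varieties, and only the two remaining SSRs separate variety A from variety B. A real criterion would presumably tolerate some within-variety noise rather than demanding exact agreement.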

Conclusion: We introduced SSR_VibraProfiler, a Python package for distinguishing and predicting individual varieties without a reference genome by extracting SSR numerical characteristics from next-generation sequencing data. This tool will contribute to the development, identification, and protection of new varieties.

Keywords: Rhododendron; In silico-based method; SSRs; Variety identification.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
The complete process of the method and SSR_VibraProfiler. a. SSR information collection process. b. SSR selection process. c. Variety prediction process for an unknown individual. d. Classification evaluation of SSRs. The outer black box outlines show the content integrated by each of the three scripts in SSR_VibraProfiler.
Fig. 2
Evaluation of the differentiation effect of the screened SSRs and the results of leave-one-out (LOO) cross-validation. a. ARI evaluation result. Dimensionality reduction and clustering each depend on a random state; we vary them from 0 to 50 and from 0 to 10, respectively. On the x-axis, "a, b" denotes the random state of t-SNE and the random state of clustering, respectively. This panel displays the 8 best results. b. The best clustering result achieved by the SSR matrix after dimensionality reduction with t-SNE and clustering with k-means (corresponding to the highest ARI value of 1 in panel a). Points of the same color were clustered as one variety, and the labels near the points are the true labels. c. Result of LOO cross-validation. The innermost point on each axis represents the individual used for validation. The samples within the model are sorted by their Euclidean distance to the query. The four black dashed lines correspond to the 2, 4, 9, and 18 points closest to the query, respectively.
Fig. 3
Cross-validation results after down-sampling the Rhododendron dataset. a. Cross-validation result at a down-sampling rate of 75%. b. Cross-validation result at a down-sampling rate of 50%. The black arrows point to the individuals with incorrect cross-validation results.
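The two evaluation ideas named in the captions above, comparing cluster labels with true labels and leave-one-out validation with Euclidean distance, can be sketched in plain Python. This is an illustration under assumptions, not the package's code: ARI is taken to mean the adjusted Rand index, the prediction rule is assumed to be a simple nearest-neighbour vote (the record does not state the actual rule), and the function names and toy vectors are hypothetical. It requires Python 3.8+ for `math.comb` and `math.dist`.

```python
import math
from collections import Counter


def adjusted_rand_index(true_labels, cluster_labels):
    """Adjusted Rand index from pair counts; expects at least two items."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, cluster_labels))
    sum_cells = sum(math.comb(c, 2) for c in cells.values())
    sum_rows = sum(math.comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(math.comb(c, 2) for c in Counter(cluster_labels).values())
    expected = sum_rows * sum_cols / math.comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def leave_one_out_accuracy(vectors, labels):
    """Hold out each individual in turn and predict its variety from the
    closest remaining individual (assumed 1-nearest-neighbour rule)."""
    correct = 0
    for i, query in enumerate(vectors):
        nearest = min(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: math.dist(query, vectors[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)


# Hypothetical binary SSR vectors for four individuals of two varieties.
vectors = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
labels = ["A", "A", "B", "B"]

print(adjusted_rand_index(labels, [0, 0, 1, 1]))  # clustering matches the varieties
print(adjusted_rand_index(labels, [0, 1, 0, 1]))  # clustering cuts across the varieties
print(leave_one_out_accuracy(vectors, labels))
```

An ARI of 1 means the clusters reproduce the true varieties exactly, which is how the best result in Fig. 2 is described; values near or below 0 indicate agreement no better than chance.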
