Review
Biophys Rev. 2023 Sep 15;15(5):1367-1378.
doi: 10.1007/s12551-023-01140-y. eCollection 2023 Oct.

Bioinformatics tools for the sequence complexity estimates

Yuriy L Orlov et al. Biophys Rev.

Abstract

We review current methods and bioinformatics tools for estimating text complexity (information and entropy measures). The search for DNA regions with extreme statistical characteristics, such as low complexity regions, is important for biophysical models of chromosome function and gene transcription regulation at genome scale. We discuss complexity profiling for segmentation and delineation of genome sequences, the search for genome repeats and transposable elements, and applications to next-generation sequencing reads. We review the complexity methods and new application fields: analysis of mutation hotspot loci, analysis of short sequencing reads with quality control, and alignment-free genome comparisons. Algorithms implementing various numerical measures of text complexity, including combinatorial and linguistic measures, were developed before the genome sequencing era. A series of tools for estimating sequence complexity use compression approaches, mainly modifications of Lempel-Ziv compression. Most of the tools are available online, providing large-scale services for whole-genome analysis. Novel machine learning applications for classification of complete genome sequences also include sequence compression and complexity algorithms. We present a comparison of the complexity methods on different sequence sets and their applications to the analysis of gene transcription regulatory regions. Furthermore, we discuss approaches to and applications of sequence complexity for proteins. Complexity measures for amino acid sequences can be calculated by the same entropy and compression-based algorithms, but the functional and evolutionary roles of low complexity regions in proteins have specific features that differ from those in DNA. Tools for protein sequence complexity are aimed at protein structural constraints. Low complexity regions in protein sequences have been shown to be evolutionarily conserved and to have important biological and structural functions. Finally, we summarize recent findings in large-scale genome complexity comparison and applications for coronavirus genome analysis.
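
The two families of measures named in the abstract, entropy and compression-based complexity, can be illustrated with a minimal Python sketch. This is not taken from any tool covered by the review: the function names, the default window size, and the choice of an LZ78-style incremental parsing (a simplification of the Lempel-Ziv modifications the abstract refers to) are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the symbol frequencies in seq."""
    if not seq:
        return 0.0
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_profile(seq: str, window: int = 20, step: int = 1) -> list:
    """Entropy of each sliding window; low values flag low complexity regions.

    The window and step defaults are arbitrary illustrative choices.
    """
    return [shannon_entropy(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


def lz_phrase_count(seq: str) -> int:
    """Number of phrases in an LZ78-style incremental parsing of seq.

    Each phrase is the shortest prefix of the remaining text that has not
    already been recorded as a phrase; repetitive text yields fewer phrases.
    """
    phrases = set()
    current = ""
    count = 0
    for symbol in seq:
        current += symbol
        if current not in phrases:
            phrases.add(current)
            count += 1
            current = ""
    # A trailing fragment that matches an existing phrase counts once more.
    return count + (1 if current else 0)


if __name__ == "__main__":
    repeat = "AAAAAAAA"
    mixed = "ACGTACGT"
    print(shannon_entropy(repeat), shannon_entropy(mixed))
    print(lz_phrase_count(repeat), lz_phrase_count(mixed))
    print(entropy_profile("AAAAACGT", window=4))
```

In this sketch a homopolymer run scores lower than a mixed sequence on both measures, which is the property that complexity profiling relies on when scanning a genome for low complexity regions. The same functions accept amino acid strings unchanged, since they operate on arbitrary symbols.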

Keywords: Alignment-free; Bioinformatics; Entropy; Genetic codes; Genome comparison; Genomic rearrangement; Lempel–Ziv compression; Low complexity regions; Online tools; Sequence information; Sequencing artefacts; Text complexity.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Classification of the methods for sequence complexity analysis
