Bioinform Adv. 2024 Nov 28;4(1):vbae191.
doi: 10.1093/bioadv/vbae191. eCollection 2024.

Efficient genome monomer higher-order structure annotation and identification using the GRMhor algorithm

Matko Glunčić et al. Bioinform Adv.

Abstract

Motivation: Tandem monomeric units, integral components of eukaryotic genomes, form higher-order repeat (HOR) structures that play crucial roles in maintaining chromosome integrity and regulating gene expression and protein abundance. Given their significant influence on processes such as evolution, chromosome segregation, and disease, developing a sensitive and automated tool for identifying HORs across diverse genomic sequences is essential.

Results: In this study, we applied the GRMhor (Global Repeat Map hor) algorithm to analyse the centromeric region of chromosome 20 in three individual human genomes, as well as the centromeric regions of three higher primates. In all three human genomes, we identified six distinct HOR arrays. These arrays differed in the number of canonical and variant copies, and in their overall structure, significantly more than would be expected given the 99.9% genetic similarity among humans. Furthermore, our analysis of higher primate genomes revealed entirely different HOR sequences, indicating a much larger genomic divergence between humans and higher primates than previously recognized. These results underscore the suitability of the GRMhor algorithm for studying the specific features of individual genomes, particularly the repetitive monomers in centromere structure, which is essential for proper chromosome segregation during cell division. They also highlight its utility in exploring centromere evolution and other repetitive genomic regions.

Availability and implementation: Source code and example binaries are freely available for download at github.com/gluncic/GRM2023.

Conflict of interest statement

All authors of this article declare that they have no conflicts of interest.

Figures

Figure 1.
Schematic representation of a monomer array and HORs. Each monomer is represented by a single square. Monomers within a HOR unit are labeled m1, m2, …, mn in order of their appearance (from left to right within each HOR). Monomers exhibiting <5% sequence divergence are depicted in the same color and labeled with the same identifier (t1, t2, …). A group of three monomers is sequentially repeated to form a higher-order structure known as a 3mer canonical HOR. HOR2, HOR4, and HOR5 are variant HORs due to the insertion (monomer t4 in HOR2, monomer t2 in HOR5) and deletion (monomer t2 in HOR4) of one monomer.
Figure 2.
Scheme of an example monomer sequence with 16 monomers, illustrating the first step of the algorithm. The array M consists of 16 2D vectors, M = (1,3), (2,3), (3,4), (4,4), (5,4), (6,0), (7,3), (8,3), (9,5), (10,2), (11,2), (12,4), (13,0), (14,1), (15,0), (16,0). Each monomer is represented by a single square. Monomers within a HOR unit are labeled m1, m2, …, mn in order of their appearance (from left to right within each HOR). Monomers exhibiting <5% sequence divergence are depicted in the same color and labeled with the same identifier (t1, t2, …).
Figure 3.
GRM diagrams (a, d, g, k), MD diagrams (b, e, h, l), and higher-order repeat (HOR) schemes (c, f, j, m) for artificial monomer sequences. (a–c) Willard’s canonical alpha satellite HORs. (d–f) Willard’s canonical and variant HORs (n = 12, τ = 12) (12 monomers of 12 different types). (g–j) Cascading alpha satellite canonical HORs (n = 10, τ = 11) (10 monomers of 11 different types). (k–m) Randomly distributed alpha satellite monomers. Period denotes the distance between two similar monomers in monomer units. Index denotes the ordinal number of the monomer in the monomeric array. Monomers within a HOR unit are labeled m1, m2, …, mn in order of their appearance, from left to right within each row and from top to bottom. Each monomer is depicted by a colored box, with distinct colors corresponding to different monomer types. Monomers are organized into columns based on their monomer types: monomer type t1 in the first column, monomer type t2 in the second column, and so forth. The number of columns, that is, the number of different monomer types in the canonical HOR unit, is denoted by τ.
Figure 4.
Global repeat map (GRM) diagram and monomer distance (MD) diagram for tandemly arranged alpha satellite monomers in the complete assemblies of human chromosome 20: (a) T2T-CHM13v2.0 (GCA_009914755.4), (b) Pangenome Reference Consortium HG002 (GCA_018873775.2), and (c) Pangenome Reference Consortium HG01243 (GCA_018873775.2). GRM diagrams: Horizontal axis: GRM periods (in monomer units). Vertical axis: frequency of monomer repeat periods. Identified GRM peaks exhibit periods 2, 4, 6, 8, 10, 11, 16, 18, and 26. The significance of these GRM peaks (HORs or associated subfragment repeats) can be inferred from the MD diagram. MD diagram: Horizontal axis: enumeration of tandemly organized alpha satellite monomers, in sequential order as revealed by GRM analysis of the assembly. Vertical axis: period (the distance between the start of a monomer and the start of the next monomer of the same type; see Fig. 2). Six distinct regions of monomer tandems are denoted by A, B, C, D, E, and F. Additionally, there are sporadic MD points that do not correspond to HORs or their subfragments.
Figure 5.
Ideogram of major alpha satellite HOR arrays in the CHM13v2.0 assembly of human chromosome 20.
Figure 6.
Aligned schemes of the canonical HOR units from all HOR regions in three individual assemblies of human chromosome 20. (a) 8mer HOR in region A. (b) 16mer HOR in region B. (c) 11mer HOR in region C. (d) 8mer HOR in region D. (e) 18mer HOR in region E. (f) 26mer HOR in region F. The first row in each panel shows the corresponding HOR consensus from chromosome 20 of the CHM13 genome, the second row from the HG002 genome, and the third row from the HG01243 genome. Monomers within a HOR copy are labeled m1, m2, …, mn in order of their appearance (from left to right within each row and from top to bottom). Each monomer is depicted by a colored box, with distinct colors corresponding to different monomer types. Monomers are organized into columns based on their monomer types: monomer type t1 in the first column, monomer type t2 in the second column, and so forth. The number of columns, that is, the number of different monomer types in the canonical HOR copy, is denoted by τ.
Figure 7.
Global repeat map (GRM) diagram and monomer distance (MD) diagram for tandemly arranged alpha satellite monomers in the complete assemblies of higher primates: (a) Chimpanzee (GCA_028858775.2), (b) Gorilla (GCA_029281585.2), and (c) Orangutan (GCA_028885655.2). GRM diagrams: Horizontal axis: GRM periods (in monomer units). Vertical axis: frequency of monomer repeat periods. The significance of identified GRM peaks (HORs or associated subfragment repeats) can be inferred from the MD diagram. MD diagram: Horizontal axis: enumeration of tandemly organized alpha satellite monomers, in sequential order as revealed by GRM analysis of the assembly. Vertical axis: period (the distance between the start of a monomer and the start of the next monomer of the same type; see Fig. 2).
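
The period values plotted in the MD diagrams, and the period frequencies plotted in the GRM diagrams, can be illustrated with a minimal Python sketch. This is not the GRMhor implementation. It assumes that each monomer has already been assigned a type label, and that a period of 0 marks a monomer with no later monomer of the same type. The label sequence `types` is a hypothetical assignment chosen so that the result matches the array M given in the Fig. 2 caption, and the function names are illustrative only.

```python
from collections import Counter


def monomer_distances(types):
    """Pair each 1-based monomer index with its period.

    The period is the distance, in monomer units, to the next monomer
    of the same type. It is 0 when no such monomer follows (assumption).
    """
    next_position = {}            # type label -> position of its next occurrence
    periods = [0] * len(types)
    # Scan right to left so the nearest later occurrence is always known.
    for i in range(len(types) - 1, -1, -1):
        label = types[i]
        if label in next_position:
            periods[i] = next_position[label] - i
        next_position[label] = i
    return [(i + 1, p) for i, p in enumerate(periods)]


def period_frequencies(md):
    """Count how often each non-zero period occurs."""
    return Counter(p for _, p in md if p > 0)


# Hypothetical type labels for a 16-monomer array (not taken from the paper).
types = ["t1", "t2", "t3", "t1", "t2", "t4", "t3", "t1",
         "t2", "t3", "t1", "t3", "t1", "t2", "t2", "t3"]

md = monomer_distances(types)
freq = period_frequencies(md)

print(md[:6])                # [(1, 3), (2, 3), (3, 4), (4, 4), (5, 4), (6, 0)]
print(sorted(freq.items()))  # [(1, 1), (2, 2), (3, 4), (4, 4), (5, 1)]
```

In this toy array, periods 3 and 4 are the most frequent. Under the assumptions above, these counts are the kind of quantity a GRM diagram shows on its vertical axis, while the (index, period) pairs correspond to points in an MD diagram.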
