Genome Res. 2022 Jun;32(6):1137-1151.

doi: 10.1101/gr.276362.121. Epub 2022 May 11.

Automated annotation of human centromeres with HORmon

Olga Kunyavskaya¹, Tatiana Dvorkina¹, Andrey V Bzikadze², Ivan A Alexandrov¹, Pavel A Pevzner³

Affiliations

¹ Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia, 199034.
² Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, California 92093, USA.
³ Department of Computer Science and Engineering, University of California, San Diego, California 92093, USA.

PMID: 35545449
PMCID: PMC9248890
DOI: 10.1101/gr.276362.121

Olga Kunyavskaya et al. Genome Res. 2022 Jun.

Abstract

Recent advances in long-read sequencing have opened the possibility of addressing long-standing questions about the architecture and evolution of human centromeres. They also emphasized the need for centromere annotation (partitioning human centromeres into monomers and higher-order repeats [HORs]). Although there was a half-century-long series of semi-manual studies of centromere architecture, a rigorous centromere annotation algorithm is still lacking. Moreover, automated centromere annotation is a prerequisite for studies of genetic diseases associated with centromeres and for evolutionary studies of centromeres across multiple species. Although the monomer decomposition (transforming a centromere into a monocentromere written in the monomer alphabet) and the HOR decomposition (representing a monocentromere in the alphabet of HORs) are currently viewed as two separate problems, we show that they should be integrated into a single framework in such a way that HOR (monomer) inference affects monomer (HOR) inference. We thus developed the HORmon algorithm that integrates the monomer/HOR inference and automatically generates the human monomers/HORs that are largely consistent with the previous semi-manual inference.

Figures

**Figure 1.**
The architecture of centromere on Chromosome X. The centromere of Chromosome X (cenX) consists of ∼18,100 monomers of length ≅171 bp each based on the cenX assembly in Bzikadze and Pevzner (2020); the T2T assembly (Nurk et al. 2022) represents a minor change to this assembly. These monomers are organized into ∼1500 units. Five units are colored by five shades of green illustrating unit variations. Each unit is a stacked tandem repeat formed by various monomers. The vast majority of units in cenX correspond to the canonical HOR, which is formed by 12 monomers (shown by 12 different colors). The figure on *top* represents the dot plot of the nucleotide sequence of the canonical HOR that reveals 12 monomers. Although the canonical units are 95%–100% similar, monomers are only 65%–88% similar. In addition to the canonical 12-monomer units, cenX has a small number of partial and auxiliary HORs with varying numbers of monomers.

**Figure 2.**
HORmon pipeline. Given the nucleotide sequence *Centromere* and a consensus alpha satellite sequence *Monomer*, HORmon iteratively launches StringDecomposer (Dvorkina et al. 2020) to partition *Centromere* into monomer blocks. After each launch of StringDecomposer, HORmon launches CentromereArchitect (Dvorkina et al. 2021) to cluster similar monomer blocks into monomers, identify hybrid monomers (represented by a single hybrid D/E of monomers D and E), and transform *Centromere* into the monocentromere Centromere*. Afterward, HORmon uses the generated monocentromere to construct a monomer graph (red edges connect the hybrid monomer D/E with the rest of the monomer graph). To comply with the centromere evolution postulate, HORmon performs split/merge transformations and dehybridizations on the initial monomer set. The orange dotted undirected edge connects similar monomers A and B to indicate that they represent candidates for merging. The breakable monomer D is shown as a dotted vertex to indicate that it is a candidate for splitting into monomers D′ and D′′. The dehybridization substitutes the hybrid vertex D′/E by a single red edge that connects the prefix of D′ with the suffix of E. Split, merge, and dehybridization operations result in a new monomer set and transform *Centromere* into the monocentromere Centromere**. The black cycle in the monomer graph of Centromere* represents the HOR; the purple edge connecting monomers G and C is a low-frequency chord in this cycle. HORmon uses this HOR to generate the HOR decomposition of Centromere** into the canonical (c_F, c_C), partial (p_(A+B)C, p_FG, p_CE), and auxiliary (the single block D′/E) HORs. c_F and c_C refer to traversing the (canonical) HOR starting from monomers F and C, respectively. p_(A+B)C, p_FG, and p_CE refer to partial traversals of the HOR from monomer A+B to C, from F to G, and from C to E, respectively.

**Figure 3.**
The monomer graph of cenX in the CHM13 (*top*) and HG002 (*bottom*) genomes. The monomer graphs of cenX were constructed on the monocentromere that was generated from the monomer sets consisting of two infrequent hybrid monomers (labeled as MX and NX) and 12 frequent canonical monomers (labeled as AX, BX, CX, …, KX, and LX) that contribute to the canonical DXZ1 HOR in cenX (Dvorkina et al. 2021). Small font corresponds to the naming conventions introduced in Shepelev et al. (2015). The hybrid monomers M and N are inferred in Dvorkina et al. (2020). A hybrid monomer formed by frequent monomers X and Y is represented as a bicolored vertex (two colors correspond to the colors of X and Y) and is denoted as (X/Y). Only edges of the monomer graph with multiplicity exceeding 1 are shown (edges with multiplicity exceeding 100 are shown in bold). The cycle formed by bold edges (with multiplicities above 1500) traverses the 12 most frequent monomers that form the canonical cenX HOR.
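The thresholding described in this caption can be illustrated with a minimal Python sketch: keep only edges whose multiplicity exceeds a threshold and check whether they form a single cycle. The function name `heavy_cycle` and the toy edge weights are assumptions made for illustration; they are not taken from HORmon or from the cenX data.

```python
def heavy_cycle(edge_weights, threshold):
    """Return the vertex order of the cycle formed by edges heavier than
    `threshold`, or None if those edges do not form one simple cycle."""
    succ = {}
    for (u, v), w in edge_weights.items():
        if w > threshold:
            if u in succ:          # two heavy out-edges: not a simple cycle
                return None
            succ[u] = v
    if not succ:
        return None
    start = min(succ)
    cycle, cur = [start], succ[start]
    while cur != start:
        if cur not in succ or cur in cycle:
            return None            # dead end, or a loop that skips `start`
        cycle.append(cur)
        cur = succ[cur]
    # every heavy edge must lie on the traced cycle
    return cycle if len(cycle) == len(succ) else None


# Hypothetical 3-monomer example: three heavy edges plus two rare chords.
toy_edges = {
    ("A", "B"): 1500, ("B", "C"): 1498, ("C", "A"): 1499,
    ("C", "B"): 3, ("A", "C"): 2,
}
print(heavy_cycle(toy_edges, 100))  # ['A', 'B', 'C']
print(heavy_cycle(toy_edges, 1))    # None: chords survive the low threshold
```

With a high threshold only the cycle edges remain; with a low one the rare chords are kept and the graph is no longer a single cycle.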

**Figure 4.**
The simplified monomer graphs of human centromeres. The first 23 subfigures contain simplified monomer graphs for all live human centromeres in the CHM13 cell line (centromere ID shown in the subcaption). The 24th subfigure corresponds to the centromere on Chromosome Y in the HG002 genome. In each graph, vertices represent the monomer set *Monomers* of the corresponding *Centromere*. The label of each vertex represents the monomer ID and its count in the monocentromere Centromere* (in parentheses). The IDs of the monomers follow the naming convention introduced in Shepelev et al. (2015). Two monomers are connected by an edge if they are consecutive in monocentromere Centromere*. The weight of an edge connecting monomers M and M′ is defined as the number of times M is followed by M′ in Centromere*. The width of an edge (color of a vertex) reflects its multiplicity (count of a monomer). In each graph, HORmon detects heavy nonoverlapping cycles and paths and removes chords in such cycles (for details, see Methods). The isolated cycles in 18 centromeres (2, 3, 4, 6, 7, 10, 11, 12, 14, 15, 16, 17, 19, 20, 21, 22, X, and Y) represent HORs in these centromeres.
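The vertex counts and edge weights defined in this caption can be computed directly from a monocentromere. The sketch below assumes the monocentromere is given as a list of monomer IDs; the function name `monomer_graph` and the toy input are illustrative assumptions, and the cycle detection and chord removal mentioned in the caption are not reproduced here.

```python
from collections import Counter


def monomer_graph(monocentromere):
    """Build the monomer graph of a monocentromere.

    Returns (vertex_counts, edge_weights), where edge_weights[(M, M2)] is
    the number of times monomer M is immediately followed by M2.
    """
    vertex_counts = Counter(monocentromere)
    edge_weights = Counter(zip(monocentromere, monocentromere[1:]))
    return vertex_counts, edge_weights


# Hypothetical monocentromere over a 3-monomer alphabet.
toy = ["A", "B", "C", "A", "B", "C", "A", "C"]
vertices, edges = monomer_graph(toy)
print(dict(vertices))  # {'A': 3, 'B': 2, 'C': 3}
print(dict(edges))     # {('A', 'B'): 2, ('B', 'C'): 2, ('C', 'A'): 2, ('A', 'C'): 1}
```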

**Figure 5.**
Inferring HORs for cen1, cen9, cen13, and cen18. (First row) Splitting an unbreakable junction monomer in cen1 results in two monomers with an 11-nt difference and transforms the monomer graph of cen1 into a cycle with a single chord. (Second row) The manually inferred HOR of cen9 (McNulty and Sullivan 2018), shown as the blue cycle, is in conflict with the CE postulate because the frequently traversed yellow cycle contains a monomer that does not belong to the blue cycle. (Third row) Splitting an unbreakable junction monomer in cen13 results in two similar monomers with only a 3-nt difference and transforms the monomer graph of cen13 (Fig. 4) into a cycle with a single chord shown on the *left*. The resulting simplified monomer graph (shown on the *right*) reveals the canonical 11-monomer HOR in cen13. (Fourth row) Splitting an unbreakable junction monomer in cen18 results in two monomers with only a single-nucleotide difference and transforms the simplified monomer graph of cen18 (Fig. 4) into a cycle with three chords (shown on the *left*). The resulting simplified monomer graph (shown on the *right*) reveals the canonical 12-monomer HOR in cen18.

**Figure 6.**
Dehybridization substitutes hybrid vertices (monomers) by hybrid edges in the monomer graph. (*Top*) Dehybridization of P5 and R1/5/19 in cen5. (*Bottom*) Dehybridization of L8 in cen8.

**Figure 7.**
Decomposition of cenX into HORs. The 12-monomer HOR for cenX is represented as *M₁…M₁₂* = AB…KL. The monomer set includes these 12 frequent monomers as well as hybrid monomers M (a hybrid of monomers J and H) and N (a hybrid of monomers K and J) identified in Dvorkina et al. (2020). Each occurrence of this HOR that starts from the monomer M_i is labeled as c_i (shown in red). Each occurrence of a partial HOR that includes monomers from i to j is labeled as *p_i,j*. We use the notation c^m (p^m) to denote m consecutive occurrences of a canonical (partial) HOR. The most frequent partial HORs p_3-7, p_7-3, and p_5-2 in cenX are colored in blue, green, and brown, respectively. The HOR decomposition of cenX has length 72 and includes 1486 complete HORs that form 34 HOR runs. Only 257 of 18,089 (1.4%) monomer blocks in cenX are not covered by complete HORs. The “LINE” entry shows the position of the LINE element. To ensure that all monomers are shown in the forward strand, we decompose the reverse complement of cenX and take reverse-complements of all monomers in cenX (Supplemental Note 4).
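The c_i / p_i,j / c^m notation can be made concrete with a small Python sketch. It is a simplified greedy pass written for illustration under stated assumptions, not HORmon's decomposition procedure: monomers are assumed to be numbered 1..n in HOR order, hybrid monomers and the LINE element are ignored, and the function name `hor_decomposition` is hypothetical. The last lines cross-check the counts quoted in the caption.

```python
from itertools import groupby


def hor_decomposition(mono, n):
    """Greedy sketch: split `mono` (monomer indices in 1..n) into maximal runs
    of cyclically consecutive monomers, cut each run into complete HORs (c_i)
    plus a partial remainder (p_i-j), and collapse repeats into (label, m)."""
    blocks = []
    i = 0
    while i < len(mono):
        j = i
        # extend the run while the next monomer is the cyclic successor
        while j + 1 < len(mono) and mono[j + 1] == mono[j] % n + 1:
            j += 1
        run = mono[i:j + 1]
        k = 0
        while len(run) - k >= n:           # complete HOR starting at run[k]
            blocks.append(f"c{run[k]}")
            k += n
        if k < len(run):                   # leftover partial HOR
            blocks.append(f"p{run[k]}-{run[-1]}")
        i = j + 1
    return [(label, len(list(group))) for label, group in groupby(blocks)]


# Hypothetical 4-monomer HOR: two complete HORs, a partial one, one more HOR.
print(hor_decomposition([1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 1, 2, 3, 4], 4))
# [('c1', 2), ('p2-3', 1), ('c1', 1)]

# Cross-check of the counts quoted in the caption.
uncovered = 18089 - 1486 * 12
print(uncovered, round(100 * uncovered / 18089, 1))  # 257 1.4
```

A partial HOR may wrap around the end of the HOR; for example, with n = 12 the run 7, 8, ..., 12, 1, 2, 3 is labeled p7-3 in this sketch.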

**Figure 8.**
Two different “monocentromeres” BABABABABABCBCBCBCBCB and BABCBABCBABCBABCBABCB have identical monomer graphs.
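This claim can be checked directly by counting monomers and consecutive pairs in the two strings. The short Python check below treats each character as one monomer; the helper name `edge_multiset` is an assumption made for illustration.

```python
from collections import Counter


def edge_multiset(monocentromere):
    """Count how often each monomer is immediately followed by each other one."""
    return Counter(zip(monocentromere, monocentromere[1:]))


first = "BABABABABABCBCBCBCBCB"
second = "BABCBABCBABCBABCBABCB"

same_graph = (
    Counter(first) == Counter(second)                   # same vertex counts
    and edge_multiset(first) == edge_multiset(second)   # same edge weights
)
print(first != second, same_graph)  # True True
```

Both strings contain B 11 times and A and C 5 times each, and each of the edges B→A, A→B, B→C, and C→B has weight 5, so the monomer graph alone does not determine the monocentromere.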

Cited by

De novo reconstruction of satellite repeat units from sequence data.
Zhang Y, Chu J, Cheng H, Li H. ArXiv [Preprint]. 2023 Apr 19:arXiv:2304.09729v1. Update in: Genome Res. 2023 Dec 1;33(11):1994-2001. doi: 10.1101/gr.278005.123. PMID: 37131874. Preprint.
Novel Cascade Alpha Satellite HORs in Orangutan Chromosome 13 Assembly: Discovery of the 59mer HOR-The largest Unit in Primates-And the Missing Triplet 45/27/18 HOR in Human T2T-CHM13v2.0 Assembly.
Glunčić M, Vlahović I, Rosandić M, Paar V. Int J Mol Sci. 2024 Jul 11;25(14):7596. doi: 10.3390/ijms25147596. PMID: 39062839.
Novel Concept of Alpha Satellite Cascading Higher-Order Repeats (HORs) and Precise Identification of 15mer and 20mer Cascading HORs in Complete T2T-CHM13 Assembly of Human Chromosome 15.
Glunčić M, Vlahović I, Rosandić M, Paar V. Int J Mol Sci. 2024 Apr 16;25(8):4395. doi: 10.3390/ijms25084395. PMID: 38673983.
Envisioning a new era: Complete genetic information from routine, telomere-to-telomere genomes.
Miga KH, Eichler EE. Am J Hum Genet. 2023 Nov 2;110(11):1832-1840. doi: 10.1016/j.ajhg.2023.09.011. PMID: 37922882. Review.
The Satellite DNA PcH-Sat, Isolated and Characterized in the Limpet Patella caerulea (Mollusca, Gastropoda), Suggests the Origin from a Nin-SINE Transposable Element.
Petraccioli A, Maio N, Carotenuto R, Odierna G, Guarino FM. Genes (Basel). 2024 Apr 25;15(5):541. doi: 10.3390/genes15050541. PMID: 38790169.
