A Census of Tandemly Repeated Polymorphic Loci in Genic Regions Through the Comparative Integration of Human Genome Assemblies

doi:10.3389/fgene.2018.00155

Front Genet. 2018 May 2;9:155.

doi: 10.3389/fgene.2018.00155. eCollection 2018.

Affiliations

¹ Institute for Informatics and Telematics of CNR, Pisa, Italy.
² Department of Health Sciences, University of Eastern Piedmont Amedeo Avogadro, Novara, Italy.
³ Institute for Biomedical Technologies of CNR, Segrate, Italy.
⁴ Department of Science and Technological Innovation, University of Eastern Piedmont Amedeo Avogadro, Novara, Italy.

PMID: 29770143
PMCID: PMC5941971
DOI: 10.3389/fgene.2018.00155

Loredana M Genovese et al. Front Genet. 2018.

Abstract

Polymorphic Tandem Repeats (PTR) are a common form of polymorphism in the human genome. A PTR is a variation, found in an individual or in a population, in the number of repeating units at a Tandem Repeat (TR) locus with respect to the reference genome. Several phenotypic traits and diseases have been found to be strongly associated with, or caused by, specific PTR loci. PTR are further divided into two main classes: Short Tandem Repeats (STR), whose repeating unit is up to 6 base pairs long, and Variable Number Tandem Repeats (VNTR), whose repeating unit is longer than 6 base pairs. As ever larger populations are screened by high-throughput sequencing projects, it becomes technically feasible and desirable to explore the association between PTR and a panoply of such traits and conditions. To facilitate these studies, we have devised a method for compiling catalogs of PTR from assembled genomes, and we have produced a catalog of PTR for genic regions (exons, introns, UTR and adjacent regions) of the human genome (GRCh38). In the first phase, we applied four different TR discovery software tools to uncover 55,223,485 TR (after duplicate removal) in GRCh38; in the second phase, 373,173 of these were determined to be PTR by comparison with five assembled human genomes. Of these PTR, 263,266 are not included in state-of-the-art PTR catalogs. The new methodology is based mainly on a hierarchical and systematic application of alignment-based sequence comparisons to identify and measure the polymorphism of TR. While previous catalogs focus on the class of STR of small total size, we remove any size restrictions, aiming at the more general class of PTR, and we also target fuzzy TR by using specific detection tools. Like previous catalogs of human polymorphic loci, we orient our catalog toward applications in the discovery of disease-associated loci. Validation by cross-referencing with existing catalogs on common clinically relevant loci shows good concordance. Overall, this census of human PTR in genic regions is a shared, web-accessible resource, complementary to existing catalogs, that facilitates future genome-wide studies involving PTR.
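The STR/VNTR distinction given in the abstract depends only on the size of the repeating unit, so it can be illustrated with a minimal Python sketch. The function name `classify_ptr` is hypothetical and is not taken from the paper or its software.

```python
def classify_ptr(motif_size_bp: int) -> str:
    """Classify a polymorphic tandem repeat by the size of its repeating unit.

    Follows the abstract's definition: repeating units of up to 6 bp are
    STR, larger units are VNTR. The function name is illustrative only.
    """
    if motif_size_bp < 1:
        raise ValueError("motif size must be a positive number of base pairs")
    return "STR" if motif_size_bp <= 6 else "VNTR"


if __name__ == "__main__":
    for size in (2, 6, 7, 40):
        print(size, classify_ptr(size))
```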

Keywords: catalog; fuzzy tandem repeats; genic regions; measure of polymorphism; polymorphic tandem repeats; short tandem repeats; tandem repeat detection tools; variable number tandem repeats.

Figures

**Figure 1**
Overall computational pipeline. Green ovals represent the main input data sets, blue rectangles represent computational steps, and the red rectangle represents a key intermediate result. Note that the TR discovery tools are applied only to the reference genome (GRCh38) in the initial part of the pipeline. The target genomes are used in the second part of the pipeline, as input for alignment-based procedures, to measure the polymorphism of the candidate TR loci.

**Figure 2**
A TR may belong to several overlapping transcripts mapped on the reference genome (top of the figure). However, because these transcripts may map to non-overlapping genomic locations in the target genomes (bottom of the figure), a single TR on the reference GRCh38 may be associated with several TR on each target genome.

**Figure 3**
Quality assessment for the measurement of TR expansion/contraction. Sequence S denotes a model TR in the reference genome, and sequence T denotes the corresponding TR in a target genome. Subsequence T₁ matches S₁ in the first alignment, and subsequence T₂ matches S₂ in the second alignment. When S₂ is a prefix of S₁ and T₁ is adjacent to T₂, the match is *high quality* (**Left** drawing). When S₂ is a prefix of S₁ and there is a small gap between T₁ and T₂, the match is *medium quality* (**Middle** drawing). When there is a gap between S₁ and S₂ and a gap between T₁ and T₂, the match is *low quality* (**Right** drawing).
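The three quality classes in this caption can be read as a small decision rule over alignment coordinates. The sketch below is one possible reading, not the authors' implementation: the half-open interval representation, the `small_gap` threshold of 10 bases, treating overlapping target segments as adjacent, and every name in the code are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    start: int  # inclusive coordinate
    end: int    # exclusive coordinate


def is_prefix(inner: Interval, outer: Interval) -> bool:
    """True when `inner` starts where `outer` starts and does not extend past it."""
    return inner.start == outer.start and inner.end <= outer.end


def gap(a: Interval, b: Interval) -> int:
    """Number of bases separating two intervals (0 if they touch or overlap)."""
    first, second = sorted((a, b))
    return max(0, second.start - first.end)


def match_quality(s1: Interval, s2: Interval, t1: Interval, t2: Interval,
                  small_gap: int = 10) -> Optional[str]:
    """Return 'high', 'medium', 'low', or None for cases the caption does not cover.

    `small_gap` is a hypothetical threshold; the caption only says "small gap".
    """
    t_gap = gap(t1, t2)
    if is_prefix(s2, s1):
        if t_gap == 0:
            return "high"
        if t_gap <= small_gap:
            return "medium"
        return None
    if gap(s1, s2) > 0 and t_gap > 0:
        return "low"
    return None


if __name__ == "__main__":
    s1, t1 = Interval(0, 30), Interval(100, 130)
    print(match_quality(s1, Interval(0, 12), t1, Interval(130, 142)))
    print(match_quality(s1, Interval(0, 12), t1, Interval(133, 145)))
    print(match_quality(s1, Interval(40, 52), t1, Interval(140, 152)))
```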

**Figure 4**
Distribution of PTR in the census relative to the computational pipeline. The initial part of the discovery pipeline produces a list of candidate TR by using four TR discovery tools. In the second part, these candidates are classified as polymorphic (and thus included in the output PTR listing) or not polymorphic by using five target genomes. **(A)** Numbers obtained by tracing back each PTR locus to the tool (or tools) that discovered the associated TR candidate. **(B)** Numbers obtained by tracing back each PTR locus to the target genome (or genomes) giving the evidence that led to classifying it as polymorphic. Note that each PTR may be counted in more than one column of subfigures **(A,B)**.

**Figure 5**
Distribution of PTR in the census relative to the computational pipeline. The initial part of the discovery pipeline produces a list of candidate TR by using four TR discovery tools. In the second part, these candidates are classified as polymorphic (and thus included in the PTR listing) or not polymorphic by using five target genomes. **(A)** Numbers obtained by tracing back each PTR locus to the *single* tool that discovered the associated TR candidate, or to multiple tools. **(B)** Numbers obtained by tracing back each PTR locus to the *single* target genome giving the evidence that led to classifying it as polymorphic, or to multiple genomes. Two TR are identified with each other using a Jaccard coefficient threshold j = 0.7. Note that in the pie charts **(A,B)** each PTR is counted in one and only one category.
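To make the Jaccard threshold concrete, the following Python sketch computes a Jaccard coefficient between two loci treated as sets of base positions on the same chromosome. This interval-based definition, the half-open coordinates, the `>=` comparison, and the function names are assumptions; the caption states only the threshold j = 0.7, not the exact matching formula.

```python
def jaccard(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Jaccard coefficient of two half-open intervals [start, end).

    Computed as (bases shared) / (bases covered by either interval).
    Both loci are assumed to lie on the same chromosome.
    """
    intersection = max(0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - intersection
    return intersection / union if union > 0 else 0.0


def same_tr(a: tuple, b: tuple, j: float = 0.7) -> bool:
    """Treat two (start, end) loci as the same TR when their Jaccard coefficient reaches j."""
    return jaccard(*a, *b) >= j


if __name__ == "__main__":
    print(jaccard(100, 200, 110, 200))
    print(same_tr((100, 200), (110, 200)))
    print(same_tr((100, 200), (150, 250)))
```

Under this reading, two loci that differ by 10 bases at one end of a 100-base repeat still match, whereas loci that share only half of their span do not.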

**Figure 6**
**(A)** Average number of PTR per Mbp of genic regions (merging of gene transcripts) in each chromosome. Autosomes are numbered from 1 to 22; the sex chromosomes are X and Y. Chromosomes are sorted by decreasing PTR density. **(B)** Distribution of PTR by number of repeating units (in the reference genome). **(C)** Distribution of PTR by total size of the PTR loci (in the reference genome), on a logarithmic scale. **(D)** Classification of 37 clinically relevant loci as PTR in the census and landscape catalogs.

**Figure 7**
Ratio of the number of PTR over the number of candidate TR (PTR/TR ratio), subdivided into classes characterized by the size of the motif in the range 2–6 (color) and by the number of repeating units in the reference genome (abscissa). For each point in the graph, the corresponding 95% Confidence Interval (CI) is given as a vertical bar.

**Figure 8**
Ratio of the number of PTR over the number of candidate TR (PTR/TR ratio), subdivided into classes characterized by motif sizes of 7 and above. The vertical dimension reports the PTR/TR ratio. The width corresponds to the motif size class. The depth corresponds to the number of repeating units.

**Figure 9**
**(A)** Ratio of the number of PTR over the number of candidate TR (PTR/TR ratio) per genomic location among the following genomic annotations: 5′-upstream, 5′-UTR, coding, introns, 3′-UTR, 3′-upstream. **(B)** Ratio of the number of PTR over the number of candidate TR (PTR/TR ratio) per genomic location among the following genomic annotations: genes, pseudogenes, miRNA, lncRNA, and lincRNA. Vertical bars show the 95% Confidence Interval estimated by bootstrapping.
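A bootstrap confidence interval for a PTR/TR ratio can be sketched as follows. This is a generic percentile bootstrap over per-locus polymorphism flags, offered as an assumption about how such an interval might be computed; the caption says only that the interval was "estimated by bootstrapping". The function name, the number of resamples, and the fixed seed are illustrative choices.

```python
import random


def bootstrap_ratio_ci(flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the fraction of True values in `flags`.

    `flags` holds one boolean per candidate TR (True = classified as PTR).
    Returns (lower, upper) bounds of the (1 - alpha) confidence interval.
    """
    n = len(flags)
    if n == 0:
        raise ValueError("need at least one candidate TR")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Resample the candidates with replacement and recompute the ratio each time.
    ratios = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples))
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return ratios[lower_idx], ratios[upper_idx]


if __name__ == "__main__":
    # Toy data: 30 polymorphic loci out of 100 candidates.
    candidates = [True] * 30 + [False] * 70
    low, high = bootstrap_ratio_ci(candidates)
    print(f"PTR/TR = 0.30, 95% CI = [{low:.2f}, {high:.2f}]")
```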

**Figure 10**
Number of detected TR polymorphisms classified as expansions (positive abscissa range) or contractions (negative abscissa range), by difference in the number of repeating units with respect to the same locus in the reference genome.
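The histogram described in this caption amounts to tallying signed differences in repeating units. A minimal Python sketch, assuming each locus is summarized as a hypothetical (reference units, target units) pair of integers:

```python
from collections import Counter


def tally_unit_differences(pairs):
    """Count polymorphisms by signed difference in repeating units.

    `pairs` is an iterable of (reference_units, target_units). Positive
    differences are expansions, negative ones are contractions; loci with
    no difference are not polymorphic and are skipped.
    """
    counts = Counter()
    for reference_units, target_units in pairs:
        delta = target_units - reference_units
        if delta != 0:
            counts[delta] += 1
    return counts


if __name__ == "__main__":
    toy = [(10, 12), (10, 8), (5, 5), (7, 9)]
    for delta, n in sorted(tally_unit_differences(toy).items()):
        kind = "expansion" if delta > 0 else "contraction"
        print(f"{delta:+d} units ({kind}): {n}")
```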

**Figure 11**
Differences in detecting candidate TR between census and landscape. On the abscissa: different thresholds for the Jaccard matching formula. **(A)** All TR, detection method for census: all algorithms. Overlap = census items present in the landscape; No Overlap = census items not present in the landscape. **(B)** All TR, detection method for census: all algorithms. Overlap = landscape items present in the census; No Overlap = landscape items not present in the census. **(C)** All TR, detection method for census: TRF only. Overlap = census items present in the landscape; No Overlap = census items not present in the landscape. **(D)** All TR, detection method for census: TRF only. Overlap = landscape items present in the census; No Overlap = landscape items not present in the census.

**Figure 12**
Differences in detecting PTR between census and landscape. On the abscissa: different thresholds for the Jaccard matching formula. **(A)** PTR, detection method for census: all algorithms. Overlap = census items present in the landscape; No Overlap = census items not present in the landscape. **(B)** PTR, detection method for census: all algorithms. Overlap = landscape items present in the census; No Overlap = landscape items not present in the census. **(C)** PTR, detection method for census: TRF only. Overlap = census items present in the landscape; No Overlap = census items not present in the landscape. **(D)** PTR, detection method for census: TRF only. Overlap = landscape items present in the census; No Overlap = landscape items not present in the census.

Cited by

ONT in Clinical Diagnostics of Repeat Expansion Disorders: Detection and Reporting Challenges.
Kaplun L, Krautz-Peterson G, Neerman N, Schindler Y, Dehan E, Huettner CS, Baumgartner BK, Stanley C, Kaplun A. Int J Mol Sci. 2025 Mar 18;26(6):2725. doi: 10.3390/ijms26062725. PMID: 40141365.
The Impact of SNCA Variations and Its Product Alpha-Synuclein on Non-Motor Features of Parkinson's Disease.
Magistrelli L, Contaldi E, Comi C. Life (Basel). 2021 Aug 9;11(8):804. doi: 10.3390/life11080804. PMID: 34440548. Review.
Non-canonical RNA-DNA differences and other human genomic features are enriched within very short tandem repeats.
Yu H, Zhao S, Ness S, Kang H, Sheng Q, Samuels DC, Oyebamiji O, Zhao YY, Guo Y. PLoS Comput Biol. 2020 Jun 8;16(6):e1007968. doi: 10.1371/journal.pcbi.1007968. eCollection 2020 Jun. PMID: 32511223.
DNA Hypermethylation and Unstable Repeat Diseases: A Paradigm of Transcriptional Silencing to Decipher the Basis of Pathogenic Mechanisms.
Poeta L, Drongitis D, Verrillo L, Miano MG. Genes (Basel). 2020 Jun 22;11(6):684. doi: 10.3390/genes11060684. PMID: 32580525. Review.
Genome assembly composition of the String "ACGT" array: a review of data structure accuracy and performance challenges.
Magdy Mohamed Abdelaziz Barakat S, Sallehuddin R, Yuhaniz SS, R Khairuddin RF, Mahmood Y. PeerJ Comput Sci. 2023 Jul 13;9:e1180. doi: 10.7717/peerj-cs.1180. eCollection 2023. PMID: 37547391.
