Front Genet. 2018 May 2;9:155.
doi: 10.3389/fgene.2018.00155. eCollection 2018.

A Census of Tandemly Repeated Polymorphic Loci in Genic Regions Through the Comparative Integration of Human Genome Assemblies


Loredana M Genovese et al. Front Genet.

Abstract

Polymorphic Tandem Repeats (PTR) are a common form of polymorphism in the human genome. A PTR is a variation, found in an individual or in a population, in the number of repeating units of a Tandem Repeat (TR) locus with respect to the reference genome. Several phenotypic traits and diseases have been found to be strongly associated with, or caused by, specific PTR loci. PTR are divided into two main classes: Short Tandem Repeats (STR), in which the repeating unit is up to 6 base pairs long, and Variable Number Tandem Repeats (VNTR), in which the repeating unit is longer than 6 base pairs. As ever larger populations are screened by high-throughput sequencing projects, it becomes technically feasible and desirable to explore the association between PTR and a wide range of such traits and conditions. To facilitate these studies, we have devised a method for compiling catalogs of PTR from assembled genomes, and we have produced a catalog of PTR for genic regions (exons, introns, UTR and adjacent regions) of the human genome (GRCh38). In the first phase, we applied four different TR discovery software tools to uncover 55,223,485 TR (after duplicate removal) in GRCh38; in the second phase, 373,173 of these were determined to be PTR by comparison with five assembled human genomes. Of these, 263,266 are not included in state-of-the-art PTR catalogs. The new methodology is based mainly on a hierarchical and systematic application of alignment-based sequence comparisons to identify and measure the polymorphism of TR. While previous catalogs focus on STR of small total size, we remove any size restrictions, aiming at the more general class of PTR, and we also target fuzzy TR by using specific detection tools. Like previous catalogs of human polymorphic loci, we focus our catalog on applications in the discovery of disease-associated loci. Validation by cross-referencing with existing catalogs on common clinically-relevant loci shows good concordance. Overall, the proposed census of human PTR in genic regions is a shared, web-accessible resource, complementary to existing catalogs, that facilitates future genome-wide studies involving PTR.
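
The two definitions in the abstract (STR versus VNTR by repeating-unit size, and polymorphism as a change in the number of repeating units relative to the reference) can be illustrated with a minimal Python sketch. The function names `classify_repeat` and `is_polymorphic` are hypothetical, and the sketch assumes repeat-unit counts are already known; the published pipeline measures them through alignment-based comparisons, which this example does not attempt to reproduce.

```python
from typing import Iterable


def classify_repeat(unit_size_bp: int) -> str:
    """Return 'STR' for repeating units up to 6 bp, 'VNTR' for longer units."""
    if unit_size_bp < 1:
        raise ValueError("repeat unit size must be at least 1 bp")
    return "STR" if unit_size_bp <= 6 else "VNTR"


def is_polymorphic(ref_units: int, target_units: Iterable[int]) -> bool:
    """True if any target genome has a different unit count than the reference.

    Assumption: unit counts are given as integers; how they are measured
    is outside the scope of this sketch.
    """
    return any(units != ref_units for units in target_units)


if __name__ == "__main__":
    print(classify_repeat(3))            # unit of 3 bp
    print(classify_repeat(24))           # unit of 24 bp
    print(is_polymorphic(10, [10, 12]))  # one target differs from the reference
```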

Keywords: catalog; fuzzy tandem repeats; genic regions; measure of polymorphism; polymorphic tandem repeats; short tandem repeats; tandem repeat detection tools; variable number tandem repeats.


Figures

Figure 1
Overall computational pipeline. Green ovals represent the main input data sets, blue rectangles represent computational steps, and the red rectangle marks a key intermediate result. Note that the TR discovery tools are applied only to the reference genome (GRCh38) in the initial part of the pipeline. The target genomes are used in the second part of the pipeline, as input for alignment-based procedures, to measure polymorphism of the candidate TR loci.
Figure 2
A TR may belong to several overlapping transcripts mapped in the reference genome (top of the figure). However, because transcripts may map to non-overlapping genomic locations in the target genomes (bottom of the figure), a single TR on the reference GRCh38 may be associated with several TR on each target genome.
Figure 3
Quality assessment for the measurement of TR expansion/contraction. Sequence S denotes a model TR in the reference genome; sequence T denotes the corresponding TR in a target genome. Subsequence T1 matches S1 in the first alignment, and subsequence T2 matches S2 in the second alignment. When S2 is a prefix of S1 and T1 is adjacent to T2, the match is high quality (left drawing). When S2 is a prefix of S1 and there is a small gap between T1 and T2, the match is medium quality (middle drawing). When there is a gap between S1 and S2 and a gap between T1 and T2, the match is low quality (right drawing).
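
The three quality classes in this caption can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Span` type, the function names, and the `small_gap` threshold of 10 bases are hypothetical (the caption does not say how small a "small gap" is), "prefix" is interpreted as sharing the same start coordinate, and configurations the caption does not describe are returned as "unclassified".

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # 0-based, inclusive
    end: int    # exclusive


def gap(a: Span, b: Span) -> int:
    """Bases between the end of a and the start of b (negative if they overlap)."""
    return b.start - a.end


def is_prefix(inner: Span, outer: Span) -> bool:
    """True if inner starts where outer starts and does not extend past it."""
    return inner.start == outer.start and inner.end <= outer.end


def match_quality(s1: Span, s2: Span, t1: Span, t2: Span, small_gap: int = 10) -> str:
    """Classify a pair of alignments (S1~T1, S2~T2) as high, medium, or low quality.

    small_gap is an assumed threshold, not a value taken from the source.
    """
    t_gap = gap(t1, t2)
    if is_prefix(s2, s1):
        if t_gap == 0:
            return "high"      # T1 is adjacent to T2
        if 0 < t_gap <= small_gap:
            return "medium"    # small gap between T1 and T2
    elif gap(s1, s2) > 0 and t_gap > 0:
        return "low"           # gaps on both reference and target
    return "unclassified"      # case not covered by the caption


if __name__ == "__main__":
    s1, t1 = Span(0, 30), Span(100, 130)
    print(match_quality(s1, Span(0, 12), t1, Span(130, 142)))
    print(match_quality(s1, Span(0, 12), t1, Span(135, 147)))
    print(match_quality(s1, Span(35, 50), t1, Span(140, 155)))
```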
Figure 4
Distribution of PTR in the census relative to the computational pipeline. In the discovery pipeline, the initial part produces a list of candidate TR by using four TR discovery tools. In the second part, these candidates are classified as polymorphic (and thus included in the output PTR listing) or not polymorphic by using five target genomes. (A) Numbers obtained by tracing back each PTR locus to the tool (or tools) that discovered the associated TR candidate. (B) Numbers obtained by tracing back each PTR locus to the target genome (or genomes) giving the evidence that led to classifying it as polymorphic. Note that each PTR may be counted in more than one column of subfigures (A,B).
Figure 5
Distribution of PTR in the census relative to the computational pipeline. In the discovery pipeline, the initial part produces a list of candidate TR by using four TR discovery tools. In the second part, these candidates are classified as polymorphic (and thus included in the PTR listing) or not polymorphic by using five target genomes. (A) Numbers obtained by tracing back each PTR locus to the single tool that discovered the associated TR candidate, or to multiple tools. (B) Numbers obtained by tracing back each PTR locus to the single target genome giving the evidence that led to classifying it as polymorphic, or to multiple genomes. Two TR are treated as the same TR using a Jaccard coefficient threshold j = 0.7. Note that in the pie charts (A,B) each PTR is counted in one and only one category.
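
As an illustration of the Jaccard threshold mentioned in this caption, the sketch below computes the Jaccard coefficient of two genomic intervals and compares it with j = 0.7. It rests on assumptions not stated in the caption: that the coefficient is computed on the base-pair coordinates of two intervals on the same chromosome, that intervals are half-open, and that the comparison is "greater than or equal to". The function names are hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), half-open, same chromosome assumed


def jaccard(a: Interval, b: Interval) -> float:
    """Jaccard coefficient of two intervals: shared bases / bases covered by either."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def same_tr(a: Interval, b: Interval, threshold: float = 0.7) -> bool:
    """Treat two TR as the same locus when their Jaccard coefficient reaches the threshold."""
    return jaccard(a, b) >= threshold


if __name__ == "__main__":
    print(jaccard((100, 200), (110, 205)))   # 90 shared bases out of 105 covered
    print(same_tr((100, 200), (110, 205)))
    print(same_tr((100, 200), (120, 220)))   # 80 shared bases out of 120 covered
```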
Figure 6
(A) Average number of PTR per Mbp of genic regions (merging of gene transcripts) in each chromosome. Autosomes are numbered from 1 to 22; the sex chromosomes are X and Y. Chromosomes are sorted by decreasing PTR density. (B) Distribution of PTR by number of repeating units (in the reference genome). (C) Distribution of PTR by total size of the PTR loci (in the reference genome), in logarithmic scale. (D) Classification of 37 clinically relevant loci as PTR in the census and landscape catalogs.
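
The density in panel (A) is a count divided by the length of the merged genic regions. A minimal sketch of that arithmetic follows, assuming transcripts are given as half-open (start, end) intervals on one chromosome; the function names and the toy numbers are hypothetical, not data from the census.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def ptr_density_per_mbp(n_ptr: int, transcripts: Iterable[Interval]) -> float:
    """Number of PTR per megabase of merged genic sequence."""
    total_bp = sum(end - start for start, end in merge_intervals(transcripts))
    if total_bp == 0:
        raise ValueError("no genic sequence to normalize by")
    return n_ptr / (total_bp / 1_000_000)


if __name__ == "__main__":
    # Two overlapping toy transcripts that together cover 2 Mbp.
    toy_transcripts = [(0, 1_000_000), (500_000, 2_000_000)]
    print(ptr_density_per_mbp(3, toy_transcripts))
```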
Figure 7
Ratio of the number of PTR over the number of candidate TR (PTR/TR ratio), subdivided into classes characterized by the size of the motif in the range 2–6 (color) and by the number of repeating units in the reference genome (abscissa). For each point in the graph, the corresponding 95% Confidence Interval (CI) is given as a vertical bar.
Figure 8
Ratio of the number of PTR over the number of candidate TR (PTR/TR ratio), subdivided into classes characterized by motif sizes of 7 or more. The vertical dimension reports the PTR/TR ratio. The width corresponds to the motif size class. The depth corresponds to the number of repeating units.
Figure 9
(A) Ratio of the number of PTR over the number of candidate TR (PTR/TR ratio) per genomic location among the following genomic annotations: 5′-upstream, 5′-UTR, coding, introns, 3′-UTR, 3′-upstream. (B) PTR/TR ratio per genomic location among the following genomic annotations: genes, pseudogenes, miRNA, lncRNA, and lincRNA. Vertical bars show the 95% Confidence Interval estimated by bootstrapping.
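
A bootstrap confidence interval for a PTR/TR ratio can be sketched as follows. This is a generic percentile bootstrap, offered only as one plausible reading of "estimated by bootstrapping"; the caption does not say which bootstrap variant, how many resamples, or what resampling unit the authors used. The function name, the default of 2000 resamples, and the toy data are assumptions.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ratio_ci(
    is_ptr: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (ratio, lower, upper) for the fraction of candidate TR that are PTR.

    is_ptr holds one flag per candidate TR: 1 if polymorphic, 0 otherwise.
    The interval is a percentile bootstrap over candidates (an assumption).
    """
    n = len(is_ptr)
    if n == 0:
        raise ValueError("need at least one candidate TR")
    rng = random.Random(seed)
    # Resample candidates with replacement and recompute the ratio each time.
    ratios = sorted(sum(rng.choices(is_ptr, k=n)) / n for _ in range(n_resamples))
    lower = ratios[round((alpha / 2) * n_resamples)]
    upper = ratios[round((1 - alpha / 2) * n_resamples) - 1]
    return sum(is_ptr) / n, lower, upper


if __name__ == "__main__":
    # Toy annotation class: 30 PTR among 100 candidate TR.
    flags = [1] * 30 + [0] * 70
    ratio, lower, upper = bootstrap_ratio_ci(flags)
    print(f"PTR/TR = {ratio:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```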
Figure 10
Number of detected TR polymorphisms, classified as expansions (positive abscissa) or contractions (negative abscissa), by difference in the number of repeating units with respect to the same locus in the reference genome.
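
The histogram described in this caption can be sketched by tallying, for each locus and target genome, the signed difference in repeating units relative to the reference. The function name and the input format (pairs of reference and target unit counts) are hypothetical, and the sketch assumes integer unit counts.

```python
from collections import Counter
from typing import Iterable, Tuple


def polymorphism_histogram(pairs: Iterable[Tuple[int, int]]) -> Counter:
    """Count polymorphisms by signed unit difference (target minus reference).

    Positive keys are expansions, negative keys are contractions;
    loci with no difference are not polymorphic and are skipped.
    """
    histogram: Counter = Counter()
    for ref_units, target_units in pairs:
        difference = target_units - ref_units
        if difference != 0:
            histogram[difference] += 1
    return histogram


if __name__ == "__main__":
    # Toy (reference, target) unit counts.
    observations = [(10, 12), (10, 8), (5, 5), (7, 9)]
    for difference, count in sorted(polymorphism_histogram(observations).items()):
        kind = "expansion" if difference > 0 else "contraction"
        print(difference, kind, count)
```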
Figure 11
Differences in detecting candidate TR between census and landscape. In abscissa: different thresholds for the Jaccard matching formula. (A) All TR, census detection method: all algorithms. Overlap = census items present in the landscape; No Overlap = census items not present in the landscape. (B) All TR, census detection method: all algorithms. Overlap = landscape items present in census; No Overlap = landscape items not present in census. (C) All TR, census detection method: TRF only. Overlap = census items present in the landscape; No Overlap = census items not present in the landscape. (D) All TR, census detection method: TRF only. Overlap = landscape items present in census; No Overlap = landscape items not present in census.
Figure 12
Differences in detecting PTR between census and landscape. In abscissa: different thresholds for the Jaccard matching formula. (A) PTR, census detection method: all algorithms. Overlap = census items present in the landscape; No Overlap = census items not present in the landscape. (B) PTR, census detection method: all algorithms. Overlap = landscape items present in census; No Overlap = landscape items not present in census. (C) PTR, census detection method: TRF only. Overlap = census items present in the landscape; No Overlap = census items not present in the landscape. (D) PTR, census detection method: TRF only. Overlap = landscape items present in census; No Overlap = landscape items not present in census.

