BMC Genomics. 2010 Jan 21;11:56.
doi: 10.1186/1471-2164-11-56.

Genome-wide computational prediction of tandem gene arrays: application in yeasts

Laurence Despons et al. BMC Genomics.

Abstract

Background: This paper describes an efficient in silico method for detecting tandem gene arrays (TGAs) in fully sequenced and compact genomes such as those of prokaryotes or unicellular eukaryotes. The originality of this method lies in searching for protein sequence similarities in the vicinity of each coding sequence, which allows tandem duplicated gene copies to be predicted regardless of whether they are functional.

Results: Applied to nine hemiascomycete yeast genomes, this method predicts that 2% of the genes are involved in TGAs and that gene relics are present in 11% of TGAs. The frequency of TGAs with degenerated gene copies means that a significant fraction of tandem duplicated genes follows the birth-and-death model of evolution. A comparison of sequence identity distributions between sets of homologous gene pairs shows that the copies of tandemly arrayed paralogs are less divergent than the copies of dispersed paralogs in yeast genomes. This suggests that paralogs included in tandem structures are more recent or more subject to gene conversion than other paralogs.

Conclusion: The method reported here is a useful computational tool to provide a database of TGAs composed of functional or nonfunctional gene copies. Such a database has obvious applications in the fields of structural and comparative genomics. Notably, a detailed study of the TGA catalog will make it possible to tackle the fundamental questions of the origin and evolution of tandem gene clusters.

Figures

Figure 1
Flowchart of the score calculation process.
Figure 2
Calculation of FTB scores. For each CDS and its upstream and downstream surrounding regions, the X (plus strand) and Y (minus strand) values are TBLASTN bit scores expressed as a percentage of the TBLASTN bit self-score: X or Y = (bit score/bit self-score) × 100. The differences X - Y and Y - X are calculated to reduce the background noise that could be due to the low complexity of some DNA regions. The result is four FTB scores per CDS, two per surrounding region: one with a value ≥ 0 and the other with the same magnitude but a value ≤ 0.
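The arithmetic in this caption can be sketched in a few lines of Python. This is only an illustration of the stated formula, not the authors' implementation: the function names, the dictionary keys and the example bit scores are all hypothetical, and the TBLASTN searches that would produce the bit scores are assumed to have been run separately.

```python
def ftb_pair(plus_bits, minus_bits, self_bits):
    """Return (X - Y, Y - X) for one flanking region.

    X and Y are the plus- and minus-strand TBLASTN bit scores expressed
    as a percentage of the CDS bit self-score, as in the caption.
    """
    if self_bits <= 0:
        raise ValueError("bit self-score must be positive")
    x = 100.0 * plus_bits / self_bits
    y = 100.0 * minus_bits / self_bits
    return x - y, y - x


def ftb_scores_for_cds(upstream, downstream, self_bits):
    """Return the four FTB scores of a CDS.

    `upstream` and `downstream` are (plus_bits, minus_bits) tuples.
    The key names below are hypothetical labels, not the paper's notation.
    """
    up_plus, up_minus = ftb_pair(upstream[0], upstream[1], self_bits)
    down_plus, down_minus = ftb_pair(downstream[0], downstream[1], self_bits)
    return {
        "up_plus": up_plus,
        "up_minus": up_minus,
        "down_plus": down_plus,
        "down_minus": down_minus,
    }


# Made-up bit scores, for illustration only.
scores = ftb_scores_for_cds(upstream=(50, 10), downstream=(0, 0), self_bits=200)
print(scores)
```

In this made-up example the upstream region gives X = 25 and Y = 5, so its two FTB scores are 20 and -20, while the downstream region with no hit gives two scores of 0.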
Figure 3
TGA extraction. For each CDS with at least one FTB score ≥ 10 (CDS n, represented by a black box), its FTB scores and those of its two adjacent CDSs (CDSs n-1 and n+1, represented by grey boxes) are compared to decide whether it belongs to a TGA and to determine its position within the TGA. CDSs constitute a TGA if their profile of FTB scores corresponds to one of those shown in the Table. The symbol + indicates an FTB score ≥ 10 and the symbol - an FTB score < 10. Subscript letters on an FTB score refer to the target sequence flanking the analyzed CDS (downstream or upstream, and the plus or minus strand). Three positions within a TGA are considered for a CDS: it begins the TGA, occupies a central position, or ends the TGA. The four possible orientations of two CDSs in a TGA are described for the case in which the analyzed CDS occupies the last position of the TGA. A TGA containing at least one CDS and one gene relic (represented by a grey striped box) fulfils the conditions indicated in the last two lines of the table.
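The first step of this extraction, selecting candidate CDSs and converting their FTB scores to + and - symbols, can be sketched as follows. The profile table itself is not reproduced on this page, so the sketch stops before the profile matching. The score keys, function names and example values are hypothetical; only the threshold of 10 comes from the caption.

```python
FTB_THRESHOLD = 10.0  # threshold stated in the caption


def score_symbols(scores):
    """Map each named FTB score to '+' (>= 10) or '-' (< 10)."""
    return {
        name: "+" if value >= FTB_THRESHOLD else "-"
        for name, value in scores.items()
    }


def candidate_windows(cds_scores):
    """Return (n-1, n, n+1) index triples for every candidate CDS n.

    `cds_scores` is a list of score dictionaries ordered along the
    chromosome. A CDS is a candidate if at least one of its FTB scores
    is >= 10. Neighbours missing at a chromosome end are given as None.
    """
    windows = []
    last = len(cds_scores) - 1
    for n, scores in enumerate(cds_scores):
        if any(value >= FTB_THRESHOLD for value in scores.values()):
            prev_index = n - 1 if n > 0 else None
            next_index = n + 1 if n < last else None
            windows.append((prev_index, n, next_index))
    return windows


# Made-up scores for three consecutive CDSs, for illustration only.
chromosome = [
    {"up_plus": 0.0, "up_minus": 0.0, "down_plus": 3.0, "down_minus": -3.0},
    {"up_plus": 2.0, "up_minus": -2.0, "down_plus": 35.0, "down_minus": -35.0},
    {"up_plus": 12.0, "up_minus": -12.0, "down_plus": 0.0, "down_minus": 0.0},
]
for window in candidate_windows(chromosome):
    print(window, score_symbols(chromosome[window[1]]))
```

Each resulting window would then be compared against the profiles of the table to decide whether CDS n starts, continues or ends a TGA.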
Figure 4
Identification of TGAs in hemiascomycete yeast genomes. The phylogram of the nine species considered is adapted from reference [41]. The whole genome duplication (WGD) event occurred before the divergence of S. cerevisiae and C. glabrata. The number of CDSs refers to the total number of annotated coding sequences taken into account when searching TGAs in each genome. All tandem arrays consisting of at least one CDS and one gene relic are counted as "TGAs with relic".
Figure 5
Distribution of TGAs according to the number and orientation of their constituent CDSs. The 469 TGAs are distributed among the nine yeast species studied. The direct orientation refers to TGAs in which all CDSs share the same orientation (sense => => or antisense <= <=). When TGAs are composed of two CDSs located on different DNA strands (convergent => <= or divergent <= =>), their orientation is called "opposite". Mixed TGAs contain at least one CDS pair in direct orientation and one CDS pair in opposite orientation. Sace: S. cerevisiae, Cagl: C. glabrata, Zyro: Z. rouxii, Klth: K. thermotolerans, Sakl: S. kluyveri, Klla: K. lactis, Asgo: A. gossypii, Deha: D. hansenii and Yali: Y. lipolytica.
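The three orientation classes can be expressed as a small classifier. This is a sketch under one stated assumption: "CDS pair" is read as any pair of CDSs in the array, so an array of three or more CDSs spanning both strands always contains a direct pair and an opposite pair and is therefore mixed. The function names and the "+"/"-" strand encoding are hypothetical, and arrays with fewer than two CDSs are not handled.

```python
def classify_orientation(strands):
    """Classify a TGA as 'direct', 'opposite' or 'mixed'.

    `strands` lists the strand ('+' or '-') of each CDS in chromosomal
    order. Assumption: 'CDS pair' means any pair of CDSs in the array.
    """
    if len(strands) < 2:
        raise ValueError("orientation needs at least two CDSs")
    if any(s not in ("+", "-") for s in strands):
        raise ValueError("strands must be '+' or '-'")
    if len(set(strands)) == 1:
        return "direct"      # => => or <= <=
    if len(strands) == 2:
        return "opposite"    # => <= or <= =>
    return "mixed"           # both direct and opposite pairs present


def opposite_subtype(strands):
    """Return 'convergent' (=> <=) or 'divergent' (<= =>) for a two-CDS TGA."""
    if classify_orientation(strands) != "opposite":
        raise ValueError("subtype is only defined for opposite TGAs")
    return "convergent" if list(strands) == ["+", "-"] else "divergent"


print(classify_orientation(["+", "+", "+"]))
print(classify_orientation(["-", "+"]), opposite_subtype(["-", "+"]))
print(classify_orientation(["+", "-", "-"]))
```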
Figure 6
Comparisons of FTB score pairs. (A) The pairs of FTB scores considered per TGA. Only TGAs with more than one CDS that were not manually corrected (349 TGAs in total for the 9 genomes) were retained for the analysis. For each of these TGAs, S2 and S3 correspond to FTB scores ≥ 10 associated with each pair of duplicated CDSs, whereas S1 and S4 are FTB scores ≥ 0 of CDSs located at the extremities of the TGA. S1 concerns the region upstream from the TGA and S4 the region downstream. (B) Comparison of S2/S3 score pairs. (C) Comparison of S1/S4 score pairs.
Figure 7
Distribution of protein identities between pairs of homologous genes. Protein identity distribution was computed for pairs of homologs from BLASTP comparisons (see Methods) and plotted as a kernel density estimation plot. Average distributions for the nine yeast species analyzed were calculated for both TGA members and other paralogous genes. All pairwise alignments between two species were performed for orthologs, but only the distributions calculated from three species are plotted as examples (the others show a similar profile). KLTH: K. thermotolerans, ZYRO: Z. rouxii and SAKL: S. kluyveri.
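For readers unfamiliar with kernel density estimation, the following sketch shows how a smooth identity distribution can be obtained from a list of pairwise identities. The caption does not state the kernel or bandwidth used, so the Gaussian kernel and the bandwidth of 5 percentage points are assumptions, and the identity values are made up for illustration.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a density function estimated with a Gaussian kernel.

    Assumption: Gaussian kernel with a fixed bandwidth; the paper's
    actual KDE settings are not given in the caption.
    """
    if not values:
        raise ValueError("at least one value is required")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump centred on each observed value.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Made-up percent identities for a handful of homolog pairs.
identities = [35.0, 42.0, 48.0, 75.0, 88.0, 91.0, 97.0]
density = gaussian_kde(identities, bandwidth=5.0)
for x in range(0, 101, 20):
    print(x, round(density(x), 5))
```

Evaluating `density` on a grid from 0 to 100 gives the curve that would be plotted, one curve per set of homolog pairs.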

References

    1. Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science. 2000;290(5494):1151–1155. doi: 10.1126/science.290.5494.1151.
    2. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30(7):1575–1584. doi: 10.1093/nar/30.7.1575.
    3. Li WH, Gu Z, Cavalcanti AR, Nekrutenko A. Detection of gene duplications and block duplications in eukaryotic genomes. J Struct Funct Genomics. 2003;3(1-4):27–34. doi: 10.1023/A:1022644628861.
    4. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–3402. doi: 10.1093/nar/25.17.3389.
    5. Zhang L, Gaut BS. Does recombination shape the distribution and evolution of tandemly arrayed genes (TAGs) in the Arabidopsis thaliana genome? Genome Res. 2003;13(12):2533–2540. doi: 10.1101/gr.1318503.