Database of Trypanosoma cruzi repeated genes: 20,000 additional gene variants

Erik Arner¹, Ellen Kindlund, Daniel Nilsson, Fatima Farzana, Marcela Ferella, Martti T Tammi, Björn Andersson

PMID: 17963481
PMCID: PMC2204015
DOI: 10.1186/1471-2164-8-391

Erik Arner et al. BMC Genomics. 2007 Oct 26;8:391. doi: 10.1186/1471-2164-8-391.

Affiliation

¹ Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm, Sweden. erik.arner@ki.se

Abstract

Background: Repeats are present in all genomes and often have important functions. However, in large genome sequencing projects, many repetitive regions remain uncharacterized. More than 50% of the genome of the protozoan parasite Trypanosoma cruzi consists of repeats, which include surface molecule genes and several other gene families. During the T. cruzi genome sequencing project, it was clear that not all copies of repetitive genes were present in the assembly, because nearly identical repeats had collapsed. However, at the time the T. cruzi genome was published, the extent of this collapse was not known.

Results: We have developed a pipeline to estimate the genomic repeat content, in which shotgun reads are aligned to the genomic sequence and the gene copy number is estimated using the average shotgun coverage. This method was applied to the genome of T. cruzi, and copy numbers of all protein coding sequences and pseudogenes were estimated. The 22,640 results were stored in a database available online. Of all protein coding sequences and pseudogenes, 18% were estimated to exist in 14 or more copies in the T. cruzi CL Brener genome. The average coverage of the annotated protein coding sequences and pseudogenes indicates a total gene copy number, including allelic gene variants, of over 40,000.
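The coverage-based estimate described above can be illustrated with a minimal sketch. It assumes the rule stated in the Figure 2 caption (average alignment depth along an annotation divided by 7, the average shotgun coverage) and uses the depths of 9 and 103 quoted in the Figure 3 caption. The function name and the flat per-base depth lists are hypothetical, chosen only for illustration; this is not the authors' pipeline.

```python
from statistics import mean

# Average shotgun coverage quoted in the Figure 2 caption.
AVERAGE_SHOTGUN_COVERAGE = 7.0


def estimate_copy_number(depths, coverage=AVERAGE_SHOTGUN_COVERAGE):
    """Estimate copy number as mean alignment depth divided by average coverage.

    `depths` is a hypothetical list of per-base alignment depths
    along one annotation.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return mean(depths) / coverage


# Illustrative inputs: uniform depths matching the two Figure 3 examples.
unique_gene = [9] * 1000      # average depth 9
repeated_gene = [103] * 1000  # average depth 103

print(round(estimate_copy_number(unique_gene)))    # 9 / 7 rounds to 1
print(round(estimate_copy_number(repeated_gene)))  # 103 / 7 rounds to 15
```

Rounding to the nearest integer reproduces the copy numbers given in the Figure 3 caption; real depth profiles vary along a gene, so the mean is only a summary.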

Conclusion: Our results indicate that the number of protein coding sequences and pseudogenes in the T. cruzi genome may be twice the previous estimate. We have constructed a database of the T. cruzi gene repeat data that is available as a resource to the community. The main purpose of the database is to enable biologists interested in repeated, unfinished regions to closely examine and resolve these regions themselves using all available shotgun data, instead of having to rely on annotated consensus sequences that are often erroneous and possibly misleading. Five repetitive genes were studied in more detail to illustrate how the database can be used to analyze and extract information about gene repeats with different characteristics in Trypanosoma cruzi.

Figures

**Figure 2**
**Distribution of estimated copy numbers**. The estimated copy number is calculated for each annotation by averaging the alignment depth along the annotation and dividing the average by 7, the average shotgun coverage. The distributions of all annotations (red), of all hypothetical genes (green) and of all trans-sialidase annotations (blue) are shown. Each distribution shows the number of annotations (y-axis) for each estimate (x-axis). The graph to the left (A) shows the peak of the first two distributions at 2. The graph to the right (B) is a zoomed-in version of the higher estimates. The average estimated copy number of the trans-sialidases is 16.

**Figure 3**
**Coverage of two trans-sialidases**. A and B show two contigs with annotated putative trans-sialidases. C and D show the coverage every 100 bp along the genes. Tc00.1047053511875 (A, C) has an average shotgun depth of 9, indicating only one copy in the genome. Tc00.1047053511105.60 (B, D) has an average depth of 103, indicating that 15 copies are actually present in the genome. This example shows how trans-sialidases in *T. cruzi* can be either unique in terms of sequence similarity (Tc00.1047053511875.20) or closely resemble many others (Tc00.1047053511105.60).
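As a rough illustration of a coverage profile sampled every 100 bp, the sketch below averages per-base depth over consecutive 100 bp windows. This is one possible reading of the caption, not a description of how the figure was produced; the function name and input depths are hypothetical.

```python
def windowed_coverage(depths, window=100):
    """Return the mean depth of each consecutive window along an annotation.

    `depths` is a hypothetical list of per-base alignment depths.
    The last window may be shorter than `window`.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    profile = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        profile.append(sum(chunk) / len(chunk))
    return profile


# Illustrative input: 250 bp with three different depth levels.
depths = [8] * 100 + [10] * 100 + [9] * 50
print(windowed_coverage(depths))  # [8.0, 10.0, 9.0]
```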

**Figure 4**
**Active and inactive copies of trans-sialidase**. Comparison of two trans-sialidase repeat groups in DNPTrapper. Boxes indicate reads, and colored dots indicate DNPs. Only part of the alignment is shown. The reads have been clustered in DNPTrapper based on their DNP content, with reads sharing similar DNP patterns being grouped together. The lower group contains a C – T base substitution (circled) that corresponds to a Tyr – His substitution in the protein, causing this repeat copy to lose its trans-sialidase activity.
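The idea of grouping reads by shared DNP patterns can be sketched in a deliberately simplified form. The example below groups aligned reads whose bases match exactly at a given set of DNP columns; it is not DNPTrapper's clustering method, which groups reads with similar rather than identical patterns. The read names, sequences, and column indices are hypothetical.

```python
from collections import defaultdict


def group_reads_by_dnp(reads, dnp_columns):
    """Group aligned reads by the bases they carry at the DNP columns.

    `reads` maps a read name to its aligned sequence (all the same length);
    `dnp_columns` lists 0-based alignment columns treated as DNPs.
    Exact-match grouping is a simplification of similarity-based clustering.
    """
    groups = defaultdict(list)
    for name, sequence in reads.items():
        pattern = "".join(sequence[col] for col in dnp_columns)
        groups[pattern].append(name)
    return dict(groups)


# Illustrative alignment: columns 1 and 4 differ between two repeat copies.
reads = {
    "r1": "ACGTAC",
    "r2": "ACGTAC",
    "r3": "ATGTGC",
    "r4": "ATGTGC",
}
print(group_reads_by_dnp(reads, dnp_columns=[1, 4]))
# {'CA': ['r1', 'r2'], 'TG': ['r3', 'r4']}
```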

**Figure 5**
**Protein sequence alignments of gene with transmembrane regions**. Protein sequence alignment of 17 good coverage groups, from EAN81429.1. The names of the amino acid sequences to the left represent the read group from which each consensus sequence was derived. Boxes show a region of the sequences with two predicted transmembrane helices (TMH2 and TMH3). The arrows indicate positions inside the TMH regions where there is an amino acid change. Notably, most of the differences are seen in TMH2, with fewer in TMH3. Identical residues are shaded. Numbers on the left show the sequence position.
