SoftwareX. 2017:6:165-171.
doi: 10.1016/j.softx.2017.06.006. Epub 2017 Aug 16.

The Co-regulation Data Harvester: automating gene annotation starting from a transcriptome database

Lev M Tsypin et al. SoftwareX. 2017.

Abstract

Identifying co-regulated genes provides a useful approach for defining pathway-specific machinery in an organism. To be efficient, this approach relies on thorough genome annotation, a process much slower than genome sequencing per se. Tetrahymena thermophila, a unicellular eukaryote, has been a useful model organism and has a fully sequenced but sparsely annotated genome. One important resource for studying this organism has been an online transcriptomic database. We have developed an automated approach to gene annotation in the context of transcriptome data in T. thermophila, called the Co-regulation Data Harvester (CDH). Beginning with a gene of interest, the CDH identifies co-regulated genes by accessing the Tetrahymena transcriptome database. It then identifies their closely related genes (orthologs) in other organisms by using reciprocal BLAST searches. Finally, it collates the annotations of those orthologs' functions, which provides the user with information to help predict the cellular role of the initial query. The CDH, which is freely available, represents a powerful new tool for analyzing cell biological pathways in Tetrahymena. Moreover, to the extent that genes and pathways are conserved between organisms, the inferences obtained via the CDH should be relevant, and can be explored, in many other systems.
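
The reciprocal BLAST step can be illustrated with a minimal sketch of the common "reciprocal best hit" heuristic. This is an assumption about how such a check is typically done, not the CDH's actual code: the function names, the `(subject, bit score)` data layout, and all gene identifiers below are hypothetical, and the BLAST results are represented as in-memory dictionaries rather than fetched from NCBI or the TGD.

```python
def best_hit(hits):
    """Return the subject ID with the highest bit score, or None if there are no hits."""
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[1])[0]


def reciprocal_best_hits(forward, reverse):
    """Pair each query gene with a putative ortholog.

    forward: query gene -> list of (subject ID, bit score) from the forward search.
    reverse: subject ID -> list of (gene ID, bit score) from the search back
             against the query organism.
    A pair is kept only if the best reverse hit of the best forward hit
    is the original query gene.
    """
    orthologs = {}
    for gene, hits in forward.items():
        candidate = best_hit(hits)
        if candidate is None:
            continue
        if best_hit(reverse.get(candidate, [])) == gene:
            orthologs[gene] = candidate
    return orthologs


# Made-up example data (placeholders, not real BLAST output).
forward = {
    "geneA": [("orgX_1", 250.0), ("orgX_2", 90.0)],
    "geneB": [("orgX_3", 120.0)],
}
reverse = {
    "orgX_1": [("geneA", 240.0)],
    "orgX_3": [("geneC", 130.0), ("geneB", 110.0)],
}
putative_orthologs = reciprocal_best_hits(forward, reverse)
```

In this toy input, `geneB` is dropped because the best hit of `orgX_3` back in the query organism is a different gene, which is the situation the reciprocal search is meant to catch.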

Keywords: Automation; Bioinformatics; Evolution; Protists.

Figures

Figure 1
CDH architecture. Beginning with a single T. thermophila gene as a query, the CDH identifies all genes that are co-regulated with it, via the TetraFGD. Next, the CDH uses the TGD to gather the annotation and sequence data for each gene in the co-regulated set. For each of these genes, the CDH then runs forward and reciprocal BLAST searches, through the NCBI and TGD, to identify likely orthologs. A phrase-matching algorithm, based on the Ratcliff/Obershelp algorithm [21] as implemented by the Python difflib library, is then used to summarize the annotations of the putative orthologs for each T. thermophila gene in the co-regulated set. These summaries, which provide predictions about the function (e.g., relevant biological pathway) of the T. thermophila gene query, are presented in the final report along with the other data gathered.
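
As a rough illustration of phrase matching with difflib, the sketch below scores pairs of annotation strings with `difflib.SequenceMatcher` and picks the annotation that is, on average, most similar to the rest. The selection rule, the function names, and the example annotations are assumptions made for illustration; the caption does not specify how the CDH combines the similarity scores.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def most_representative(annotations):
    """Return the annotation with the highest mean similarity to the others.

    Returns None for an empty list and the sole entry for a one-item list.
    """
    if not annotations:
        return None
    if len(annotations) == 1:
        return annotations[0]

    def mean_similarity(index):
        others = [a for i, a in enumerate(annotations) if i != index]
        return sum(similarity(annotations[index], o) for o in others) / len(others)

    best_index = max(range(len(annotations)), key=mean_similarity)
    return annotations[best_index]


# Hypothetical ortholog annotations for one gene.
annotations = [
    "sortilin-related receptor",
    "sortilin related receptor 1",
    "hypothetical protein",
]
summary = most_representative(annotations)
```

With these inputs the uninformative "hypothetical protein" label is outvoted by the two near-identical sortilin annotations, which is the kind of consensus a summary step would be expected to surface.
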
Figure 2
Setting CDH search parameters. The CDH is run through the terminal and prompts the user to define several parameters: 1) the initial gene, i.e., the query; 2) the z-score threshold to be applied as a cutoff for strength of co-regulation; 3) the extent to which data gathered in prior searches should be used; 4) whether results should be stored in Dropbox; 5) whether to run BLAST searches with cDNA or protein sequences; and 6) in which taxa to run the BLAST searches. For (2), the z-score threshold determines how many co-regulated genes will be subject to analysis via homology. Raising the threshold makes the requirement for strength of co-regulation more stringent, so fewer co-regulated genes are subsequently analyzed via BLAST, etc. For (3), the available options are: a) to run the search from scratch, overwriting any files associated with the queried gene; b) to re-use existing data for co-regulation, annotations, and sequences, but to run all of the BLAST searches from scratch; c) to re-use any existing data that are pertinent to the given query; d) to clear NCBI database errors from a previously run search and redo the associated BLAST searches; e) to run only the search for co-regulation, annotation, and sequence data. The example query in this screenshot is set to run a CDH search for the gene TTHERM_00313130 (Sortilin 4); to consider genes that are co-regulated with it with a z-score of 5 or higher; to gather all of the data de novo; to save all of the data locally; and to run the BLASTp searches only within the Ciliates.
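
The effect of the z-score threshold in (2) can be shown with a small sketch. It assumes the co-regulation data are already available as a mapping from gene identifier to z-score; the function name, the mapping, and the values are hypothetical and are not taken from the TetraFGD.

```python
def filter_coregulated(z_by_gene, threshold):
    """Return genes whose z-score is at or above the threshold, strongest first."""
    kept = [(gene, z) for gene, z in z_by_gene.items() if z >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [gene for gene, _ in kept]


# Made-up z-scores for four placeholder genes.
example_scores = {"gene_a": 9.1, "gene_b": 5.0, "gene_c": 4.9, "gene_d": 6.3}

at_threshold_5 = filter_coregulated(example_scores, 5.0)
at_threshold_7 = filter_coregulated(example_scores, 7.0)
```

Here a threshold of 5 keeps three genes (a score of exactly 5 passes, matching "5 or higher"), while a threshold of 7 keeps only one, so fewer genes would go on to the BLAST stage.
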
Figure 3
Using CDH outputs to assess overlap in gene function. Panels A, B, and C illustrate the overlaps in co-regulated genes for three different cellular pathways: A) nuclear import and transcriptional regulation; B) programmed genome rearrangement during cell conjugation; and C) mucocyst biogenesis. Each circle in the Venn diagrams corresponds to the full set of genes, as reported by the TetraFGD, that are co-regulated with the gene indicated at the periphery of the circle. (A) NUP50 (Nucleoporin 50) plays roles both in nuclear import and in gene transcription. The dual role of NUP50 is reflected in the overlap of genes co-regulated with Importinβ (an import factor) and with RPB81 (an RNA Pol II subunit), which functions in transcription. NUP50, RPB81, and Importinβ are mutually co-regulated. The CDH identifies 214 genes co-regulated with the nucleoporin NUP50, 444 genes co-regulated with RPB81, and 200 genes co-regulated with Importinβ. (B) TWI1 (Tetrahymena Piwi 1), GIW1 (Gentleman in Waiting 1), and DCL1 (Dicer-like 1) are all required for programmed genome rearrangement, and are mutually co-regulated. The CDH identifies 932 genes co-regulated with TWI1, 999 genes co-regulated with GIW1, and 814 genes co-regulated with DCL1. (C) CTH3 (cathepsin 3), APM3 (μ subunit of the adaptin 3 complex), and SOR4 (sortilin 4) are all required for formation of mucocysts, and are mutually co-regulated. These genes also appear to have distinct cellular functions in addition to their roles in mucocyst formation. For example, mucocysts are non-essential organelles, yet CTH3 is an essential gene. The CDH identifies 213 genes co-regulated with CTH3, 201 genes co-regulated with APM3, and 203 genes co-regulated with SOR4. (D) Pooling all of the genes represented in A, B, and C demonstrates that there is no overlap in co-regulated genes between A and either B or C, and limited overlap between B and C.
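
The region counts behind a three-set Venn diagram of this kind can be computed with plain set operations. The sketch below is illustrative only: the function name and the tiny placeholder gene sets are assumptions, and the numbers do not correspond to the gene counts reported in the caption.

```python
def venn_regions(a, b, c):
    """Count the genes in each of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "A only": len(a - b - c),
        "B only": len(b - a - c),
        "C only": len(c - a - b),
        "A&B only": len((a & b) - c),
        "A&C only": len((a & c) - b),
        "B&C only": len((b & c) - a),
        "A&B&C": len(a & b & c),
    }


# Placeholder co-regulated gene sets for three hypothetical query genes.
set_a = {"g1", "g2", "g3", "g4"}
set_b = {"g3", "g4", "g5", "g6", "g7"}
set_c = {"g2", "g4", "g8"}

regions = venn_regions(set_a, set_b, set_c)
```

The seven region counts always sum to the size of the union of the three sets, which is a quick sanity check when comparing pooled gene lists as in panel D.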

References

    1. Witzany G, Nowacki M, editors. Biocommunication of Ciliates. Springer; 2016.
    2. Greider CW, Blackburn EH. Identification of a specific telomere terminal transferase activity in Tetrahymena extracts. Cell. 1985;43(2):405–413. doi: 10.1016/0092-8674(85)90170-9.
    3. Kruger K, Grabowski PJ, Zaug AJ, Sands J, Gottschling DE, Cech TR. Self-splicing RNA: Autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena. Cell. 1982;31(1):147–157. doi: 10.1016/0092-8674(82)90414-7.
    4. Gibbons I, Rowe A. Dynein: a protein with adenosine triphosphatase activity from cilia. Science. 1965;149(3682):424–426.
    5. Brownell JE, Zhou J, Ranalli T, Kobayashi R, Edmondson DG, Roth SY, Allis C. Tetrahymena histone acetyltransferase A: A homolog to yeast Gcn5p linking histone acetylation to gene activation. Cell. 1996;84(6):843–851. doi: 10.1016/S0092-8674(00)81063-6.