BMC Genomics. 2008 Oct 21;9:495.

doi: 10.1186/1471-2164-9-495.

Prosecutor: parameter-free inference of gene function for prokaryotes using DNA microarray data, genomic context and multiple gene annotation sources

Evert Jan Blom¹, Rainer Breitling, Klaas Jan Hofstede, Jos B T M Roerdink, Sacha A F T van Hijum, Oscar P Kuipers

PMID: 18939968
PMCID: PMC2585105
DOI: 10.1186/1471-2164-9-495

Evert Jan Blom et al. BMC Genomics. 2008.

Affiliation

¹ Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, the Netherlands. e.j.blom@rug.nl

Abstract

Background: Despite a plethora of functional genomic efforts, the function of many genes in sequenced genomes remains unknown. The increasing amount of microarray data for many species allows employing the guilt-by-association principle to predict function on a large scale: genes exhibiting similar expression patterns are more likely to participate in shared biological processes.

Results: We developed Prosecutor, an application that enables researchers to rapidly infer gene function based on available gene expression data and functional annotations. Our parameter-free functional prediction method uses a sensitive algorithm to achieve a high association rate of linking genes with unknown function to annotated genes. Furthermore, Prosecutor utilizes additional biological information such as genomic context and known regulatory mechanisms that are specific for prokaryotes. We analyzed publicly available transcriptome data sets and used literature sources to validate putative functions suggested by Prosecutor. We supply the complete results of our analysis for 11 prokaryotic organisms on a dedicated website.

Conclusion: The Prosecutor software and supplementary datasets available at http://www.prosecutor.nl allow researchers working on any of the analyzed organisms to quickly identify the putative functions of their genes of interest. A de novo analysis allows new organisms to be studied.

Figures

**Figure 1**
**Flowchart of Prosecutor**. Flowchart of the functional prediction process in Prosecutor. First, the expression profiles from DNA microarrays (1A) are used to create a correlation matrix (1B). For every gene, the correlations with the remaining genes are retrieved from the correlation matrix and sorted (1B2). The sorted gene list is used to perform an iterative Group Analysis for every functional category (1B3). The resulting p-value indicates how strongly the gene is predicted to be a member of a functional category (1C). At this step, the regular iGBA process ends. To also assess the reliability of each prediction, the following steps are added. The complete list of p-values for every functional category is sorted (1C4), after which the positions of the members of the functional category are determined (1C5). These positions are used to create ROC curves (1D; see Results section for more information concerning ROC curves). The corresponding Area Under the ROC Curve (AUC) is then used as a measure of the expression coherence of a functional category.
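
The first half of this flowchart (steps 1A to 1C) can be illustrated with a minimal Python sketch. It is not the Prosecutor implementation: the caption does not specify the statistics, so the sketch assumes Pearson correlation for the correlation matrix and assumes that the iterative Group Analysis scores a category by the smallest hypergeometric tail probability found over prefixes of the sorted gene list. All function names, gene names, and expression values are hypothetical.

```python
from math import comb, sqrt

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def hypergeom_tail(k, n_total, n_members, n_drawn):
    """P(X >= k) when n_drawn genes are taken from n_total genes,
    n_members of which belong to the category."""
    upper = min(n_members, n_drawn)
    hits = sum(
        comb(n_members, i) * comb(n_total - n_members, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return hits / comb(n_total, n_drawn)

def min_prefix_pvalue(ranked_genes, category):
    """Assumed group-analysis score: the smallest tail probability seen
    at any list position where a category member occurs."""
    n_total = len(ranked_genes)
    n_members = sum(1 for g in ranked_genes if g in category)
    best, seen = 1.0, 0
    for pos, gene in enumerate(ranked_genes, start=1):
        if gene in category:
            seen += 1
            best = min(best, hypergeom_tail(seen, n_total, n_members, pos))
    return best

def score_gene(gene, profiles, category):
    """Sort all other genes by correlation with `gene` (step 1B2),
    then score the category on that sorted list (step 1B3)."""
    others = [g for g in profiles if g != gene]
    ranked = sorted(
        others,
        key=lambda g: pearson(profiles[gene], profiles[g]),
        reverse=True,
    )
    return min_prefix_pvalue(ranked, category - {gene})

# Hypothetical toy data: one unannotated gene and six annotated genes.
profiles = {
    "geneX": [1, 2, 3, 4, 5],
    "a1": [2, 4, 6, 8, 10],
    "a2": [1, 2, 3, 4, 6],
    "b1": [5, 4, 3, 2, 1],
    "b2": [5, 3, 4, 1, 2],
    "c1": [1, 3, 2, 5, 4],
    "c2": [3, 1, 4, 1, 5],
}
category_a = {"a1", "a2"}  # co-expressed with geneX
category_b = {"b1", "b2"}  # anti-correlated with geneX

p_a = score_gene("geneX", profiles, category_a)
p_b = score_gene("geneX", profiles, category_b)
print(p_a, p_b)
```

In this toy case the members of `category_a` occupy the top of the sorted list for `geneX`, so that category receives the smaller p-value, while `category_b` sits at the bottom of the list and receives no support.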

**Figure 2**
**Schematic overview of the additional information provided by Prosecutor**. Various layers of information are supplied for the iGBA results (2A) from Prosecutor. Predicted functional assignments for genes whose operon members are already linked to the predicted function are indicated in the results (2B). The same protocol is followed for divergent genes that share the same upstream region (in this example *pps* and *ydiA*). The operon information that is used for the genomic context analysis is also used to detect known regulatory sequences for transcriptional modules (2C). Lastly, a graph is used to visualize the gene redundancy of the different functional assignments of Prosecutor (2D). Nodes in the graph represent functional categories and genes. Arrows represent membership of gene nodes in a functional category node as well as the putative functional prediction of the studied gene. The members of individual categories are placed in colored aggregates. In addition to the aggregates, a colored square is placed in each gene member of a category. The squares are colored using the colors of their matching aggregates, so that members of different categories can easily be distinguished. An example of a functional prediction found by Prosecutor for *ydiE* from *E. coli* is shown. The expression of this gene was correlated with members of various functional categories involved in the uptake of iron. In addition to the functional association with the transcriptional module Fur, the upstream region of *ydiE* also contains a putative Fur DNA binding site.

**Figure 3**
**Prediction ability of four annotation sources**. Histograms of ROC areas (Area Under the Curve, AUC) for four annotation sources for *E. coli* based on 305 microarrays (3A) compared to randomized results (3B). The real data reveal a large number of categories with AUC values larger than 0.8, which are almost absent in the randomized results. These categories are the most promising candidates for which the iGBA approach will enable confident functional predictions. Analysis of the AUC distribution across the annotation sources shows that the "transcription module" annotation source is the most informative, i.e., it contains the largest number of categories exceeding an AUC value of 0.9 (3A). This is intuitively convincing, as shared transcriptional regulation is the basis of coexpression. In addition to ROC areas for all GO terms, we have also analyzed the distribution of ROC areas for the GO annotation source using the "gold standard" [28]. This proposed "gold standard" (GS) consists of a trusted set of biological processes, selected by a panel of biology experts, that maps proteins to well-defined functional classes for evaluating predictions. We have included AUC results for the GO annotation for *E. coli* using the GS. The distributions of relative occurrences for the GS analysis and for the analysis using a fixed member cutoff are comparable.
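
The comparison between real and randomized AUC values can be made concrete with a short Python sketch. It assumes that the AUC of a category is computed from the positions of its members in the sorted p-value list via the rank-sum (Mann-Whitney) identity, and that a randomized result is obtained by placing the members at random positions. The caption does not state how the randomization was done, so both the procedure and the function names here are assumptions for illustration.

```python
import random

def auc_from_positions(member_positions, n_genes):
    """AUC from the 1-based positions of category members in a list
    sorted from best to worst p-value (no ties assumed).
    1.0 means all members come first, 0.0 means all come last."""
    m = len(member_positions)
    n_non = n_genes - m
    if m == 0 or n_non == 0:
        raise ValueError("need at least one member and one non-member")
    # Count of (member, non-member) pairs in which the non-member ranks first.
    misordered = sum(member_positions) - m * (m + 1) // 2
    return 1.0 - misordered / (m * n_non)

def randomized_aucs(n_genes, n_members, n_rounds, seed=0):
    """AUC values for categories whose members sit at random positions
    (an assumed stand-in for the randomized results in panel 3B)."""
    rng = random.Random(seed)
    positions = range(1, n_genes + 1)
    return [
        auc_from_positions(rng.sample(positions, n_members), n_genes)
        for _ in range(n_rounds)
    ]

# A coherent category: members concentrated near the top of the list.
coherent = auc_from_positions([1, 2, 4, 5, 9], n_genes=100)

# Random categories of the same size scatter around 0.5.
null = randomized_aucs(n_genes=100, n_members=5, n_rounds=500)
share_above_08 = sum(a > 0.8 for a in null) / len(null)
print(coherent, share_above_08)
```

Under this assumption, a category with coherent expression reaches an AUC close to 1, whereas random categories scatter around 0.5 and only rarely exceed 0.8, which matches the qualitative pattern described for panels 3A and 3B.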

**Figure 4**
**Prediction ability of two annotation sources for yeast**. Histograms of ROC areas (Area Under the Curve) for two annotation sources (Gene Ontology and metabolic pathways) for *S. cerevisiae* based on 1079 datasets from the Stanford Microarray Database (4A) compared to randomized results (4B). The real data reveal a large number of categories with AUC values larger than 0.8, which are almost absent in the randomized results. These categories are the most promising candidates for which the iGBA approach will enable confident functional predictions.

Cited by

An ontology for microbial phenotypes.
Chibucos MC, Zweifel AE, Herrera JC, Meza W, Eslamfam S, Uetz P, Siegele DA, Hu JC, Giglio MG. BMC Microbiol. 2014 Nov 30;14:294. doi: 10.1186/s12866-014-0294-3. PMID: 25433798. Free PMC article.
Discriminative local subspaces in gene expression data for effective gene function prediction.
Puelma T, Gutiérrez RA, Soto A. Bioinformatics. 2012 Sep 1;28(17):2256-64. doi: 10.1093/bioinformatics/bts455. Epub 2012 Jul 20. PMID: 22820203. Free PMC article.
Advances in human papillomavirus detection for cervical cancer screening and diagnosis: challenges of conventional methods and opportunities for emergent tools.
Fashedemi O, Ozoemena OC, Peteni S, Haruna AB, Shai LJ, Chen A, Rawson F, Cruickshank ME, Grant D, Ola O, Ozoemena KI. Anal Methods. 2025 Feb 13;17(7):1428-1450. doi: 10.1039/d4ay01921k. PMID: 39775553. Free PMC article. Review.
A Fast and Reliable Pipeline for Bacterial Transcriptome Analysis Case study: Serine-dependent Gene Regulation in Streptococcus pneumoniae.
Afzal M, Manzoor I, Kuipers OP. J Vis Exp. 2015 Apr 25;(98):52649. doi: 10.3791/52649. PMID: 25938895. Free PMC article.

References

1. Friedberg I. Automated protein function prediction-the genomic challenge. Brief Bioinform. 2006;7:225–242. doi: 10.1093/bib/bbl004.
2. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crécy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, Pusch GD, Rodionov DA, Rückert C, Steiner J, Stevens R, Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–5702. doi: 10.1093/nar/gki866.
3. Huynen M, Snel B, Lathe W, Bork P. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 2000;10:1204–1210. doi: 10.1101/gr.10.8.1204.
4. Wu J, Hu Z, DeLisi C. Gene annotation and network inference by phylogenetic profiling. BMC Bioinformatics. 2006;7:80. doi: 10.1186/1471-2105-7-80.
5. Wu H, Su Z, Mao F, Olman V, Xu Y. Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucleic Acids Res. 2005;33:2822–2837. doi: 10.1093/nar/gki573.
