BMC Bioinformatics. 2007 Sep 11;8:332.
doi: 10.1186/1471-2105-8-332.

How to decide which are the most pertinent overly-represented features during gene set enrichment analysis

Roland Barriot et al. BMC Bioinformatics.

Abstract

Background: The search for enriched features has become widely used to characterize a set of genes or proteins. A key aspect of this technique is its ability to identify correlations amongst heterogeneous data such as Gene Ontology annotations, gene expression data and genome location of genes. Despite the rapid growth of available data, very little has been proposed in terms of formalization and optimization. Additionally, current methods mainly ignore the structure of the data, which leads to redundant results. For example, when searching for enrichment in GO terms, genes can be annotated with multiple GO terms, and these annotations should be propagated to the more general terms in the Gene Ontology. Consequently, the gene sets often overlap partially or totally, so the reported enriched GO terms are both numerous and redundant, overwhelming the researcher with non-pertinent information. This situation is not unique: it arises whenever some hierarchical clustering is performed (e.g. based on gene expression profiles), the extreme case being when genes that are neighbors on the chromosomes are considered.

Results: We present a generic framework to efficiently identify the most pertinent over-represented features in a set of genes. We propose a formal representation of gene sets based on the theory of partially ordered sets (posets), and give a formal definition of target set pertinence. Algorithms and compact representations of target sets are provided for the generation and the evaluation of the pertinent target sets. The relevance of our method is illustrated through the search for enriched GO annotations in the proteins involved in a multiprotein complex. The results obtained demonstrate the gain in terms of pertinence (up to 64% redundancy removed), space requirements (up to 73% less storage) and efficiency (up to 98% fewer comparisons).

Conclusion: The generic framework presented in this article provides a formal approach to adequately represent available data and efficiently search for pertinent over-represented features in a set of genes or proteins. The formalism and the pertinence definition can be directly used by most of the methods and tools currently available for feature enrichment analysis.

Figures

Figure 1
The processing of a query in a feature enrichment search engine. (a) a query set is submitted to search for similar sets in (b) a set of target sets. Sets can include each other: this is represented by a graph in which nodes represent sets and edges indicate the inclusion of a set into another. (c) the query set is compared to all the target sets based on a similarity model. (d) target sets found similar are returned ordered by decreasing similarity.
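The four steps of this caption can be sketched in a few lines. This is a minimal illustration only: the Jaccard index is an assumed stand-in for the similarity model (the text above does not name one), and `search`, `threshold` and the set names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index, used here only as a stand-in similarity model."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(query, targets, similarity=jaccard, threshold=0.0):
    """Compare the query to every target set; return hits by decreasing similarity."""
    scored = [(name, similarity(query, members)) for name, members in targets.items()]
    hits = [(name, score) for name, score in scored if score > threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# (b) a toy collection of target sets; T1 is included in T2
targets = {
    "T1": {"a", "b"},
    "T2": {"a", "b", "c", "d"},
    "T3": {"e", "f"},
}
# (a) the query set
query = {"a", "b", "c"}

# (c) and (d): compare, then rank
for name, score in search(query, targets):
    print(name, round(score, 3))
```

Note that this naive loop compares the query to every target set, which is the cost the framework described above aims to reduce.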
Figure 2
Examples of neighborhood relationships and target sets. (a) Expression profiles: target sets correspond to sets of genes having similar expression profiles, i.e. nodes of the binary tree resulting from a hierarchical clustering of the profiles. (b) Chromosome localization: sets of adjacent genes correspond to the nodes of an implicit lattice resulting from the order of the genes on the chromosome. (c) Gene Ontology annotations: target sets correspond to GO terms, i.e. genes are grouped in a set corresponding to a term when they are annotated with this term or a more specific one.
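As a rough sketch of how cases (b) and (c) give rise to target sets, the snippet below builds one set per GO term by propagating annotations to more general terms, and enumerates the sets of adjacent genes on a chromosome. The term names, gene names and function names are hypothetical, and the `parents` mapping is an assumed simplification of the ontology.

```python
def propagate_annotations(annotations, parents):
    """One target set per term: genes annotated with the term or a more specific one."""
    target_sets = {}
    for gene, terms in annotations.items():
        stack, seen = list(terms), set()
        while stack:
            term = stack.pop()
            if term in seen:
                continue
            seen.add(term)
            target_sets.setdefault(term, set()).add(gene)
            # climb to the more general terms
            stack.extend(parents.get(term, ()))
    return target_sets


def adjacent_sets(gene_order):
    """All runs of adjacent genes, given their order on the chromosome."""
    n = len(gene_order)
    return [gene_order[i:j] for i in range(n) for j in range(i + 1, n + 1)]


# Hypothetical toy ontology: term -> more general terms
parents = {
    "dna_repair": ["dna_metabolism"],
    "dna_replication": ["dna_metabolism"],
    "dna_metabolism": ["biological_process"],
}
annotations = {
    "geneA": ["dna_repair"],
    "geneB": ["dna_replication"],
    "geneC": ["dna_metabolism"],
}

go_sets = propagate_annotations(annotations, parents)
windows = adjacent_sets(["g1", "g2", "g3", "g4"])
print(go_sets)
print(len(windows))
```

In this toy example the two most general terms end up with identical gene sets, which is the kind of overlap that makes the reported enriched terms redundant.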
Figure 3
Illustration of the pertinence rules. Rule 2 selects the smaller sets among those containing the same elements in common (fewer differing elements): T1 and T4 do not satisfy Rule 2 because there exist smaller sets (T2 and T5) containing the same elements in common. Rule 3 selects the bigger sets among those containing the same differing elements: T3 and T6 do not satisfy Rule 3 because there exist bigger sets (T2 and T5) containing the same differing elements. T2 and T5 are pertinent.
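A brute-force reading of Rules 2 and 3 might look like the sketch below. It rests on several assumptions not confirmed by the caption: "smaller" and "bigger" are taken to mean proper subset and proper superset, "differing elements" are taken to be the elements of a target set that are absent from the query, and the first check (at least one element in common) is a guess at Rule 1. The function name and toy sets are hypothetical.

```python
def classify(query, targets):
    """Label each target set as pertinent or name the rule it fails (assumed reading)."""
    common = {name: members & query for name, members in targets.items()}
    differing = {name: members - query for name, members in targets.items()}
    verdicts = {}
    for name, members in targets.items():
        others = [o for o in targets if o != name]
        if not common[name]:
            # assumed Rule 1: nothing shared with the query
            verdicts[name] = "no common element"
        elif any(targets[o] < members and common[o] == common[name] for o in others):
            # a proper subset shares the same common elements
            verdicts[name] = "fails Rule 2"
        elif any(targets[o] > members and differing[o] == differing[name] for o in others):
            # a proper superset has the same differing elements
            verdicts[name] = "fails Rule 3"
        else:
            verdicts[name] = "pertinent"
    return verdicts


# Toy chain T3 < T2 < T1, loosely mirroring one half of the figure
query = {"a", "b"}
targets = {
    "T1": {"a", "b", "x", "y"},
    "T2": {"a", "b", "x"},
    "T3": {"a", "x"},
}
verdicts = classify(query, targets)
print(verdicts)
```

This version compares every pair of target sets, so it is quadratic in the number of sets; it is meant to make the rules concrete, not to reproduce the algorithms of the article.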
Figure 4
Algorithm 1. Identification of pertinent target sets in the DAG of a neighborhood (see figure 1b) and their comparison to a given query set. It is a slightly modified version of a multiple-source breadth-first search.
Figure 5
Pertinence depends on query elements. Only a small fraction of all the target sets may be pertinent for a given query: (a) sets that have no elements in common with the query are not pertinent. (b) large sets (much bigger than the query) having elements in common with the query are not pertinent because they contain too many differing elements. (c) small target sets that have elements in common with the query may be pertinent.
Figure 6
Generic compact representation of a neighborhood. Instead of explicitly storing all the set compositions as illustrated in figure 1b, only the size of the set corresponding to a node is stored. Nodes corresponding to singleton sets (leaf nodes) are labeled with the element of the set. The set corresponding to a particular node can be generated on the fly by searching the leaf nodes reachable from this node. During the bottom-up search of pertinent target sets, elements in common with the query set Q = {b, g} are propagated. i1..i7 correspond to iterations of the main loop (order in which the nodes are processed).
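The idea of regenerating a set from the leaves reachable from its node can be sketched as follows. The `children` mapping, the node names and the `members` helper are hypothetical; leaves are assumed to be the nodes without an entry in `children`, and each leaf name doubles as the element it holds.

```python
def members(node, children):
    """Collect the leaf elements reachable from a node of the DAG."""
    found, seen, stack = set(), set(), [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        kids = children.get(current)
        if kids:
            stack.extend(kids)
        else:
            # no children: this is a singleton (leaf) node
            found.add(current)
    return found


# Toy DAG: the leaf "b" is shared by two internal nodes
children = {
    "n1": ["a", "b"],
    "n2": ["b", "c"],
    "root": ["n1", "n2"],
}

# Only the sizes need to be stored for internal nodes
sizes = {node: len(members(node, children)) for node in children}
print(sizes)
print(sorted(members("root", children)))
```

Storing one integer per internal node instead of the full set composition is what keeps the representation compact; the composition is recomputed only when it is actually needed.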
Figure 7
Algorithm 2. Identification of pertinent target sets in the generic compact representation (see example in figure 6) and their comparison to a given query set. The node data structure has the following fields: tag for the potentially pertinent tag, children and parent that point to child and parent nodes, common which is the set of elements in common with the query, #differing for the number of elements differing from the query set, size for the size of the set this node represents.
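The node fields listed in this caption could be laid out as in the sketch below, together with a simple bottom-up propagation of the common elements. This is not Algorithm 2 itself, whose details are not given here: the propagation loop, the use of a `parents` list (since a DAG node may have several parents), the meaning given to `tag`, and the relation `differing = size - len(common)` are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    name: str
    size: int                      # size of the set this node represents
    children: list = field(default_factory=list, repr=False)
    parents: list = field(default_factory=list, repr=False)
    common: set = field(default_factory=set)   # elements shared with the query
    differing: int = 0             # assumed: size - len(common)
    tag: bool = False              # assumed: set once the node is reached


def link(parent, child):
    parent.children.append(child)
    child.parents.append(parent)


def propagate_common(leaves, query):
    """Push query elements upward, starting from the leaves found in the query."""
    queue = deque()
    for leaf in leaves:
        if leaf.name in query:
            leaf.common = {leaf.name}
            queue.append(leaf)
    while queue:
        node = queue.popleft()
        node.tag = True
        node.differing = node.size - len(node.common)
        for parent in node.parents:
            before = len(parent.common)
            parent.common |= node.common
            if len(parent.common) > before:
                # revisit the parent only when it gained elements
                queue.append(parent)


# Toy neighborhood with query Q = {b, g}
a, b, c, g = (Node(x, 1) for x in "abcg")
n1, n2, root = Node("n1", 2), Node("n2", 2), Node("root", 4)
for parent, child in [(n1, a), (n1, b), (n2, c), (n2, g), (root, n1), (root, n2)]:
    link(parent, child)

propagate_common([a, b, c, g], {"b", "g"})
for node in (n1, n2, root):
    print(node.name, sorted(node.common), node.differing)
```

Nodes that share nothing with the query (here the leaves `a` and `c`) are never visited, which matches the observation that only a fraction of the target sets needs to be examined for a given query.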
Figure 8
Part of the Gene Ontology DAG containing hits listed in table 4. Partial view of the DAG of the Gene Ontology biological processes branch concerning hits of complex 440.30.10 of the MIPS listed in table 4. Pertinent terms are in bold, terms that are non-pertinent due to Rule 2 are in regular type and terms that are non-pertinent due to Rule 3 are in italic. The number of proteins in the complex annotated with the term is given in parentheses, together with the total number of proteins annotated with this term (target set size).

