Chapter 9: Analyses using disease ontologies

Nigam H Shah¹, Tyler Cole, Mark A Musen

PMID: 23300417
PMCID: PMC3531278
DOI: 10.1371/journal.pcbi.1002827

Nigam H Shah et al. PLoS Comput Biol. 2012;8(12):e1002827. doi: 10.1371/journal.pcbi.1002827. Epub 2012 Dec 27.

Affiliation

¹ Center for Biomedical Informatics Research, Stanford University, Stanford, California, United States of America. nigam@stanford.edu

Abstract

Advanced statistical methods used to analyze high-throughput data such as gene-expression assays result in long lists of "significant genes." One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene set and is widely used to make sense of the results of high-throughput experiments. The canonical example of enrichment analysis is when the output dataset is a list of genes differentially expressed in some condition. To determine the biological relevance of a lengthy gene list, the usual solution is to perform enrichment analysis with the GO. We can aggregate the annotating GO concepts for each gene in this list and arrive at a profile of the biological processes or mechanisms affected by the condition under study. While GO has been the principal target for enrichment analysis, the methods of enrichment analysis are generalizable. We can conduct the same sort of profiling along other ontologies of interest. Just as scientists can ask "Which biological process is over-represented in my set of interesting genes or proteins?" we can also ask "Which disease (or class of diseases) is over-represented in my set of interesting genes or proteins?" For example, by annotating known protein mutations with disease terms from the ontologies in BioPortal, Mort et al. recently identified a class of diseases, blood coagulation disorders, that were associated with a 14-fold depletion in substitutions at O-linked glycosylation sites. With the availability of tools for automatic annotation of datasets with terms from disease ontologies, there is no reason to restrict enrichment analyses to the GO.

In this chapter, we will discuss methods to perform enrichment analysis using any ontology available in the biomedical domain. We will review the general methodology of enrichment analysis and the associated challenges, and discuss the novel translational analyses enabled by the existence of public, national computational infrastructure and by the use of disease ontologies in such analyses.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. An overview of the process to calculate enrichment of GO categories.**
The steps usually followed are: (1) Get annotations for each gene in the reference set and the set of interest. (2) Count the occurrence (n) of each GO term in the annotations of the genes comprising the set of interest. (3) Count the occurrence (m) of that same GO term in the annotations of the reference set. (4) Assess how “surprising” it is to find n, given m, M and N.
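
Step (4) is commonly carried out with a hypergeometric test. The following is a minimal sketch of that idea, not necessarily the computation used in the chapter. The caption does not define M and N, so the sketch assumes M is the number of genes in the reference set and N is the number of genes in the set of interest, and that the set of interest is drawn from the reference set. The function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n: int, m: int, N: int, M: int) -> float:
    """Upper-tail hypergeometric probability P(X >= n).

    Assumed meanings (M and N are not defined in the caption):
      M: genes in the reference set
      N: genes in the set of interest
      m: reference-set genes annotated with the term
      n: set-of-interest genes annotated with the term
    """
    if not (0 <= m <= M and 0 <= N <= M and 0 <= n <= min(m, N)):
        raise ValueError("inconsistent counts")
    total = comb(M, N)  # ways to draw N genes from M
    tail = sum(
        comb(m, k) * comb(M - m, N - k)  # draws with exactly k annotated genes
        for k in range(n, min(m, N) + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only.
p = hypergeom_enrichment_p(n=8, m=60, N=100, M=6000)
print(f"P(X >= 8) = {p:.3g}")
```

A small value suggests the term occurs in the set of interest more often than chance alone would explain; testing for under-representation would use the lower tail instead.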

**Figure 2. Workflow schematic of enrichment analysis.**
If the input set has only textual annotations, we first run the Annotator service to create ontology-term annotations. The annotation counts in the input set are first aggregated along the ontology hierarchy and then compared with a background set for a statistically significant difference in the frequency of each ontology term. If a significant difference in the term frequency is found, that term is called “enriched” in the input set of entities. The results of the analysis are returned either as a tag-cloud, a graph, or as an XML output that users can process as required.
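
The aggregation step can be sketched as follows: each entity is counted once for every term it is annotated with and once for every ancestor of those terms. This is an illustrative sketch under assumptions, not the service's implementation; the function names, the child-to-parents mapping, and the toy terms and genes are all hypothetical.

```python
from collections import Counter


def ancestors(term, parents):
    """Return every ancestor of `term`, given a child -> list-of-parents mapping."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def aggregate_counts(annotations, parents):
    """Count, per term, the entities annotated with it or with any descendant."""
    counts = Counter()
    for terms in annotations.values():
        closure = set()
        for term in terms:
            closure.add(term)
            closure |= ancestors(term, parents)
        counts.update(closure)  # each entity contributes at most once per term
    return counts


# Toy hierarchy and annotations, invented for illustration.
parents = {
    "hemophilia": ["blood coagulation disorder"],
    "blood coagulation disorder": ["hematologic disease"],
}
annotations = {
    "geneA": ["hemophilia"],
    "geneB": ["blood coagulation disorder"],
    "geneC": ["hemophilia", "blood coagulation disorder"],
}
counts = aggregate_counts(annotations, parents)
print(counts["blood coagulation disorder"])
```

Building a set per entity before counting avoids double counting when an entity is annotated with both a term and one of its ancestors.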

**Figure 3. Tag cloud output: An example for the annotations of grants from FY1981 using SNOMEDCT.**
Blue denotes low-frequency terms and red denotes highly frequent terms. Many concepts, such as “neoplasm of digestive tract”, occur at high frequencies in most years, possibly denoting the constant focus on cancer research. An appropriate background term frequency distribution is necessary to determine significance of the high frequency.

**Figure 4. The figure shows a visualization generated using the GO TermFinder tool.**
The GO graph layout shows the significantly enriched GO terms in the annotations of the analyzed gene set. The color of the nodes is an indication of their Bonferroni-corrected P-value (orange ≤ 1e-10; yellow 1e-10 to 1e-8; green 1e-8 to 1e-6; cyan 1e-6 to 1e-4; blue 1e-4 to 1e-2; tan > 0.01).
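
To make the color scale concrete, here is a hedged sketch that applies a standard Bonferroni correction (multiply each P-value by the number of tests, capped at 1) and maps the result to the bins listed in the caption. It is not GO TermFinder's code. The function names are hypothetical, and treating each bin's upper bound as inclusive is an assumption, since the caption does not say which side the boundaries fall on.

```python
def bonferroni(p_values):
    """Bonferroni correction: scale each P-value by the number of tests, cap at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


# (upper bound, color) pairs taken from the caption; bounds assumed inclusive.
COLOR_BINS = [
    (1e-10, "orange"),
    (1e-8, "yellow"),
    (1e-6, "green"),
    (1e-4, "cyan"),
    (1e-2, "blue"),
]


def node_color(corrected_p):
    """Return the caption's color for a Bonferroni-corrected P-value."""
    for upper, color in COLOR_BINS:
        if corrected_p <= upper:
            return color
    return "tan"  # corrected P-value above 0.01


# Hypothetical raw P-values for three terms.
raw = [2e-12, 3e-7, 0.2]
colors = [node_color(p) for p in bonferroni(raw)]
print(colors)
```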

**Figure 5. Workflow for generating background annotation sets for enrichment analysis.**
We obtain a set of PubMed articles from manually curated GO annotations, which we process using the NCBO Annotator service.

**Figure 6. Disease terms significantly enriched in annotations of aging-related genes.**
The tag cloud shows those disease terms in the annotations of the 261 aging-related genes that are statistically enriched given our gene–disease background annotation dataset. Terms that are significantly enriched appear larger. We used a binomial test to detect enriched disease terms in the aging-related gene set. Note that mis-annotated terms (such as Recruitment) and non-informative terms (such as Disease) are not deemed enriched by the statistical analysis.
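
A binomial test of this kind can be sketched as below. It asks how likely it is to see at least n annotated genes among N when each gene carries the term with the background frequency p0. This is a minimal illustration, not the authors' exact procedure. The function name is hypothetical; the gene-set size of 261 comes from the caption, while the observed count and the background frequency are made up.

```python
from math import comb


def binomial_enrichment_p(n: int, N: int, p0: float) -> float:
    """Upper-tail binomial probability P(X >= n) for X ~ Binomial(N, p0).

    n:  genes in the set of interest annotated with the term
    N:  genes in the set of interest
    p0: fraction of background genes annotated with the term
    """
    if not (0 <= n <= N) or not (0.0 <= p0 <= 1.0):
        raise ValueError("inconsistent inputs")
    tail = sum(
        comb(N, k) * p0**k * (1.0 - p0) ** (N - k)
        for k in range(n, N + 1)
    )
    return min(1.0, tail)  # guard against floating-point overshoot


# 261 is the gene-set size from the caption; 12 and 0.01 are invented values.
p = binomial_enrichment_p(n=12, N=261, p0=0.01)
print(f"P(X >= 12) = {p:.3g}")
```

Under this test, a very common background term such as Disease yields a large p0, so even a high raw count in the gene set need not be significant.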

Grants and funding

U54 HG004028/HG/NHGRI NIH HHS/United States
