2012 Jan 25;13 Suppl 1(Suppl 1):S9.
doi: 10.1186/1471-2105-13-S1-S9.

Federated ontology-based queries over cancer data

Alejandra González-Beltrán et al. BMC Bioinformatics.

Abstract

Background: Personalised medicine provides patients with treatments that are specific to their genetic profiles. It requires efficient sharing of disparate data types across a variety of scientific disciplines, such as molecular biology, pathology, radiology and clinical practice. Personalised medicine aims to offer the safest and most effective therapeutic strategy based on the gene variations of each subject. This is particularly relevant in oncology, where knowledge about genetic mutations has already led to new therapies. Current molecular biology techniques (microarrays, proteomics, epigenetic technology and improved DNA sequencing technology) enable better characterisation of cancer tumours. However, the vast amounts of data, coupled with the use of different terms in each discipline (semantic heterogeneity), make the retrieval and integration of information difficult.

Results: Existing software infrastructures for data sharing in the cancer domain, such as caGrid, support access to distributed information. caGrid follows a service-oriented, model-driven architecture. Each data source in caGrid is associated with metadata at increasing levels of abstraction, including syntactic, structural, reference and domain metadata. The domain metadata consists of ontology-based annotations associated with the structural information of each data source. However, caGrid's current querying functionality operates at the structural metadata level, without capitalising on the ontology-based annotations. This paper presents the design of, and theoretical foundations for, distributed ontology-based queries over cancer research data. Concept-based queries are reformulated into the target query language, and join conditions between multiple data sources are found by exploiting the semantic annotations. The system has been implemented, as a proof of concept, over the caGrid infrastructure. The approach is applicable to other model-driven architectures. A graphical user interface has been developed, supporting ontology-based queries over caGrid data sources. An extensive evaluation of the query reformulation technique is included.
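To illustrate how join conditions between data sources might be derived from semantic annotations, the following is a minimal sketch, not the paper's algorithm. It assumes that two attributes from different data services are join candidates when their annotations share at least one ontology concept. The class `AnnotatedAttribute`, the function `candidate_joins`, the UML class and attribute names, and the concept identifiers are all hypothetical; only the service names caBIO and PIR come from the figure captions.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AnnotatedAttribute:
    """A UML attribute together with its ontology annotations (hypothetical type)."""
    service: str          # name of the data service exposing the attribute
    uml_class: str        # UML class that owns the attribute
    attribute: str        # attribute name
    concepts: frozenset   # identifiers of the annotating ontology concepts


def candidate_joins(left, right):
    """Return (a, b, shared) triples for attributes of different services
    whose annotation sets overlap. Assumption: a shared concept suggests
    a possible join condition."""
    joins = []
    for a, b in product(left, right):
        shared = a.concepts & b.concepts
        if a.service != b.service and shared:
            joins.append((a, b, shared))
    return joins


# Placeholder data: class names, attribute names and concept ids are invented.
cabio = [AnnotatedAttribute("caBIO", "Gene", "symbol",
                            frozenset({"CONCEPT_GENE_SYMBOL"}))]
pir = [
    AnnotatedAttribute("PIR", "Protein", "geneName",
                       frozenset({"CONCEPT_GENE_SYMBOL", "CONCEPT_NAME"})),
    AnnotatedAttribute("PIR", "Protein", "sequence",
                       frozenset({"CONCEPT_SEQUENCE"})),
]
joins = candidate_joins(cabio, pir)
```

In this toy example only the `symbol` and `geneName` attributes are paired, because they are the only ones annotated with a common concept.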

Conclusions: To support personalised medicine in oncology, it is crucial to retrieve and integrate molecular, pathology, radiology and clinical data in an efficient manner. The semantic heterogeneity of the data makes this a challenging task. Ontologies provide a formal framework to support querying and integration. This paper provides an ontology-based solution for querying distributed databases over service-oriented, model-driven infrastructures.

Figures

Figure 1
caBIG® semantic infrastructure core services. Figure 1: caGrid core services and their corresponding APIs, matched with the different levels of the metadata hierarchy. At the syntactic level, caGrid uses XML Schemas to describe the data types shared on the grid. These schemas are maintained in the Global Model Exchange, a service acting as an XML schema registry. The structural metadata consists of UML models, which can be accessed using the caGrid Discovery API. A metadata registry, based on the ISO/IEC 11179 standard, is used to manage common data elements (CDEs). This registry, called caDSR, can be accessed through its own API. A CDE is composed of an object class, a property and a value domain. These components correspond to a UML class, a UML attribute and the attribute's data type, respectively, and each is associated with a set of concepts from an ontology. These mappings between structural elements and concepts constitute the reference metadata. The concepts are part of the domain metadata and, in caBIG®, mainly belong to the NCI thesaurus ontology. The LexEVS API provides access to the available terminologies.
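The composition of a CDE described in this caption can be sketched as a small data structure. This is an illustrative model only, not the caDSR API: the type names `AnnotatedElement` and `CommonDataElement`, the method `reference_metadata`, and the example names and concept identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedElement:
    """A structural element plus the ontology concepts annotating it."""
    name: str
    concepts: tuple = ()   # concept identifiers (placeholders in this sketch)


@dataclass(frozen=True)
class CommonDataElement:
    """A CDE: object class, property and value domain."""
    object_class: AnnotatedElement   # corresponds to a UML class
    prop: AnnotatedElement           # corresponds to a UML attribute
    value_domain: AnnotatedElement   # corresponds to the attribute's data type

    def reference_metadata(self):
        """Map each structural element name to its annotating concepts."""
        parts = (self.object_class, self.prop, self.value_domain)
        return {part.name: part.concepts for part in parts}


# Placeholder example: the names and concept ids are invented for illustration.
cde = CommonDataElement(
    object_class=AnnotatedElement("Gene", ("CONCEPT_GENE",)),
    prop=AnnotatedElement("symbol", ("CONCEPT_SYMBOL",)),
    value_domain=AnnotatedElement("String", ("CONCEPT_STRING",)),
)
mapping = cde.reference_metadata()
```

Here `mapping` plays the role of the reference metadata: it links each structural element to concepts that belong to the domain metadata.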
Figure 2
caBIG® semantic infrastructure and semantic layer built in our COnQueST system. Figure 2: Different levels of metadata in the caBIG® semantic infrastructure shown for two data sources that share a common data element (CDE). The CDE is annotated with concepts from the NCI thesaurus ontology. The top part of the diagram (above the dotted line) shows the ontologies built in the COnQueST system to facilitate ontology-based queries over caBIG® data services.
Figure 3
DCQL Use Case. Figure 3: Sections of the UML models of the caBIO and PIR data services showing the classes annotated with concepts included in the second query use case. This diagram corresponds to a solution of the query reformulation process involving multiple data services.
Figure 4
Use Case. Figure 4: Section of the caBIO UML model representing the relationship between the SNP class, corresponding to single nucleotide polymorphisms, and the Chromosome class. This section of the UML model is relevant for the first query use case, where the solution involves a single target data service.
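Finding how two classes are related in a UML model can be viewed as a path search over the model's associations. The sketch below uses a plain breadth-first search as a stand-in; it is not the path-finding method evaluated in the paper. The function `shortest_path` and the `Gene` and `Taxon` classes and their associations are hypothetical; only the SNP and Chromosome classes are taken from the caption.

```python
from collections import deque


def shortest_path(associations, start, goal):
    """Breadth-first search over undirected UML associations.
    Returns the list of class names from start to goal, or None."""
    graph = {}
    for a, b in associations:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sort neighbours so the result is deterministic.
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Toy model: only the SNP-Chromosome association comes from the caption;
# the other classes and associations are invented for illustration.
associations = [
    ("SNP", "Chromosome"),
    ("Gene", "Chromosome"),
    ("Chromosome", "Taxon"),
]
direct = shortest_path(associations, "SNP", "Chromosome")
indirect = shortest_path(associations, "SNP", "Gene")
```

A path of length one, as in `direct`, corresponds to the single-service case described in the caption; longer paths arise when the queried classes are only indirectly associated.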
Figure 5
System Architecture. Figure 5: The caGrid service-oriented architecture (bottom part) extended with novel semantic services for the generation of ontologies and querying and a bespoke user interface (shown in the upper part of the diagram)
Figure 6
Screenshot of the browser tool in the COnQueST interface. Figure 6: The browser tool in the COnQueST interface: the upper left panel shows the list of available projects (or information models), the bottom left panel shows the concepts used to annotate the selected project, and the right panel shows the concept definitions, including links to the NCI thesaurus browser. All panels have a search facility: for instance, it is possible to search projects by name.
Figure 7
Screenshot of the search tool in the COnQueST interface. Figure 7: COnQueST search tool: when searching for a concept, the result includes the projects (or information models) with matching concepts as well as the concepts themselves. While the "search" button considers all the concepts containing the search criteria, the "I'm feeling lucky" button retrieves the concept that exactly matches the search criteria.
Figure 8
Screenshot of the query builder in the COnQueST interface. Figure 8: The COnQueST query builder tool allows the user to search the available concepts and to specify an association between them, either to indicate that a concept has a property specified by another concept or to indicate a specific value.
Figure 9
Screenshot of the results panel in the COnQueST interface. Figure 9: The query results panel shows a table listing the properties of each result object.
Figure 10
The path metrics. Figure 10: Three box plot diagrams showing path metrics for each subset of information models: caDSR, caGrid and Info Models. The path metrics considered are, from left to right: the longest path, the average number of nodes per path and the average number of paths per journey.
Figure 11
Ontology and modules, generation and inference times. Figure 11: The box plot diagram on the left shows the generation times for the NCIt module ontology and the annotated UML ontology for the three groups of information models (caDSR, caGrid and Info Models). The box plot diagram on the right depicts the inference times for the UML ontology using the HermiT and Pellet reasoners. Both diagrams use a logarithmic scale.
Figure 12
Query rewriting performance. Figure 12: Times taken in each stage of the query reformulation process (parsing, UML extraction, path finding, MCC conversion and CQL conversion) at varying path lengths.
Figure 13
Path finding performance. Figure 13: Path-finding times for varying numbers of explanations, ranging from 1 to 5. Each explanation generates a path.
Figure 14
Query reformulation stages. Figure 14: The stages of query rewriting for both single and multiple target data services are depicted in blue. The form of the query at each stage is shown in yellow, and the points of user interaction are shown in red.
Figure 15
Federated path finder. Figure 15: Processes involved in finding paths in the information models when dealing with queries over multiple data services.
