. 2008 Aug 13;3(8):e2946.

doi: 10.1371/journal.pone.0002946.

A Semantic Web management model for integrative biomedical informatics

Helena F Deus¹, Romesh Stanislaus, Diogo F Veiga, Carmen Behrens, Ignacio I Wistuba, John D Minna, Harold R Garner, Stephen G Swisher, Jack A Roth, Arlene M Correa, Bradley Broom, Kevin Coombes, Allen Chang, Lynn H Vogel, Jonas S Almeida

Affiliations

PMID: 18698353
PMCID: PMC2491554
DOI: 10.1371/journal.pone.0002946

A Semantic Web management model for integrative biomedical informatics

Helena F Deus et al. PLoS One. 2008.

. 2008 Aug 13;3(8):e2946.

doi: 10.1371/journal.pone.0002946.

Authors

Affiliation

¹ Department of Bioinformatics and Computational Biology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, United States of America.

PMID: 18698353
PMCID: PMC2491554
DOI: 10.1371/journal.pone.0002946

Abstract

Background: Data, data everywhere. The diversity and magnitude of the data generated in the Life Sciences defies automated articulation among complementary efforts. The additional need in this field for managing property and access permissions compounds the difficulty very significantly. This is particularly the case when the integration involves multiple domains and disciplines, even more so when it includes clinical and high throughput molecular data.

Methodology/principal findings: The emergence of Semantic Web technologies brings the promise of meaningful interoperation between data and analysis resources. In this report we identify a core model for biomedical Knowledge Engineering applications and demonstrate how this new technology can be used to weave a management model where multiple intertwined data structures can be hosted and managed by multiple authorities in a distributed management infrastructure. Specifically, the demonstration is performed by linking data sources associated with the Lung Cancer SPORE awarded to The University of Texas MD Anderson Cancer Center at Houston and the Southwestern Medical Center at Dallas. A software prototype, available with open source at www.s3db.org, was developed and its proposed design has been made publicly available as an open source instrument for shared, distributed data management.

Conclusions/significance: The Semantic Web technologies have the potential to addresses the need for distributed and evolvable representations that are critical for systems Biology and translational biomedical research. As this technology is incorporated into application development we can expect that both general purpose productivity software and domain specific software installed on our personal computers will become increasingly integrated with the relevant remote resources. In this scenario, the acquisition of a new dataset should automatically trigger the delegation of its analysis.

PubMed Disclaimer

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Three generations of design patterns for web-based applications.**
The original design (“1.0”) consists of collections of hypertext documents that are syntactically (dashed lines) interoperable (traversing between them by clicking on the links), regardless of the domain content. The user centric web 2.0 applications use internal representations of the external data structures. This representation is asynchronously updated from the reference resources which are now free to have a specialized interoperation between domain contents. An example of this approach is that followed by AJAX-based interfaces. Finally, the ongoing emergence of the semantic web promises to produce service oriented systems that are semantically interoperable such that the interface application reacts to domains of knowledge specifically. At this level all applications tend to be web-interoperable with peer-to-peer architectures complementing the client-server design of w1.0 and w2.0.

**Figure 2. Evolution of formats for individual datasets.**
Hexagons, rectangles and small circles indicate data elements, respectively, attributes, their values, and relations. First, flat file formats such as fasta or the GeneBank data model were proposed to collect attribute-value pairs about an individual data entry. The use of tagging by extended markup languages (XML) allowed for the embedding of additional detail and further definition of the nature of the hierarchical structure between data elements. More recently, the resource description framework (RDF) further generalized the XML tree structure into that of a network where the relationship between resources (nodes) is a resource itself. Furthermore, the referencing of each resource by a unique identifier (URI) implies that the data elements can be distributed between distinct documents or even locations.

Figure 3. Illustration of the desirable functionality: distinct users, with identities (solid icon) managed in distinct S3DB deployments (circular compartments), which they control separately, share a distributed and overlapping data structure (arrows between symbols) that they also manage independently: some data elements are shared (mixed color symbols) others are not.
This will require the identity verification to propagate between deployments peer-to-peer (P2P, dotted lines), including to deployments where neither user maintains an identity (dotted circular compartment). This is in contrast with the conventional approach of having distinct users manage insular deployments with permissions managed at the access point level.

**Figure 4. Core model developed for S3DB (supported by version 3.0 onwards).**
This diagram can be read starting from the most fundamental data unit, the Attribute-Value pair (filled hexagonal and square symbols). Each element of the pair is object of two distinct triples, one describing the domain of discourse, the *Rules*, and the other made of *Statements* where that domain is populated to instantiate relationships between entities. The latter includes the actual Values. Surrounding these two nuclear collection of triples, is the resolution of *Collection* and its instantiation as *Item* that define the relationship between the individual elements of *Rules* and *Statements*. The resulting structure is then organized in *Projects* in such a way that the domain of discourse can nevertheless be shared with other *Projects*, in the same or in a distinct deployment of S3DB. Finally, a propagation of user permissions (dashed line) is defined such that the distribution of the data structures can be traced. See text for a more detailed description.

**Figure 5. Snapshots of interfaces using S3DB's API (Application Programming Interface).**
These applications exemplify why the semantic web designs can be particularly effective at enabling generic tools to assist users in exploring data documenting very specific and very complex relationships. Snapshot A was taken from S3DB's web interface, which is included in the downloadable package. This interface was developed to assist in managing the database model and, therefore, is centered on the visualization and manipulation of the domain of discourse, its *Collections of Items* and *Rules* defining the documentation of their relations. The application depicted on snapshots B–D describe a document management tool S3DBdoc, freely available as a Bioinformatics Station module (see Figure 6). The navigation is performed starting from the Project (C), then to the *Collection* (B) and finally to the editing of the *Statements* about an *Item* (D). The snapshot B illustrates an intermediate step in the navigation where the list of *Items* (in this case samples assayed by tissue arrays, for which there is clinical information about the donor) is being trimmed according to the properties of a distant entity, Age at Diagnosis, which is a property of the Clinical Information *Collection* associated with the sample that originated the array results. This interaction would have been difficult and computationally intensive to manage using a relational architecture. The RDF formatted query result produced by the API was also visualized using a commercial tool, Sentient Knowledge Explorer (IO-Informatics Inc), shown in snapshot E, and by Welkin, developed by the digital inter-operability SIMILE project at the Massachusetts Institute of Technology. See text for discussion of graphic representations by these tools. To protect patient confidentiality some values in snapshots B and D are scrambled and numeric sample and patient identifiers elsewhere are altered.

**Figure 6. Prototype infrastructure for integrated data management and analysis being tested by the Univ. Texas Lung cancer SPORE.**
The system is based on two components, a network of universal semantic database servers and a code distribution server that delivers applications in response to the use of ontology. Four distinct user cases are represented, a–d, which rely on a combination of download of interpreted code (green arrows) or direct access to web-based graphic user interfaces or web-based API (blue arrows, in the latter case using Representational State Transfer, REST). The dotted lines represent regular updating of the application, propagating improvements in the application code.

See this image and copyright information in PMC

References

1. Blake JA, Bult CJ. Beyond the data deluge: data integration and bio-ontologies. J Biomed Inform. 2006;39:314–320. - PubMed
1. Komatsoulis GA, Warzel DB, Hartel FW, Shanbhag K, Chilukuri R, et al. caCORE version 3: Implementation of a model driven, service-oriented architecture for semantic interoperability. J Biomed Inform 2007 - PMC - PubMed
1. Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, et al. Advancing translational research with the Semantic Web. BMC Bioinformatics. 2007;8(Suppl 3):S2. - PMC - PubMed
1. Brazhnik O, Jones JF. Anatomy of data integration. J Biomed Inform. 2007;40:252–269. - PMC - PubMed
1. Hendler J. Communication. Science and the semantic web. Science. 2003;299:520–521. - PubMed

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

A Semantic Web management model for integrative biomedical informatics

Affiliation

A Semantic Web management model for integrative biomedical informatics

Authors

Affiliation

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources