BIOZON: a hub of heterogeneous biological data

Aaron Birkland¹, Golan Yona

PMID: 16381854
PMCID: PMC1347515
DOI: 10.1093/nar/gkj153

Nucleic Acids Res. 2006 Jan 1;34(Database issue):D235-42.

Affiliation

¹ Cornell University, Ithaca, NY, USA.

Abstract

Biological entities are strongly related and mutually dependent. There is therefore a growing need to corroborate and integrate data from different resources and from different aspects of biological systems in order to analyze them effectively. Biozon is a unified biological database that integrates heterogeneous data types such as proteins, structures, domain families, protein-protein interactions and cellular pathways, and establishes the relationships between them. All data are integrated onto a single graph schema centered around the non-redundant set of biological objects shared by the sources. This integration results in a highly connected graph structure that gives a more complete picture of the known context of a given object than can be obtained from any single source. Currently, Biozon integrates roughly 2 million protein sequences, 42 million DNA or RNA sequences, 32,000 protein structures, 150,000 interactions and more from sources such as GenBank, UniProt, Protein Data Bank (PDB) and BIND. Biozon augments the source data with locally derived data such as 5 billion pairwise protein alignments and 8 million structural alignments. Users may form complex cross-type queries on the graph structure, add similarity relations to form fuzzy queries, and rank the results based on an analysis of the edge structure similar to Google PageRank. Biozon is available online at Biozon.org.

Figures

**Figure 1**
A schematic representation of a section of the Biozon data graph at the instance level. Each node or edge is an instance of a class in the class hierarchy (Supplementary Figure S1). This particular section of the graph centers around a protein that has a known structure, is coded by a known DNA sequence, and is involved in an interaction. On the complete graph, all types (e.g. pathways, EC families, domains, etc.) are represented.

**Figure 2**
Mapping a SwissProt instance to the Biozon graph. Each dataset maps to a set of nodes and edges in the Biozon graph, each of which is a member of the class hierarchies presented in Supplementary Figure S1. This is the simplest representation in Biozon: one object related to one descriptor containing searchable annotation. For example, a SwissProt record of a protein is transformed into an instance of an amino acid sequence object and a SwissProt descriptor that are related together by the ‘describes’ relation.
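
The object-plus-descriptor mapping described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` and `Edge` classes, the class names, the record fields and the idea of keying the sequence object by a hash of its residues are all hypothetical and are not taken from the Biozon implementation.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str
    node_class: str      # hypothetical class names, e.g. "amino_acid_sequence"
    annotation: str = ""  # searchable text, only used by descriptor nodes


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str


def map_swissprot_record(entry_name, definition, sequence):
    """Map one simplified SwissProt-like record to (object, descriptor, edge).

    Assumption: the object node is keyed by a hash of the sequence, so that
    identical sequences from different records collapse into one node.
    """
    digest = hashlib.sha1(sequence.encode("ascii")).hexdigest()[:12]
    protein = Node(f"aa_seq:{digest}", "amino_acid_sequence")
    descriptor = Node(f"swissprot:{entry_name}", "swissprot_descriptor", definition)
    describes = Edge(descriptor.node_id, protein.node_id, "describes")
    return protein, descriptor, describes


# Two invented records that share a sequence map to the same object node
# but keep separate descriptors.
obj_a, desc_a, edge_a = map_swissprot_record("EXAMPLE_A", "example protein", "MQSSK")
obj_b, desc_b, edge_b = map_swissprot_record("EXAMPLE_B", "same sequence", "MQSSK")
```

Keying the object by its content rather than by the source accession is one simple way to obtain a non-redundant object set; other identity rules are equally possible.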

**Figure 3**
A profile page displays an overview of the data in Biozon relating to a particular object (such as SwissProt protein UBC4_YEAST). At the top is a brief summary of some of the most generally useful attributes, followed by a physical representation (if applicable), links to relevant descriptor annotation documents, neighboring objects in the Biozon graph, and similar objects. The illustrations on the right side are a schematic representation of the information that is available either one hop or two hops away from the profile page of the object viewed. Similarity relations include similarities based on sequence or based on expression profiles of the corresponding mRNA sequences.

**Figure 4**
Complex search. This is a schematic representation of the complex query: ‘Find all structures with resolution less than 2 angstroms of proteins that are in the HIV-1 protease enzyme family’. Results are returned when there are paths conforming to the given topology through objects that satisfy all constraints (instances in the dark-shaded intersection in the middle). In this case, a set of structures will be returned, each one a member of a topology satisfying the complex search parameters.
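
As a rough illustration of such a topology-constrained query, the sketch below runs the same kind of search over a toy in-memory graph. All identifiers, resolutions and family names are invented, and the data layout (plain dictionaries and edge lists) is an assumption made for the example, not the Biozon query engine.

```python
# Toy data: every identifier and value here is invented for illustration.
structure_resolution = {"s1": 1.8, "s2": 2.5, "s3": 1.5}  # angstroms
structure_of = [("s1", "p1"), ("s2", "p1"), ("s3", "p2")]  # (structure, protein)
family_members = {"famA": {"p1"}, "famB": {"p2"}}          # family -> proteins


def complex_search(family, max_resolution):
    """Return structures below max_resolution whose protein is in the family.

    A structure is reported only if a full path
    structure -> protein -> family exists and every constraint holds.
    """
    members = family_members.get(family, set())
    return sorted(
        structure
        for structure, protein in structure_of
        if protein in members and structure_resolution[structure] < max_resolution
    )


hits = complex_search("famA", 2.0)
```

Here only `s1` satisfies both the resolution constraint and the family membership of its protein, so it is the single structure lying in the intersection.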

**Figure 5**
Fuzzy search for all proteins in enzyme family 1.1.1.1 that have a known structure and are involved in an interaction. (A) Standard search (as of April 2005) returned no results: no proteins are members of all three sets (in EC 1.1.1.1, with a known structure, and in an interaction). (B) To form a fuzzy search, the set of proteins in EC 1.1.1.1 is extended to include those *similar* to ones in EC family 1.1.1.1. In this example the fuzzy search returns two results.
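
The difference between the standard and the fuzzy search can be shown with plain set operations. The sketch below uses invented protein identifiers and an invented similarity table; it only mirrors the logic of the caption and is not how Biozon stores or evaluates similarity relations.

```python
# Toy sets: all protein identifiers are invented for illustration.
in_family = {"p1", "p2"}                # proteins annotated with the EC family
has_structure = {"p3", "p4", "p5"}      # proteins with a known structure
in_interaction = {"p3", "p5", "p6"}     # proteins involved in an interaction
similar_to = {"p1": {"p3"}, "p2": {"p5", "p7"}}  # hypothetical similarity relation


def standard_search():
    """Proteins that belong to all three sets."""
    return in_family & has_structure & in_interaction


def fuzzy_search():
    """Same intersection, after extending the family set with similar proteins."""
    extended = set(in_family)
    for protein in in_family:
        extended |= similar_to.get(protein, set())
    return extended & has_structure & in_interaction


strict_hits = standard_search()
fuzzy_hits = fuzzy_search()
```

With this toy data the strict intersection is empty, while the extended family set brings in `p3` and `p5`, both of which have a structure and take part in an interaction.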

**Figure 6**
Ranking search results. (A) Computing ranks: each entity confers its weight on its neighboring entities (solid lines) with probability α, and on a random node selected from all graph nodes (dashed lines) with probability 1 − α, imitating a random walk through the document space. The computation starts with random weights and re-assigns weights until convergence. (B) We show the first five results from a search for proteins that contain the word ‘cancer’ in their definition, with and without using ranks. Each instance is shown with its context (the entities that are connected to it).
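
The rank computation in panel (A) resembles a standard PageRank-style power iteration, which can be sketched as follows. Several choices here are assumptions rather than details from the source: the value α = 0.85, the convergence tolerance, the uniform (rather than random) starting weights, which are used so the example is reproducible, and the handling of nodes without neighbors.

```python
def rank_nodes(neighbors, alpha=0.85, tol=1e-10, max_iter=1000):
    """PageRank-style ranking on a graph given as {node: [neighboring nodes]}.

    With probability alpha a node passes its weight to its neighbors; with
    probability 1 - alpha the weight goes to a node chosen uniformly at random.
    """
    nodes = sorted(neighbors)
    n = len(nodes)
    weights = {v: 1.0 / n for v in nodes}  # assumption: uniform start
    for _ in range(max_iter):
        updated = {v: (1.0 - alpha) / n for v in nodes}
        for v in nodes:
            targets = neighbors[v]
            if targets:
                share = alpha * weights[v] / len(targets)
                for u in targets:
                    updated[u] += share
            else:
                # Assumption: an isolated node spreads its weight uniformly.
                for u in nodes:
                    updated[u] += alpha * weights[v] / n
        change = sum(abs(updated[v] - weights[v]) for v in nodes)
        weights = updated
        if change < tol:
            break
    return weights


# Toy star graph: one hub connected to three leaves (invented example).
toy_graph = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
ranks = rank_nodes(toy_graph)
```

On the toy star graph the hub ends up with the largest weight, which matches the intuition that highly connected entities are ranked first.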
