Nucleic Acids Res. 2006 Jan 1;34(Database issue):D235-42.
doi: 10.1093/nar/gkj153.

BIOZON: a hub of heterogeneous biological data


Aaron Birkland et al. Nucleic Acids Res. 2006.

Abstract

Biological entities are strongly related and mutually dependent. There is therefore a growing need to corroborate and integrate data from different resources and aspects of biological systems in order to analyze them effectively. Biozon is a unified biological database that integrates heterogeneous data types such as proteins, structures, domain families, protein-protein interactions and cellular pathways, and establishes the relationships between them. All data are integrated onto a single graph schema centered around the non-redundant set of biological objects that are shared by each source. This integration results in a highly connected graph structure that provides a more complete picture of the known context of a given object than can be determined from any one source. Currently, Biozon integrates roughly 2 million protein sequences, 42 million DNA or RNA sequences, 32,000 protein structures, 150,000 interactions and more from sources such as GenBank, UniProt, Protein Data Bank (PDB) and BIND. Biozon augments source data with locally derived data such as 5 billion pairwise protein alignments and 8 million structural alignments. The user may form complex cross-type queries on the graph structure, add similarity relations to form fuzzy queries, and rank the results based on an analysis of the edge structure similar to Google PageRank. Biozon is available online at Biozon.org.


Figures

Figure 1
A schematic representation of a section of the Biozon data graph at the instance level. Each node or edge is an instance of a class in the class hierarchy (Supplementary Figure S1). This particular section of the graph centers around a protein that has a known structure, is coded by a known DNA sequence, and is involved in an interaction. On the complete graph, all types (e.g. pathways, EC families, domains, etc.) are represented.
Figure 2
Mapping a SwissProt instance to the Biozon graph. Each dataset maps to a set of nodes and edges in the Biozon graph, each of which is a member of the class hierarchies presented in Supplementary Figure S1. This is the simplest representation in Biozon: one object related to one descriptor containing searchable annotation. For example, a SwissProt record of a protein is transformed into an instance of an amino acid sequence object and a SwissProt descriptor that are related together by the ‘describes’ relation.
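The mapping described in this caption can be sketched in a few lines of Python. This is only an illustration, not Biozon's actual schema or loader: the `Graph` class, the `add_swissprot_record` function, the accessions and the sequences are all hypothetical, and keying the object node by its sequence is an assumed way of keeping the object set non-redundant.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node id -> attributes
    edges: set = field(default_factory=set)    # (source id, relation, target id)


def add_swissprot_record(graph, accession, definition, sequence):
    """Map one record to a sequence object, a descriptor, and a 'describes' edge."""
    # Assumption: the object is identified by the sequence itself, so two
    # records with identical sequences share a single object node.
    object_id = ("aa_sequence", sequence)
    descriptor_id = ("swissprot", accession)
    graph.nodes.setdefault(
        object_id, {"type": "amino_acid_sequence", "sequence": sequence}
    )
    graph.nodes[descriptor_id] = {
        "type": "swissprot_descriptor",
        "definition": definition,
    }
    graph.edges.add((descriptor_id, "describes", object_id))
    return object_id, descriptor_id


# Made-up records: two descriptors that annotate the same sequence.
graph = Graph()
obj_a, desc_a = add_swissprot_record(graph, "ACC_A", "example protein A", "MSTK")
obj_b, desc_b = add_swissprot_record(graph, "ACC_B", "example protein B", "MSTK")
```

In this sketch the two records produce one shared object node and two descriptor nodes, each linked to the object by its own 'describes' edge.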
Figure 3
A profile page displays an overview of the data in Biozon relating to a particular object (such as SwissProt protein UBC4_YEAST). At the top is a brief summary of some of the most generally useful attributes, followed by a physical representation (if applicable), links to relevant descriptor annotation documents, neighboring objects in the Biozon graph, and similar objects. The illustrations on the right side are a schematic representation of the information that is available either one hop or two hops away from the profile page of the object viewed. Similarity relations include similarities based on sequence or based on expression profiles of the corresponding mRNA sequences.
Figure 4
Complex search. This is a schematic representation of the complex query: ‘Find all structures with resolution less than 2 angstroms of proteins that are in the HIV-1 protease enzyme family’. Results are returned when there are paths conforming to the given topology through objects that satisfy all constraints (instances in the dark-shaded intersection in the middle). In this case, a set of structures will be returned, each one a member of a topology satisfying the complex search parameters.
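A complex query of this kind amounts to following a fixed path (structure, protein, enzyme family) and keeping the instances that satisfy every constraint along it. The sketch below shows the idea on toy data; the dictionaries, identifiers and the `structures_for_family` function are hypothetical and say nothing about how Biozon stores or evaluates queries.

```python
# Toy instance data; all identifiers and values are made up.
structures = {"s1": 1.8, "s2": 2.5, "s3": 1.5}             # structure id -> resolution (angstroms)
structure_of = {"s1": "p1", "s2": "p1", "s3": "p2"}        # structure id -> protein id
family_members = {"family_X": {"p1"}, "family_Y": {"p2"}}  # family id -> protein ids


def structures_for_family(family, max_resolution):
    """Return structures below max_resolution whose protein belongs to family."""
    proteins = family_members.get(family, set())
    return sorted(
        s
        for s, resolution in structures.items()
        if resolution < max_resolution and structure_of[s] in proteins
    )


result = structures_for_family("family_X", 2.0)
print(result)  # ['s1']
```

Only `s1` is returned: `s2` fails the resolution constraint and `s3` belongs to a protein outside the requested family.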
Figure 5
Fuzzy search for all proteins in enzyme family 1.1.1.1 that have a known structure and are involved in an interaction. (A) A standard search (as of April 2005) returned no results: no protein is a member of all three sets (in EC 1.1.1.1, has a known structure, and is involved in an interaction). (B) To form a fuzzy search, the set of proteins in EC 1.1.1.1 is extended to include those similar to proteins in EC family 1.1.1.1. In this example the fuzzy search returns two results.
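The difference between the standard and the fuzzy search can be illustrated with plain set operations. The protein identifiers and the `similar_to` relation below are invented for the example, and expanding the seed set by one hop of similarity is an assumption about how the extension works.

```python
# Toy sets; all protein identifiers are made up.
ec_members = {"p1", "p2"}                    # proteins annotated with the EC family
has_structure = {"p2", "p3", "p4", "p5"}     # proteins with a known structure
in_interaction = {"p1", "p3", "p4"}          # proteins involved in an interaction
similar_to = {"p1": {"p3"}, "p2": {"p4", "p6"}}  # hypothetical similarity relation


def expand(seed, similar_to):
    """Extend a set with every protein similar to one of its members."""
    expanded = set(seed)
    for protein in seed:
        expanded |= similar_to.get(protein, set())
    return expanded


# Standard search: plain intersection of the three sets.
strict = ec_members & has_structure & in_interaction

# Fuzzy search: extend the EC set by similarity before intersecting.
fuzzy = expand(ec_members, similar_to) & has_structure & in_interaction
```

With this toy data the strict intersection is empty, while the fuzzy search returns `p3` and `p4`, mirroring the behavior described in panels A and B.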
Figure 6
Ranking search results. (A) Computing ranks: each entity confers its weight on its neighboring entities (solid lines) with probability α, and on a random node selected from all graph nodes (dashed lines) with probability 1 − α, imitating a random walk through the document space. The computation starts with random weights and re-assigns weights until convergence. (B) The first five results from a search for proteins that contain the word ‘cancer’ in their definition, with and without using ranks. Each instance is shown with its context (the entities that are connected to it).
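The iteration described in panel A resembles a PageRank-style power iteration, which can be sketched as follows. This is a generic illustration rather than Biozon's ranking code: the `rank` function, the value α = 0.85, the convergence tolerance, the uniform redistribution for nodes without neighbors, and the toy star graph are all assumptions.

```python
import random


def rank(neighbors, alpha=0.85, tol=1e-10, max_iter=1000, seed=0):
    """PageRank-style weights for a graph given as node -> list of neighbors."""
    nodes = sorted(neighbors)
    n = len(nodes)
    # Start from random positive weights, normalized to sum to 1.
    rng = random.Random(seed)
    raw = {v: rng.random() + 1e-9 for v in nodes}
    total = sum(raw.values())
    weights = {v: w / total for v, w in raw.items()}
    for _ in range(max_iter):
        # With probability 1 - alpha the walk jumps to a uniformly random node.
        new = {v: (1.0 - alpha) / n for v in nodes}
        for v in nodes:
            out = neighbors[v]
            if out:
                # With probability alpha the weight is split among neighbors.
                for u in out:
                    new[u] += alpha * weights[v] / len(out)
            else:
                # Assumption: a node without neighbors spreads its weight evenly.
                for u in nodes:
                    new[u] += alpha * weights[v] / n
        change = sum(abs(new[v] - weights[v]) for v in nodes)
        weights = new
        if change < tol:
            break
    return weights


# Toy star graph: one hub connected to three leaves (made-up node names).
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
weights = rank(star)
```

On the star graph the hub ends up with the largest weight and the three leaves share equal weights, which is the kind of context-based ordering the caption describes.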

