F1000Res. 2019 Oct 29:8:1822.
doi: 10.12688/f1000research.21027.2. eCollection 2019.

A hands-on introduction to querying evolutionary relationships across multiple data sources using SPARQL

Ana Claudia Sima et al. F1000Res.

Abstract

The increasing use of Semantic Web technologies in the life sciences, in particular the Resource Description Framework (RDF) and the RDF query language SPARQL, opens the path for novel integrative analyses that combine information from multiple data sources. However, analyzing evolutionary data in RDF is not trivial, because of the steep learning curve required to understand both the data models adopted by different RDF data sources and the SPARQL constructs needed to query this data, in particular recursive property paths. In this article, we provide a hands-on introduction to querying evolutionary data across several data sources that publish orthology information in RDF, namely the Orthologous MAtrix (OMA), the European Bioinformatics Institute (EBI) RDF platform, the Database of Orthologous Groups (OrthoDB) and the Microbial Genome Database (MBGD). We present four protocols in increasing order of complexity. In these protocols, we demonstrate through SPARQL queries how to retrieve pairwise orthologs, homologous groups, and hierarchical orthologous groups. Finally, we show how orthology information in different data sources can be compared through federated SPARQL queries.

Keywords: Comparative Genomics; Orthology; Resource Description Framework (RDF); SPARQL; Sequence Homology.

Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. Simplified query graph that can be used as support for writing SPARQL queries to extract relevant information, such as proteins in a particular species.
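As a rough illustration of how a query graph like the one in Figure 1 maps onto SPARQL, the sketch below turns a list of edges into triple patterns. The edge list, the prefixes and the property names (orth:organism, obo:RO_0002162, up:scientificName) are assumptions modelled on common ORTH ontology usage, not a transcription of the figure, and the helper graph_to_sparql is hypothetical.

```python
# Hypothetical query graph: each edge is (subject, property, object).
# Terms starting with "?" are variables; the others are prefixed names.
# The property names are assumptions, not taken from the figure itself.
EDGES = [
    ("?protein", "a", "orth:Protein"),
    ("?protein", "orth:organism", "?organism"),
    ("?organism", "obo:RO_0002162", "?taxon"),
    ("?taxon", "up:scientificName", "?species"),
]

PREFIXES = {
    "orth": "http://purl.org/net/orth#",
    "obo": "http://purl.obolibrary.org/obo/",
    "up": "http://purl.uniprot.org/core/",
}


def graph_to_sparql(edges, prefixes, select, species):
    """Render each graph edge as one SPARQL triple pattern."""
    header = "\n".join(f"PREFIX {p}: <{uri}>" for p, uri in prefixes.items())
    patterns = "\n".join(f"  {s} {p} {o} ." for s, p, o in edges)
    return (
        f"{header}\n"
        f"SELECT {' '.join(select)} WHERE {{\n"
        f"{patterns}\n"
        f'  FILTER(STR(?species) = "{species}")\n'
        f"}}"
    )


QUERY = graph_to_sparql(EDGES, PREFIXES, ["?protein"], "Homo sapiens")
print(QUERY)
```

Each edge of the graph becomes exactly one triple pattern, so reading a path in the figure from node to node yields the body of the WHERE clause.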
Figure 2. A fragment of the hierarchical orthologous cluster no. 28799 in MBGD.
A cluster can consist of genes, domains (sub-genes) or further nested orthologous clusters. Multiple levels of the hierarchy may need to be traversed recursively in order to reach a given orthologous gene. For example, the gene mxa:PL1911 (highlighted in red) can be reached through the member orthologous cluster 2018-01_tax32_8537 (shown in blue). This can be achieved in SPARQL through a recursive graph pattern, using the hasHomologous* property path¹. A graphical abstraction of the RDF representation is provided in Figure 3.
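The zero-or-more semantics of a path such as hasHomologous* can be sketched with a small breadth-first traversal over an in-memory graph. The toy graph below is only loosely modelled on Figure 2: the identifiers cluster:28799 and cluster:2018-01_tax32_8537 are shorthand for the clusters named in the legend, and the two gene:hypothetical entries are made up for illustration.

```python
from collections import deque

# Toy hasHomologous edges (parent -> members). Only mxa:PL1911 and the two
# cluster names come from the figure legend; the other genes are hypothetical.
HAS_HOMOLOGOUS = {
    "cluster:28799": ["cluster:2018-01_tax32_8537", "gene:hypothetical_A"],
    "cluster:2018-01_tax32_8537": ["mxa:PL1911", "gene:hypothetical_B"],
}


def reachable(start, edges):
    """Nodes matched by `start hasHomologous* ?x` (zero or more hops)."""
    seen = {start}  # zero hops: the start node matches itself
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


members = reachable("cluster:28799", HAS_HOMOLOGOUS)
print(sorted(members))
```

Because the path allows zero hops, the starting cluster is part of the result, which is why queries of this kind usually add a type constraint (for example, requiring the end node to be a gene).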
Figure 3. Directed graph abstraction of a portion of the MBGD RDF graph related to hierarchical orthologous groups.
In Figure 3, nodes are either classes or variables, and edges are RDF properties. Terms preceded by a question mark (e.g. ?gene1) represent variables that can be bound to zero or more literals or URIs. Dashed edges illustrate the orth:hasHomologous property, which can be stated zero or more times, recursively. URI prefixes were omitted. MBGD is gene-centric, and the taxonomic ranges at which hierarchical orthologous groups (HOGs) are built are not directly available in RDF; in some cases they can be extracted from the cluster URI (e.g. http://mbgd.genome.ad.jp/rdf/resource/cluster/2018-01_tax32_8537 corresponds to taxonomic identifier 32, Myxococcus). By contrast, the taxonomic information per gene entry is richer in MBGD than in OMA, including explicit Superkingdom and Phylum information. Example SPARQL queries based on this graph abstraction are provided in the “Protocols” section, as well as in the accompanying Jupyter notebook. The pairwise orthology information is not directly available (e.g. through an RDF property), but it can be extracted from the Orthologs Cluster (to highlight this, the “isPairwiseOrthologous” relation is shown in green with a dashed arrow).
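A minimal sketch of extracting the taxonomic identifier from an MBGD cluster URI follows. It assumes that the local name always embeds the identifier as `_tax<digits>_`, a pattern inferred from the single example above and not confirmed for all clusters; the function name is hypothetical.

```python
import re

CLUSTER_URI = "http://mbgd.genome.ad.jp/rdf/resource/cluster/2018-01_tax32_8537"

# Assumed naming pattern, inferred from the one example URI in the legend.
_TAX_PATTERN = re.compile(r"_tax(\d+)_")


def taxon_id_from_cluster_uri(uri):
    """Return the taxonomic identifier embedded in the URI, or None."""
    local_name = uri.rsplit("/", 1)[-1]
    match = _TAX_PATTERN.search(local_name)
    return int(match.group(1)) if match else None


print(taxon_id_from_cluster_uri(CLUSTER_URI))  # 32
```

Returning None for URIs that do not match keeps the "in some cases" caveat explicit: callers have to handle clusters whose URI carries no taxonomic hint.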
Figure 4. Directed graph abstraction of a portion of the OMA RDF graph related to hierarchical orthologous groups.
In Figure 4, dashed edges illustrate the orth:hasHomologousMember property, which can be stated zero or more times, recursively. OMA is protein-centric; however, the genes that encode the proteins are also available in RDF through the "is encoded by" property (a cross-reference to Ensembl identifiers is also provided). Furthermore, the taxonomic ranges at which HOGs were built are asserted through the “hasTaxonomicRange” property. The pairwise orthology information is not directly available (e.g. through an RDF property), but it can be extracted from the Orthologs Cluster (to highlight this, the “isPairwiseOrthologous” relation is shown in green with a dashed arrow). Note: URI prefixes were omitted.
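One way pairwise orthologs might be derived from OMA's Orthologs Clusters is sketched below: two proteins are paired when they descend from different direct members of the same cluster. The prefix URIs, the class names orth:OrthologsCluster and orth:Protein, and the use of sio:SIO_010079 for "is encoded by" are assumptions, and the endpoint URL is a placeholder, not the real OMA endpoint. The helper only builds a SPARQL protocol GET URL and does not contact any server.

```python
from urllib.parse import urlencode

# Assumed prefixes and terms; check them against the OMA RDF schema.
OMA_PAIRWISE_QUERY = """\
PREFIX orth: <http://purl.org/net/orth#>
PREFIX sio: <http://semanticscience.org/resource/>

SELECT DISTINCT ?protein1 ?protein2 ?gene1 WHERE {
  ?cluster a orth:OrthologsCluster ;
           orth:hasHomologousMember ?node1 ;
           orth:hasHomologousMember ?node2 .
  ?node1 orth:hasHomologousMember* ?protein1 .
  ?node2 orth:hasHomologousMember* ?protein2 .
  ?protein1 a orth:Protein ;
            sio:SIO_010079 ?gene1 .   # assumed "is encoded by"
  ?protein2 a orth:Protein .
  FILTER(?node1 != ?node2)
}
LIMIT 10
"""


def sparql_get_url(endpoint, query):
    """Build a SPARQL protocol GET URL carrying the query in `query=`."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Placeholder endpoint, used only to show the URL shape.
REQUEST_URL = sparql_get_url("https://sparql.example.org/sparql", OMA_PAIRWISE_QUERY)
print(REQUEST_URL)
```

The FILTER on ?node1 and ?node2 is what separates the two sides of the cluster; without it, proteins under the same child node would also be paired.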
Figure 5. Directed graph abstraction of a portion of the EBI RDF graph related to pairwise orthologous genes.
As opposed to the RDF representations in OMA and MBGD, here the pairwise orthology is explicitly asserted through the “is orthologous to” property (more precisely, http://semanticscience.org/resource/SIO_000558), as shown in Figure 5. However, no information is available regarding orthologous clusters. Moreover, the Gene class here is in fact the OBO (not ORTH) class, i.e. http://purl.obolibrary.org/obo/SO_0000704. Instances of these genes can be specified either through their cross-reference to UniProt (the http://rdf.ebi.ac.uk/terms/ensembl/DEPENDENT property) or directly through their Ensembl identifier, by fixing the value of ?gene to the concatenation of http://rdf.ebi.ac.uk/resource/ensembl/ and the corresponding Ensembl identifier. Finally, the taxonomic identifiers are provided via instances of the BioSource class, http://www.biopax.org/release/biopax-level3.owl#BioSource.
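The concatenation described above can be sketched as follows. The URIs are the ones quoted in the legend; the helper names and the example identifier ENSG00000139618 are chosen purely for illustration, and the query shape is an assumption about how the pattern might be written, not a query taken from the article.

```python
ENSEMBL_RESOURCE_BASE = "http://rdf.ebi.ac.uk/resource/ensembl/"


def ensembl_gene_uri(ensembl_id):
    """Concatenate the EBI resource base and an Ensembl identifier."""
    return ENSEMBL_RESOURCE_BASE + ensembl_id


def ebi_orthologs_query(ensembl_id, limit=10):
    """Sketch of a query for genes asserted as orthologous to one gene."""
    gene = ensembl_gene_uri(ensembl_id)
    return (
        "PREFIX sio: <http://semanticscience.org/resource/>\n"
        "PREFIX obo: <http://purl.obolibrary.org/obo/>\n"
        "SELECT ?ortholog WHERE {\n"
        f"  <{gene}> a obo:SO_0000704 ;\n"   # the OBO Gene class
        "      sio:SIO_000558 ?ortholog .\n"  # "is orthologous to"
        "}\n"
        f"LIMIT {limit}"
    )


# Example identifier used only for illustration.
EBI_QUERY = ebi_orthologs_query("ENSG00000139618")
print(EBI_QUERY)
```

Because the orthology relation is asserted directly, no recursive property path is needed here, in contrast to the cluster-based sources.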
Figure 6. Directed graph abstraction of a portion of the OrthoDB RDF graph related to orthologous groups.
Note that the abstract relation “?gene1 isPairwiseOrthologous ?gene2” is derived from the concrete property path “?gene1 :memberOf / :hasMember ?gene2”, which is equivalent to the following pair of joined triple patterns: “?gene1 :memberOf ?group. ?group :hasMember ?gene2.”. In Figure 6, genes are direct members of OrthoGroups built at a given taxonomic level (Clade), e.g. Cyanobacteria, available through the "ogBuiltAt" property. The cross-references to UniProt (as well as Ensembl and Entrez) are available through a 2-triple pattern (for examples, see the “Protocols” section).
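The equivalence between the sequence path and its two joined triple patterns can be checked on a toy triple set. All gene and group names below are hypothetical, and the evaluation is a plain in-memory join, not a SPARQL engine.

```python
# Toy triples; gene and group names are hypothetical.
TRIPLES = {
    ("gene:A", ":memberOf", "group:1"),
    ("gene:B", ":memberOf", "group:1"),
    ("gene:C", ":memberOf", "group:2"),
    ("group:1", ":hasMember", "gene:A"),
    ("group:1", ":hasMember", "gene:B"),
    ("group:2", ":hasMember", "gene:C"),
}


def objects(subject, predicate):
    """All ?o such that (subject, predicate, ?o) is in the triple set."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def sequence_path(start, first, second):
    """Evaluate `start first/second ?x` as two joined triple patterns."""
    return {
        end
        for middle in objects(start, first)   # start first ?middle
        for end in objects(middle, second)    # ?middle second ?x
    }


co_members = sequence_path("gene:A", ":memberOf", ":hasMember")
orthologs = co_members - {"gene:A"}  # drop the trivial self-match
print(sorted(orthologs))
```

The path also matches the starting gene itself, so a query built on it would typically add a filter such as FILTER(?gene1 != ?gene2).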

References

    Altenhoff AM, Gil M, Gonnet GH, et al.: Inferring hierarchical orthologous groups from orthologous gene pairs. PLoS One. 2013;8(1):e53786. doi: 10.1371/journal.pone.0053786
    Altenhoff AM, Glover NM, Train CM, et al.: The OMA orthology database in 2018: retrieving evolutionary relationships among all domains of life through richer web and programmatic interfaces. Nucleic Acids Res. 2018;46(D1):477–485. doi: 10.1093/nar/gkx1019
    Brooksbank C, Bergman MT, Apweiler R, et al.: The European Bioinformatics Institute's data resources 2014. Nucleic Acids Res. 2014;42(Database issue):D18–25. doi: 10.1093/nar/gkt1206
    Chiba H, Nishide H, Uchiyama I: Construction of an ortholog database using the semantic web technology for integrative analysis of genomic data. PLoS One. 2015;10(4):e0122802. doi: 10.1371/journal.pone.0122802
    de Farias TM, Chiba H, Fernández-Breis JT: Leveraging logical rules for efficacious representation of large orthology datasets. s.l., Proceedings of the 10th International Semantic Web Applications and Tools for Healthcare and Life Sciences (SWAT4HCLS) Conference. 2017.
