Database (Oxford). 2019 Jan 1:2019:baz132.

doi: 10.1093/database/baz132.

GenoSurf: metadata driven semantic search system for integrated genomic datasets

Arif Canakoglu¹, Anna Bernasconi¹, Andrea Colombo¹, Marco Masseroli¹, Stefano Ceri¹

PMID: 31820804
PMCID: PMC6902006
DOI: 10.1093/database/baz132

Arif Canakoglu et al. Database (Oxford). 2019.

Affiliation

¹ Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milan, Italy.

Abstract

Many valuable resources developed by worldwide research institutions and consortia describe genomic datasets that are both open and available for secondary research, but their metadata search interfaces are heterogeneous, not interoperable and sometimes very limited in their capabilities. We implemented GenoSurf, a multi-ontology semantic search system providing access to a consolidated collection of metadata attributes found in the most relevant genomic datasets; values of 10 attributes are semantically enriched using the most suitable available ontologies. The user of GenoSurf provides search terms as input, sets the desired level of ontological enrichment and obtains as output the identity of matching data files at the various sources. Search is facilitated by drop-down lists of matching values; aggregate counts describing the resulting files are updated in real time as search terms are progressively added. In addition to the consolidated attributes, users can perform keyword-based searches on the original (raw) metadata, which are also imported; GenoSurf supports the interplay of attribute-based and keyword-based search through well-defined interfaces. Currently, GenoSurf integrates about 40 million metadata entries from several major data sources, including three providers of clinical and experimental data (TCGA, ENCODE and Roadmap Epigenomics) and two sources of annotation data (GENCODE and RefSeq); it can be used as a standalone resource for targeting the genomic datasets at their original sources (identified by their accession IDs and URLs), or as part of an integrated query answering system for performing complex queries over genomic regions and metadata.
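The search flow described in the abstract (search term in, enrichment level set, matching files and aggregate counts out) can be illustrated with a minimal Python sketch. It is not GenoSurf's implementation: the toy ontology `CHILDREN`, the rows in `FILES`, and the functions `expand`, `search` and `count_by_source` are all hypothetical, and the "enrichment level" is modeled here simply as the number of descendant levels added to the search term.

```python
from collections import Counter, deque

# Hypothetical toy ontology: term -> direct child terms.
# Illustrative only; not copied from Uberon or any other ontology.
CHILDREN = {
    "uterus": ["endometrium", "myometrium"],
    "endometrium": ["endometrial gland"],
}

# Hypothetical metadata rows: (file id, source, tissue value).
FILES = [
    ("f1", "ENCODE", "uterus"),
    ("f2", "TCGA", "endometrium"),
    ("f3", "ENCODE", "endometrial gland"),
    ("f4", "TCGA", "liver"),
]


def expand(term, max_depth):
    """Return the term plus its descendants up to max_depth levels below it."""
    seen = {term}
    queue = deque([(term, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # enrichment level reached; do not descend further
        for child in CHILDREN.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


def search(term, max_depth):
    """Return ids of files whose tissue value is in the expanded term set."""
    terms = expand(term, max_depth)
    return [file_id for file_id, _, tissue in FILES if tissue in terms]


def count_by_source(file_ids):
    """Aggregate the matching files per source, as a plain dict."""
    wanted = set(file_ids)
    return dict(Counter(src for file_id, src, _ in FILES if file_id in wanted))
```

Calling `search("uterus", 0)` matches only files annotated with the exact term, while larger depths progressively include files annotated with more specific terms; `count_by_source` recomputes the per-source totals for whatever the current result set is.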

Figures

**Figure 1**
Logical schema of the metadata repository. Red relations represent foreign keys (FKs) between core schema tables; blue relations link core schema values to the corresponding ontology vocabulary terms. Data types are shortened: str for character varying; int for integer; and bool for Boolean. N marks nullable attributes.

**Figure 2**
Excerpt of Uberon subtree originating from the ‘uterus’ root. We only report elements that are relevant to our example.

**Figure 3**
Partition of data in the integrated repository according to (a) assemblies and datasets from the most relevant sources: (b) ENCODE; (c) Roadmap Epigenomics; (d) TCGA in the GRCh38 version provided by GDC; and (e) TCGA in the hg19 legacy data repository.

**Figure 4**
Sections of GenoSurf Web interface: (i) top menu bar; (ii) query utilities; (iii) data search; (iv) key-value search; and (v) results visualization.

**Figure 5**
Data search section of the GenoSurf Web interface, highlighting attributes within the four dimensions of the repository core schema; values are entered by users and appear in drop-down menus to ease their selection.

**Figure 6**
Key-value search result using input string ‘disease’ as a key. The keyword is matched both in the GCM attributes (for each matching attribute, we present the number of available distinct values and some example values) and in the original source attributes (each matching attribute enables exploration and selection of any corresponding values).

**Figure 7**
Example of composition of two key-value search sessions.

**Figure 8**
Excerpt of the result items table resulting from a search session. Red ellipses highlight relevant features. Top left: GMQL button to generate queries to further process related data files; DOWNLOAD buttons for result items table and data file links; and Replicated/Aggregated switch. Top right: SORT FIELDS button to customize the columns visualized in the table. Center: Extra, Source URI and Local URI columns with clickable links. Bottom right: component to set the number of rows visible at a time; indication of the total items corresponding to the performed query.

**Figure 9**
Available datasets for the performed GRCh38 TCGA Cholangiocarcinoma data search.

**Figure 10**
Example of the key-value filters needed to select triple-negative breast cancer items after using the data search interface to preliminarily select TCGA-BRCA breast items.
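The key-value search behavior described in the Figure 6 caption (a keyword matched as a key both in the consolidated attributes and in the original source attributes) can be sketched as follows. This is a rough illustration under stated assumptions, not the system's code: the dictionaries `GCM` and `RAW`, their contents, and the function `search_key` are hypothetical, and matching is modeled as a case-insensitive substring test on the key.

```python
# Hypothetical consolidated attributes and their values.
GCM = {
    "disease": ["breast cancer", "cholangiocarcinoma"],
    "tissue": ["breast", "liver"],
}

# Hypothetical original (raw) metadata: (item id, key, value).
RAW = [
    ("item1", "clinical__disease_type", "ductal carcinoma"),
    ("item2", "biosample__health_status", "healthy"),
    ("item3", "clinical__disease_type", "cholangiocarcinoma"),
]


def search_key(keyword):
    """Match a keyword against keys in both attribute collections."""
    kw = keyword.lower()

    # Consolidated attributes: report the number of distinct values
    # and a couple of example values for each matching attribute.
    gcm_hits = {}
    for attr, values in GCM.items():
        if kw in attr.lower():
            distinct = sorted(set(values))
            gcm_hits[attr] = {
                "distinct_values": len(distinct),
                "examples": distinct[:2],
            }

    # Original attributes: collect every value seen under each matching
    # key, so the values can be explored and selected afterwards.
    raw_hits = {}
    for _, key, value in RAW:
        if kw in key.lower():
            raw_hits.setdefault(key, set()).add(value)

    return gcm_hits, raw_hits
```

With the toy data above, `search_key("disease")` returns one consolidated attribute with a summary of its values and one original key with the full set of values found under it.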
