Database (Oxford). 2019 Jan 1;2019:baz132.
doi: 10.1093/database/baz132.

GenoSurf: metadata driven semantic search system for integrated genomic datasets

Arif Canakoglu et al. Database (Oxford). 2019.

Abstract

Many valuable resources developed by research institutions and consortia worldwide describe genomic datasets that are open and available for secondary research, but their metadata search interfaces are heterogeneous, not interoperable and sometimes very limited in capability. We implemented GenoSurf, a multi-ontology semantic search system that provides access to a consolidated collection of metadata attributes found in the most relevant genomic datasets; the values of 10 attributes are semantically enriched using the most suitable available ontologies. A GenoSurf user provides search terms as input, sets the desired level of ontological enrichment and obtains as output the identity of the matching data files at the various sources. Search is facilitated by drop-down lists of matching values; aggregate counts describing the resulting files are updated in real time as search terms are progressively added. In addition to the consolidated attributes, users can perform keyword-based searches on the original (raw) metadata, which are also imported; GenoSurf supports the interplay of attribute-based and keyword-based search through well-defined interfaces. Currently, GenoSurf integrates about 40 million metadata entries from several major data sources, including three providers of clinical and experimental data (TCGA, ENCODE and Roadmap Epigenomics) and two sources of annotation data (GENCODE and RefSeq); it can be used as a standalone resource for targeting genomic datasets at their original sources (identified by their accession IDs and URLs), or as part of an integrated query answering system for performing complex queries over genomic regions and metadata.
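The search flow described in the abstract (a search term, a chosen level of ontological enrichment, and a list of matching files) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy hierarchy is not actual Uberon content, the file accessions are invented, and treating the enrichment level as a number of hierarchy steps is only one plausible reading, not GenoSurf's documented behavior.

```python
from collections import deque

# Toy term hierarchy (parent -> children); illustrative only, not real Uberon data.
CHILDREN = {
    "uterus": ["uterine cervix", "endometrium"],
    "endometrium": ["endometrial gland"],
}

# Hypothetical metadata rows: (file accession, tissue value).
FILES = [
    ("FILE_A", "uterus"),
    ("FILE_B", "endometrium"),
    ("FILE_C", "endometrial gland"),
    ("FILE_D", "liver"),
]


def expand_term(term, max_depth):
    """Return the term plus its descendants up to max_depth levels below it."""
    seen = {term}
    queue = deque([(term, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend past the requested enrichment level
        for child in CHILDREN.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


def search(term, max_depth=0):
    """Return accessions of files whose tissue matches the expanded term set."""
    accepted = expand_term(term, max_depth)
    return [accession for accession, tissue in FILES if tissue in accepted]


for level in (0, 1, 2):
    print(level, search("uterus", level))
```

With level 0 only exact matches are returned; each additional level admits files annotated with terms one step further down the toy hierarchy.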

Figures

Figure 1
Logical schema of the metadata repository. Red relations represent FKs between core schema tables; blue relations link core schema values to corresponding ontology vocabulary terms. Data types are shortened: str for character varying; int for integer; and bool for Boolean. N marks nullable attributes.
Figure 2
Excerpt of Uberon subtree originating from the ‘uterus’ root. We only report elements that are relevant to our example.
Figure 3
Partition of data in the integrated repository according to (a) assemblies and datasets from the most relevant sources: (b) ENCODE; (c) Roadmap Epigenomics; (d) TCGA in the GRCh38 version provided by GDC; and (e) TCGA in the hg19 legacy data repository.
Figure 4
Sections of GenoSurf Web interface: (i) top menu bar; (ii) query utilities; (iii) data search; (iv) key-value search; and (v) results visualization.
Figure 5
Data search section of the GenoSurf Web interface, highlighting attributes within the four dimensions of the repository core schema; values are entered by users and appear in drop-down menus to ease their selection.
Figure 6
Key-value search result using input string ‘disease’ as a key. The keyword is matched both in the GCM attributes (for each matching attribute, we present the number of available distinct values and some example values) and in the original source attributes (each matching attribute enables exploration and selection of any corresponding values).
Figure 7
Example of composition of two key-value search sessions.
Figure 8
Excerpt of the result items table resulting from a search session. Red ellipses highlight relevant features. Top left: GMQL button to generate queries to further process related data files; DOWNLOAD buttons for result items table and data file links; and Replicated/Aggregated switch. Top right: SORT FIELDS button to customize the columns visualized in the table. Center: Extra, Source URI and Local URI columns with clickable links. Bottom right: component to set the number of rows visible at a time; indication of the total items corresponding to the performed query.
Figure 9
Available datasets for the performed GRCh38 TCGA Cholangiocarcinoma data search.
Figure 10
Example of the key-value filters needed to select triple-negative breast cancer items after using the data search interface to preliminarily select TCGA-BRCA breast items.
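
The two-step selection in the Figure 10 caption (an attribute-based data search followed by key-value filters on the original metadata) can be sketched as below. This is a hedged illustration only: the item records, the core attribute name `project`, and the raw keys `er_status`, `pr_status` and `her2_status` are hypothetical and are not taken from GenoSurf or TCGA.

```python
from collections import Counter

# Hypothetical items: consolidated ("core") attributes plus raw key-value metadata.
ITEMS = [
    {"id": "ITEM_1", "core": {"source": "TCGA", "project": "TCGA-BRCA"},
     "raw": {"er_status": "negative", "pr_status": "negative", "her2_status": "negative"}},
    {"id": "ITEM_2", "core": {"source": "TCGA", "project": "TCGA-BRCA"},
     "raw": {"er_status": "positive", "pr_status": "negative", "her2_status": "negative"}},
    {"id": "ITEM_3", "core": {"source": "TCGA", "project": "TCGA-CHOL"},
     "raw": {}},
    {"id": "ITEM_4", "core": {"source": "TCGA", "project": "TCGA-BRCA"},
     "raw": {"er_status": "negative", "pr_status": "negative", "her2_status": "negative"}},
]


def data_search(items, **core_filters):
    """Keep items whose core attributes equal every requested value."""
    return [it for it in items
            if all(it["core"].get(k) == v for k, v in core_filters.items())]


def key_value_filter(items, raw_filters):
    """Keep items whose raw metadata contains every requested key-value pair."""
    return [it for it in items
            if all(it["raw"].get(k) == v for k, v in raw_filters.items())]


def aggregate_counts(items, attribute):
    """Count items per value of one core attribute, as a running summary."""
    return Counter(it["core"].get(attribute) for it in items)


# Step 1: attribute-based search on the consolidated schema.
brca = data_search(ITEMS, project="TCGA-BRCA")
# Step 2: key-value filters on the raw metadata (hypothetical keys).
triple_negative = key_value_filter(
    brca,
    {"er_status": "negative", "pr_status": "negative", "her2_status": "negative"},
)
print([it["id"] for it in triple_negative])
print(aggregate_counts(brca, "project"))
```

Keeping the two filters as separate functions mirrors the separation between attribute-based and key-value search described in the abstract; the counts function stands in for the aggregate counts that are refreshed as filters are added.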
