Bioinformatics. 2013 May 15;29(10):1333-40.
doi: 10.1093/bioinformatics/btt141. Epub 2013 Apr 17.

A self-updating road map of The Cancer Genome Atlas

David E Robbins et al. Bioinformatics. 2013.

Abstract

Motivation: Since 2011, The Cancer Genome Atlas' (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium's (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.
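As a rough illustration of what discovering a subset of files by metadata could look like, the sketch below builds a SPARQL query and the matching GET request URL in the style of the W3C SPARQL protocol (the query is passed in a `query` parameter). The endpoint URL, the `tcga:` namespace IRI, and the property names `tcga:diseaseStudy` and `tcga:platform` are placeholders assumed for illustration; the abstract does not state them, and the actual values should be taken from the dashboard listed under Availability.

```python
from urllib.parse import urlencode

# Hypothetical values: neither the endpoint URL nor the namespace IRI
# is given in the abstract.
ENDPOINT = "http://example.org/sparql"
TCGA_NS = "http://example.org/tcga#"


def build_file_query(disease: str, platform: str) -> str:
    """Build a SPARQL query selecting the URLs of files that belong to
    one disease study and one platform, matched by rdfs:label.

    The labels are interpolated directly, so this sketch assumes they are
    simple identifiers without quotes or backslashes.
    """
    return f"""
PREFIX tcga: <{TCGA_NS}>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?url WHERE {{
  ?file a tcga:File ;
        tcga:url ?url ;
        tcga:diseaseStudy ?study ;
        tcga:platform ?platform .
  ?study rdfs:label "{disease}" .
  ?platform rdfs:label "{platform}" .
}}
""".strip()


def build_request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode the query as a GET request URL with a `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


query = build_file_query("gbm", "mda_rppa_core")
request_url = build_request_url(query)
```

Because the result is an ordinary URL, a query of this kind can be shared or cited alongside an analysis, which is one way the file-level traceability described above might be put into practice.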

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.

Contact: robbinsd@uab.edu.

Figures

Fig. 1.
Class diagram of the schema used to represent TCGA files in RDF. Each class may have the tcga:lastSeen, tcga:lastModified, tcga:firstSeen, tcga:url and rdfs:label properties. In addition to these, tcga:File resources have properties indicating the particular instances of the other resource types they belong to, such as the tcga:Archive resource they are contained in. The properties for linking tcga:Files to other classes are generated by using a lower case version of the class name (e.g. tcga:archive links a tcga:File and a tcga:Archive)
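The naming rule in this caption can be sketched in a few lines of Python. The caption only gives a single-word example (`tcga:Archive` to `tcga:archive`); for multi-word class names this sketch lowers just the first character, which is an assumption and not something the caption states. The function name `link_property` is hypothetical.

```python
# Properties that, per the caption, any class in the schema may carry.
COMMON_PROPERTIES = (
    "tcga:lastSeen",
    "tcga:lastModified",
    "tcga:firstSeen",
    "tcga:url",
    "rdfs:label",
)


def link_property(class_name: str) -> str:
    """Derive the property linking a tcga:File to an instance of class_name.

    Assumption: only the first character of the local name is lowered,
    so a multi-word class name would become lowerCamelCase.
    """
    prefix, sep, local = class_name.partition(":")
    if not sep or not local:
        raise ValueError(f"expected a prefixed name like 'tcga:Archive', got {class_name!r}")
    return f"{prefix}:{local[0].lower()}{local[1:]}"


archive_property = link_property("tcga:Archive")
```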
Fig. 2.
Example representation of metadata about a file in the TCGA using our schema. This portion of the RDF graph shows a file for the platform mda_rppa_core, from M.D. Anderson (mdanderson.org), which is a cancer genome characterization center (cgcc) in the glioblastoma disease study (gbm)
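To make the shape of such a graph fragment concrete, here is a minimal sketch that holds the values named in this caption as subject, predicate, object tuples. The blank-node identifiers (`_:file`, `_:platform`, and so on) and the property names `tcga:center`, `tcga:centerType` and `tcga:diseaseStudy` are assumptions made for illustration; the real graph would use the resource identifiers and property names shown in the figure itself.

```python
# Hypothetical identifiers and property names; only the label values
# (mda_rppa_core, mdanderson.org, cgcc, gbm) come from the caption.
triples = [
    ("_:file", "rdf:type", "tcga:File"),
    ("_:file", "tcga:platform", "_:platform"),
    ("_:platform", "rdfs:label", "mda_rppa_core"),
    ("_:file", "tcga:center", "_:center"),
    ("_:center", "rdfs:label", "mdanderson.org"),
    ("_:file", "tcga:centerType", "_:centerType"),
    ("_:centerType", "rdfs:label", "cgcc"),
    ("_:file", "tcga:diseaseStudy", "_:study"),
    ("_:study", "rdfs:label", "gbm"),
]


def labels_for(subject, graph):
    """Map each linking property of `subject` to the label of the linked node."""
    label_of = {s: o for s, p, o in graph if p == "rdfs:label"}
    return {p: label_of[o] for s, p, o in graph if s == subject and o in label_of}


file_metadata = labels_for("_:file", triples)
```

Looking up a file's metadata then amounts to following each linking property to a node and reading that node's `rdfs:label`.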
Fig. 3.
Flowchart representation of the algorithm used to scrape and update the TCGA open-access HTTP site into an RDF road map via a SPARQL endpoint
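The flowchart is not reproduced in this text, so the following is only one plausible reading of the update step, inferred from the `tcga:firstSeen`, `tcga:lastSeen` and `tcga:lastModified` properties of the schema. It replaces the SPARQL endpoint with an in-memory dictionary and takes the directory listing as an argument instead of scraping it; the names `Entry` and `update_index` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    url: str
    first_seen: str     # timestamp of the first scrape that found the URL
    last_seen: str      # timestamp of the most recent scrape that found it
    last_modified: str  # modification time reported by the directory listing


def update_index(index, listing, now):
    """Merge one scrape into the index and return a new index.

    index:   dict mapping URL -> Entry from earlier scrapes
    listing: dict mapping URL -> last-modified string from the current scrape
    now:     timestamp of the current scrape
    """
    updated = dict(index)
    for url, modified in listing.items():
        previous = index.get(url)
        if previous is None:
            # New file: first and last seen are both the current scrape.
            updated[url] = Entry(url, now, now, modified)
        else:
            # Known file: keep first_seen, refresh the other two fields.
            updated[url] = Entry(url, previous.first_seen, now, modified)
    # URLs absent from the listing keep their old last_seen value, so
    # files that have disappeared remain traceable in the index.
    return updated


index = update_index({}, {"a.txt": "2012-01-01"}, now="2012-02-01")
index = update_index(index, {"a.txt": "2012-03-01", "b.txt": "2012-03-02"}, now="2012-04-01")
```

Keeping entries for files that no longer appear is a design choice of this sketch: it lets a result produced from an older file still be traced to a record of that file.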
Fig. 4.
Snapshot of a dashboard element showing the logarithmic progression of the number of files available to the public in the TCGA, counted using the SPARQL endpoint reported here. The count shows a sustained doubling every 7 months since March 2010 and now exceeds half a million individual files in the public domain. The query used to generate the results may be accessed at http://bit.ly/FilesByDate, and a live, automatically updating version of this figure is available at http://bit.ly/TCGARoadmap
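For a sense of scale, a constant doubling time of 7 months corresponds to exponential growth of the form N(t) = N0 * 2^(t/7), with t in months. The short sketch below works through that arithmetic; it is an idealized model under the assumption that the doubling time stays fixed, not a forecast of actual TCGA file counts, and the function names are hypothetical.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling time stated in the caption


def growth_factor(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Multiplicative growth over `months` under a constant doubling time."""
    return 2.0 ** (months / doubling_months)


def months_to_grow(factor: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed for the count to grow by `factor` under the same model."""
    return doubling_months * math.log2(factor)


annual_factor = growth_factor(12)         # a little under 3.3x per year
months_for_tenfold = months_to_grow(10)   # roughly 23 months
```

Under this model the file count would more than triple each year, which is why a static listing of the repository goes out of date quickly.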
Fig. 5.
Snapshot of a dashboard element showing the results of querying the TCGA contents for relationships between platforms and disease studies, shown here as a bipartite graph. In this figure, lines between disease studies (on the left) and data analysis platforms (on the right) indicate that the disease study contains files generated by the linked platform. Uterine corpus endometrioid carcinoma (coded ucec in TCGA) is highlighted, as well as the lines for the 17 platforms used to process samples within that study. The query used to retrieve this data is available at http://bit.ly/PlatformsByDisease, with an interactive version of the visualization of the data available at http://bit.ly/TCGARoadmap
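The bipartite structure described in this caption can be assembled from a flat list of (disease study, platform) pairs, such as the rows a SPARQL query might return. The sketch below shows one way to do that. Apart from the `gbm` and `mda_rppa_core` pairing mentioned in Fig. 2, the rows are made-up placeholders and not actual TCGA contents, and `build_bipartite` is a hypothetical helper.

```python
from collections import defaultdict


def build_bipartite(pairs):
    """Group (disease, platform) pairs into the two sides of a bipartite graph.

    Returns (by_disease, by_platform), each a dict mapping a node on one
    side to the set of nodes it is linked to on the other side.
    """
    by_disease = defaultdict(set)
    by_platform = defaultdict(set)
    for disease, platform in pairs:
        by_disease[disease].add(platform)
        by_platform[platform].add(disease)
    return dict(by_disease), dict(by_platform)


# Placeholder rows for illustration only.
rows = [
    ("gbm", "mda_rppa_core"),
    ("gbm", "platform_a"),
    ("ucec", "platform_a"),
    ("ucec", "platform_b"),
]
by_disease, by_platform = build_bipartite(rows)

# Highlighting one disease study, as in the figure, is a dictionary lookup.
ucec_platforms = sorted(by_disease["ucec"])
```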
