Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2016 Oct 27;11(10):e0165395.
doi: 10.1371/journal.pone.0165395. eCollection 2016.

TCGA Expedition: A Data Acquisition and Management System for TCGA Data

Affiliations

TCGA Expedition: A Data Acquisition and Management System for TCGA Data

Uma R Chandran et al. PLoS One. .

Abstract

Background: The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile at least 500 cases of 20 different tumor types using genomic platforms and to make these data, both raw and processed, available to all researchers. TCGA data are currently over 1.2 Petabyte in size and include whole genome sequence (WGS), whole exome sequence, methylation, RNA expression, proteomic, and clinical datasets. Publicly accessible TCGA data are released through public portals, but many challenges exist in navigating and using data obtained from these sites. We developed TCGA Expedition to support the research community focused on computational methods for cancer research. Data obtained, versioned, and archived using TCGA Expedition supports command line access at high-performance computing facilities as well as some functionality with third party tools. For a subset of TCGA data collected at University of Pittsburgh, we also re-associate TCGA data with de-identified data from the electronic health records. Here we describe the software as well as the architecture of our repository, methods for loading of TCGA data to multiple platforms, and security and regulatory controls that conform to federal best practices.

Results: TCGA Expedition software consists of a set of scripts written in Bash, Python and Java that download, extract, harmonize, version and store all TCGA data and metadata. The software generates a versioned, participant- and sample-centered, local TCGA data directory with metadata structures that directly reference the local data files as well as the original data files. The software supports flexible searches of the data via a web portal, user-centric data tracking tools, and data provenance tools. Using this software, we created a collaborative repository, the Pittsburgh Genome Resource Repository (PGRR) that enabled investigators at our institution to work with all TCGA data formats, and to interrogate these data with analysis pipelines, and associated tools. WGS data are especially challenging for individual investigators to use, due to issues with downloading, storage, and processing; having locally accessible WGS BAM files has proven invaluable.

Conclusion: Our open-source, freely available TCGA Expedition software can be used to create a local collaborative infrastructure for acquiring, managing, and analyzing TCGA data and other large public datasets.

PubMed Disclaimer

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. TCGA Expedition file name components.
Fig 2
Fig 2. TCGA Expedition Directory Structure.
Fig 3
Fig 3. Architecture of the Pittsburgh Genome Resource Repository.

References

    1. van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C. Ten years of next-generation sequencing technology. Trends Genet. 2014;30(9):418–26. Epub 2014/08/12. 10.1016/j.tig.2014.07.001 . - DOI - PubMed
    1. Mardis ER. Next-generation sequencing platforms. Annu Rev Anal Chem (Palo Alto Calif). 2013;6:287–303. 10.1146/annurev-anchem-062012-092628 . - DOI - PubMed
    1. Cancer Genome Atlas Project (TCGA) [cited 2015 December 9]. Available from: http://cancergenome.nih.gov/.
    1. Network T. Integrated genomic analyses of ovarian carcinoma. Nature. 2011;474:609–15. 10.1038/nature10166 - DOI - PMC - PubMed
    1. Network T. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61–70. Epub 2012/09/25. 10.1038/nature11412 ; PubMed Central PMCID: PMCPmc3465532. - DOI - PMC - PubMed