F1000Res. 2020 May 19:9:376.
doi: 10.12688/f1000research.23180.2. eCollection 2020.

Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive

Matthew N Bernstein et al.

Abstract

The Sequence Read Archive (SRA) is a large public repository that stores raw next-generation sequencing data from thousands of diverse scientific investigations. Despite its promise, reuse and re-analysis of SRA data have been hindered by the heterogeneity and poor quality of the metadata that describe its biological samples. Recently, the MetaSRA project standardized these metadata by annotating each sample with terms from biomedical ontologies. In this work, we present a pair of Jupyter notebook-based tools that use the MetaSRA to build structured datasets from the SRA, in order to facilitate secondary analyses of the SRA's human RNA-seq data. The first tool, called the Case-Control Finder, finds suitable case and control samples for a given disease or condition, where the cases and controls are matched by tissue or cell type. The second tool, called the Series Finder, finds ordered sets of samples for addressing biological questions about changes over a numerical property such as time. These tools were the result of a three-day NCBI Codeathon held in March 2019 at the University of North Carolina at Chapel Hill.

Keywords: Hackathon; Jupyter; MetaSRA; Metadata; Ontology; RNA-seq; Sequence Read Archive.
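The matching idea behind the Case-Control Finder can be illustrated with a minimal Python sketch. This is not the authors' implementation: the sample records, the `TISSUE_TERMS` set, and the function `match_case_control` are hypothetical names invented here, and the sketch assumes each sample is already annotated with a flat set of ontology term names.

```python
from collections import defaultdict

# Hypothetical toy annotations: sample ID -> set of ontology term names.
# Real MetaSRA annotations are richer; this layout is an assumption.
SAMPLES = {
    "S1": {"liver", "liver cancer"},
    "S2": {"liver"},
    "S3": {"liver", "liver cancer"},
    "S4": {"blood"},
    "S5": {"blood", "liver cancer"},
    "S6": {"kidney", "liver cancer"},
}

# Hypothetical set of terms treated as tissue or cell type labels.
TISSUE_TERMS = {"liver", "blood", "kidney"}


def match_case_control(samples, condition, tissue_terms):
    """Group samples by tissue and keep tissues that have cases and controls.

    A sample counts as a case if it carries the condition term,
    and as a control otherwise.
    """
    cases = defaultdict(list)
    controls = defaultdict(list)
    for sample_id, terms in sorted(samples.items()):
        target = cases if condition in terms else controls
        for tissue in terms & tissue_terms:
            target[tissue].append(sample_id)
    # Only tissues present on both sides give a usable matched comparison.
    return {
        tissue: {"case": cases[tissue], "control": controls[tissue]}
        for tissue in sorted(cases)
        if tissue in controls
    }


matched = match_case_control(SAMPLES, "liver cancer", TISSUE_TERMS)
for tissue, groups in matched.items():
    print(tissue, len(groups["case"]), "cases,", len(groups["control"]), "controls")
```

In this toy data, "kidney" is dropped because it has a case sample but no control, which mirrors the requirement that cases and controls be matched by tissue or cell type.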


Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. Data flows for hypothesis-driven query tools.
An overview of the backend processing functions called from the Jupyter notebooks.
Figure 2. Example results from the Case-Control Finder.
Results from running the Case-Control Finder for the query “liver cancer.” (A) The Case-Control Finder displays the number of case/control samples matched by each tissue and cell type. (B) The user can select either the case samples or control samples for a given tissue or cell type and display the most common ontology terms associated with those selected samples. Displayed here are the most common terms associated with the case samples labeled as “liver.” (C) The notebook also displays four pie charts for viewing the fraction of samples belonging to a cell line (top left), each sex (top right), each developmental stage (bottom left), and whether they were given an experimental treatment (bottom right).
Figure 3. Example results from the Series Finder.
Results from running the Series Finder for the query “brain” sorted by “age,” where unit is specified as “year.” (A) The Series Finder displays the number of samples sorted by age. (B) The user can select samples associated with a given time point for further exploration. Here the samples annotated as “year = 63” are selected. The notebook then displays four pie charts for viewing the fraction of samples belonging to a cell line (top left), each sex (top right), each developmental stage (bottom left), and whether they were given an experimental treatment (bottom right). (C) Given the selected samples from (B), the notebook displays the most frequent terms associated with those selected samples.
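The kind of query shown in Figure 3 (samples for a term, ordered by a numerical property with a given unit) can be sketched in a few lines of Python. This is an illustrative sketch only, not the code behind the Series Finder: the record layout and the function `find_series` are hypothetical, and it assumes each sample carries a list of (property, value, unit) triples.

```python
from collections import defaultdict

# Hypothetical toy records; the field names are assumptions for illustration.
SAMPLES = [
    {"id": "S1", "terms": {"brain"}, "properties": [("age", 63.0, "year")]},
    {"id": "S2", "terms": {"brain"}, "properties": [("age", 40.0, "year")]},
    {"id": "S3", "terms": {"brain"}, "properties": [("age", 63.0, "year")]},
    {"id": "S4", "terms": {"liver"}, "properties": [("age", 63.0, "year")]},
    {"id": "S5", "terms": {"brain"}, "properties": [("age", 6.0, "month")]},
    {"id": "S6", "terms": {"brain"}, "properties": []},
]


def find_series(samples, term, prop, unit):
    """Return (value, sample IDs) pairs sorted by the numerical property.

    Only samples annotated with `term` and carrying `prop` measured in
    `unit` are kept; other units are skipped rather than converted.
    """
    groups = defaultdict(list)
    for sample in samples:
        if term not in sample["terms"]:
            continue
        for name, value, sample_unit in sample["properties"]:
            if name == prop and sample_unit == unit:
                groups[value].append(sample["id"])
    return sorted(groups.items())


series = find_series(SAMPLES, "brain", "age", "year")
for value, sample_ids in series:
    print(value, sample_ids)
```

Skipping samples whose unit differs (here, the one recorded in months) keeps the ordering meaningful without guessing at unit conversions; a fuller tool might convert units instead.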


